Using SAS Data Integration Studio to Load Covid-19 Case Surveillance Data
03/24/2021 by Anthony Edmundson
As Covid-19 cases rise around the globe and soar in the U.S., Covid-19 surveillance programs are being implemented to track the number of cases. It has been reported that, starting January 26, 2021, the CDC will require a negative Covid-19 test result for travelers to enter the U.S.
Its centralized metadata repository, its support for integrating data from multiple disparate sources, and its ability to load data into multiple target databases make SAS Data Integration Studio 4.9 an excellent choice for loading Covid-19 case surveillance data.
SAS Data Integration Studio eases the difficulty of extracting, transforming, and loading Covid-19 case surveillance data into common target databases.
Let us discuss a few things to keep in mind when designing and developing a SAS DI job to load Covid-19 case surveillance data.
Target data types will be defined as part of Step 1. One factor in determining the target data types is understanding how the field will be utilized. Will the field be used in a static report? Will the field be used in an ad-hoc query? Will the field be used to generate statistics?
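The kind of profiling that informs these decisions can be sketched outside DI Studio. The sketch below is plain Python, not a DI Studio feature: it scans sample values for a field and suggests a target type, flagging date-like fields (the cdc_report_dt field in the public-use file uses a YYYY/MM/DD style) and sizing character fields to the longest value seen. The function name and date format are illustrative assumptions.

```python
from datetime import datetime

def suggest_target_type(values):
    """Suggest a target type for a field based on sample values.

    Returns ("date",) if every non-missing value parses as a
    YYYY/MM/DD date, otherwise ("char", max_length) sized to the
    longest value observed.
    """
    def is_slash_date(v):
        try:
            datetime.strptime(v, "%Y/%m/%d")
            return True
        except ValueError:
            return False

    non_missing = [v for v in values if v]
    if non_missing and all(is_slash_date(v) for v in non_missing):
        return ("date",)
    return ("char", max((len(v) for v in non_missing), default=1))

print(suggest_target_type(["2020/03/15", "2020/04/01"]))   # ('date',)
print(suggest_target_type(["Male", "Female", "Unknown"]))  # ('char', 7)
```

A profile like this is only a starting point; how the field will be used in reports, ad-hoc queries, or statistics should still drive the final type choice.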
A few common best practices apply when defining target data types, such as sizing character fields to the longest value expected from the source and storing dates as numeric date values rather than character strings.
During DI job testing, the user may find that fields are not named or defined as designed or desired. As part of Step 3 or Step 6, it may be necessary to use the Impact Analysis and Reverse Impact Analysis tools found under Analyze.
These tools can speed up the process of updating the metadata for such fields.
The Impact Analysis tool lists the table, column, and job metadata objects downstream that reference the selected field. Downstream objects are the objects that come after the selected table metadata object in the flow; they will be affected by changes to the selected metadata object.
The Reverse Impact Analysis tool lists the table, column, and job metadata objects upstream that reference the selected field. Upstream objects are the objects that come before the selected table metadata object in the flow; they contribute to the selected metadata object.
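Conceptually, both tools walk the same metadata lineage graph in opposite directions. The sketch below is plain Python over a made-up lineage graph, not the DI Studio API; the object names mirror the job discussed in this article but are illustrative.

```python
from collections import deque

# Hypothetical lineage: each metadata object maps to the objects it feeds.
lineage = {
    "csv_file": ["file_reader"],
    "file_reader": ["sort"],
    "sort": ["table_loader"],
    "table_loader": ["staging_table"],
}

def downstream(obj):
    """Impact Analysis: objects affected by changes to obj."""
    seen, queue = set(), deque(lineage.get(obj, []))
    while queue:
        nxt = queue.popleft()
        if nxt not in seen:
            seen.add(nxt)
            queue.extend(lineage.get(nxt, []))
    return seen

def upstream(obj):
    """Reverse Impact Analysis: objects that contribute to obj."""
    parents = {child: src for src, kids in lineage.items() for child in kids}
    seen = set()
    cur = parents.get(obj)
    while cur is not None:
        seen.add(cur)
        cur = parents.get(cur)
    return seen

print(sorted(downstream("sort")))  # ['staging_table', 'table_loader']
print(sorted(upstream("sort")))    # ['csv_file', 'file_reader']
```

DI Studio maintains this lineage automatically from the job's metadata; the value of the two tools is that the user never has to reconstruct it by hand.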
To use Impact Analysis, right-click the table metadata object in the DI job flow and select Analyze.
Select the Impact Analysis tab. Right-click the table name, then select Analyze Columns.
Move the available column(s) to be analyzed to the Selected Column list, then press OK. A list of all the metadata objects downstream from the Sort transformation that reference the cdc_report_dt field will be displayed.
To use Reverse Impact Analysis, right-click the table metadata object in the DI job flow and select Analyze.
Select the Reverse Impact Analysis tab. Right-click the table name, then select Analyze Columns.
Move the available column(s) to be analyzed to the Selected Column list, then press OK. A list of all the metadata objects upstream from the Sort transformation that contribute to the cdc_report_dt field will be displayed.
The Impact Analysis and Reverse Impact Analysis tools are extremely helpful when applying fixes to large DI jobs with many metadata objects. Using them decreases the amount of time the user spends 'poking around' looking for the metadata objects they wish to change.
When Include Mappings is checked, DI Studio will attempt to automatically map fields between source and target objects.
There are two things to remember about automatically mapping fields between source and target objects: columns with matching names may still generate warnings, and columns with different names will not be mapped automatically.
In the screenshot below, the sex column was automatically mapped to the sex column in the target, but a warning was generated. Also, race_and_ethnicity should be mapped to race_and_ethnicity_combined but was not mapped because the column names are different.
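The name-based behavior is easy to mimic. The sketch below is plain Python, not DI Studio: it auto-maps columns by exact name match, which reproduces both outcomes described above, and then adds the manual mapping for the renamed column, as one would on the Mapping tab.

```python
def auto_map(source_cols, target_cols):
    """Map source to target columns by exact name match, mimicking
    DI Studio's Include Mappings behavior."""
    return {c: c for c in source_cols if c in target_cols}

source = ["cdc_report_dt", "sex", "race_and_ethnicity"]
target = ["cdc_report_dt", "sex", "race_and_ethnicity_combined"]

mappings = auto_map(source, target)
print(mappings)  # {'cdc_report_dt': 'cdc_report_dt', 'sex': 'sex'}

# Renamed columns must be mapped by hand, as on the Mapping tab.
mappings["race_and_ethnicity"] = "race_and_ethnicity_combined"
```

Note that a name match alone says nothing about lengths or types, which is why a matched column such as sex can still carry a warning that has to be reviewed.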
The image below captures the completed DI job for loading Covid-19 case surveillance data from a CSV file, downloaded from a data provider, into a SAS dataset.
For discussion purposes, we will cover loading Covid-19 case surveillance data into a SAS Dataset. The following steps will walk readers through the process for loading Covid-19 data.
Determine and define source data to be extracted. This includes data sets, flat files and the fields within each dataset/flat file, and the format of each field before movement. The frequency of data availability is also noted.
Determine and define target data to be loaded. This includes data sets, the fields within each dataset/table, and the format of each field after movement. The frequency of data transfer is also defined.
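These two design decisions can be recorded in a simple, reviewable form before any job is built. The sketch below is plain Python with a hypothetical subset of fields and formats; it is one way to capture the design, not a DI Studio artifact.

```python
# Hypothetical design record for a subset of fields: source and target
# formats plus the agreed frequencies, captured before the job is built.
design = {
    "source": {
        "file": "covid19_case_surveillance.csv",
        "frequency": "as published by the data provider",
        "fields": {"cdc_report_dt": "character date string",
                   "sex": "character"},
    },
    "target": {
        "table": "covid_case_surveillance_staging",
        "frequency": "per DI job run",
        "fields": {"cdc_report_dt": "numeric SAS date",
                   "sex": "character(7)"},
    },
}

# Sanity check: every source field has a defined target destination.
assert set(design["source"]["fields"]) <= set(design["target"]["fields"])
```

A record like this makes the later mapping and testing steps mechanical: the Mapping tab should agree with the design, and any mismatch found in testing points back to one of these entries.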
It should be noted that the source data for this discussion was downloaded from https://www.kaggle.com/arashnic/covid19-case-surveillance-public-use-dataset.
The Covid-19 Case Surveillance Public CSV file was downloaded and stored locally. The CSV file was registered in SAS metadata, then the File Reader transformation was used to extract the data from the CSV file and load it into a temporary SAS dataset.
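The File Reader step is GUI-driven, but its effect, reading the registered CSV into a temporary working table, can be sketched in plain Python. The file name in the commented call is a hypothetical local copy, not a path from the article.

```python
import csv

def extract_csv(path, limit=None):
    """Read a CSV file into an in-memory list of row dicts, analogous
    to the File Reader loading rows into a temporary SAS dataset."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return rows[:limit] if limit is not None else rows

# Hypothetical local copy of the downloaded file:
# work_table = extract_csv("covid19_case_surveillance.csv")
```

As in DI Studio, the column names from the file's header row become the field names of the working table, which is what the downstream mapping step operates on.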
The mappings defined during design are applied on the Mapping Tab in the Table Loader transformation properties.
Map source fields to target destination fields.
The Table Loader transformation (4) maps the source fields from the Sort transformation (3) output table to the target fields in the Covid-19 Case Surveillance staging table.
Apply 1-to-1 mappings, transformations/conversions, formulas, or rules captured in the requirement and design.
The Table Loader loads the Covid-19 case surveillance data into the target table.
There are a few Load Styles: Replace, Append to Existing, and Update/Insert. This Table Loader is set up to delete and replace the Covid-19 case surveillance data table each time the DI job is run.
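The three load styles behave quite differently, and the sketch below makes the difference concrete. It is plain Python over in-memory lists of dicts, not the Table Loader itself, and the id key column is a hypothetical addition for the update/insert case.

```python
def load(target, new_rows, style, key="id"):
    """Apply a Table Loader-style load to an in-memory target table."""
    if style == "replace":
        # Replace: the target is deleted and rebuilt from the new rows.
        return list(new_rows)
    if style == "append":
        # Append to Existing: new rows are added unconditionally.
        return target + list(new_rows)
    if style == "upsert":
        # Update/Insert: update rows with a matching key, insert the rest.
        merged = {row[key]: row for row in target}
        for row in new_rows:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"unknown load style: {style}")

existing = [{"id": 1, "sex": "Male"}]
incoming = [{"id": 1, "sex": "Female"}, {"id": 2, "sex": "Unknown"}]
print(len(load(existing, incoming, "replace")))  # 2
print(len(load(existing, incoming, "append")))   # 3
print(len(load(existing, incoming, "upsert")))   # 2
```

Replace is the right choice here because the provider republishes the full surveillance file; append would duplicate history, and update/insert would require a reliable key in the source data.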
Test to verify that the job runs successfully. Ensure the end-to-end data flow works correctly. Then adjust and retest as necessary.
Press the Run button to test the DI job.
Open and review the output table loaded by the Table Loader transformation.
Once testing is complete and it has been verified that the data transformations are working as designed, deploy the DI job and schedule a migration or integration go-live.
Deploying the DI job creates a SAS program from the metadata objects in the deployed DI job data flow.
To deploy the DI job to run in batch, right-click the DI job, select Scheduling, then select Deploy…
Select the deployment directory in which to store the deployed SAS program generated from the DI job. The default directory set up during the SAS installation and configuration can be selected, or a new deployment directory can be created.
Going forward, the data integration job, the deployment, and/or the data mappings will require updates as new data sources are added, as data sources change, or as targets change.
We touched on a few items to remember when designing and developing DI jobs. Then we covered the common steps to extract Covid-19 data from a CSV file and load the data into a SAS dataset. I encourage you to try these steps in your own development environment. If you have questions or need assistance, contact Zencos. We will be happy to help.