
Validate the Unique Grain of Incoming ETL Data

02/10/2015

Data validation is critical with any ETL process, no matter what solution or software package is used. Declaring the grain of data is a major step in the design process. In other words, you should be able to describe what a single record in a table represents. For instance, “each record in this fact table represents a bank transaction at a given point in time for a customer.”

This blog post demonstrates how to use two Base SAS procedures to check whether data truly contains unique values at a specified grain. This is useful during an ETL process in SAS because appropriate steps can be taken to stop the process or to prevent duplicate data from being loaded.

Identifying the Grain of Data

As a part of the design process for any data modeling exercise, it is important to understand the grain of data. With a dimensional model approach, declaring the grain is a critical step. As previously mentioned, you should be able to clearly describe what each record represents in a table. In some cases, this is straightforward and may just be the combination of a few fields in the table.

In other cases, it can be more complex, especially early in an ETL process when working with raw data or cryptic data sources. From experience, I suggest always attempting to identify the grain of time. If there is no variable representing a date or date-time, the data may be a “snapshot”. In that case, add a column representing the “as of” date of the snapshot if one does not already exist.

While this is not technically necessary for defining a unique primary key (the added column contains only a single value, so it does not change the uniqueness of the data), it provides a logical understanding of the nature of time in the data.
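As a minimal sketch of this idea, the “as of” date can be stamped onto a snapshot dataset in a simple DATA step. The dataset and column names here (daily_balances, snapshot_date) are hypothetical, not from the examples below:

```sas
/* Hypothetical example: stamp a snapshot dataset with an "as of" date. */
/* daily_balances is an assumed input; snapshot_date is the new column. */
data daily_balances_stamped;
   set daily_balances;
   format snapshot_date date9.;
   snapshot_date = today();  /* or the batch/run date passed to the job */
run;
```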

The following sample data will be used to demonstrate how to confirm the uniqueness of data:

[Image: sample transactions used for the PROC FREQ validation]

This simple example dataset represents credit card transactions for a given account_number and transaction_date. Since there can be multiple transactions within a single day, we need to identify the true grain of the data. Let’s confirm this by using PROC FREQ and PROC SUMMARY.

Using PROC FREQ to Validate the Grain of Data

PROC FREQ is one of the core procedures in SAS and can be used for many statistical functions. In this example, PROC FREQ counts the frequency of a specified aggregate slice, or crossing, of the incoming source data. The slice of data, specified in the TABLES statement (with the variables joined by asterisks to request the crossing), consists of the columns that represent the unique grain of the data.

proc freq data=source;
   tables transaction_date*account_number / noprint out=source_primary_key_count;
run;

/* Any record in this table indicates duplication at the specified grain */
data err_duplicate_pk_values;
   set source_primary_key_count;
   where COUNT > 1;
run;

The previous code tests whether account_number and transaction_date represent the unique grain of the incoming source data. The output shown below indicates this is not the case. If any records exist in the err_duplicate_pk_values table, there is duplication at the specified grain (the columns crossed in the TABLES statement).

[Image: multiple crossings of credit card transactions]

Adding trans_id (not shown in the sample data above) defines the true unique grain of the data, because there can be multiple transactions for a given account_number on a given transaction_date. The following code confirms this by adding trans_id to the TABLES statement of the PROC FREQ step.

proc freq data=source;
   tables transaction_date*account_number*trans_id
      / noprint out=source_primary_key_count;
run;

data err_duplicate_pk_values;
   set source_primary_key_count;
   where COUNT > 1;
run;

Using PROC SUMMARY to Validate the Grain of Data

If the table is too large for PROC FREQ to process (it builds its crosstabulation in memory), PROC SUMMARY should be able to handle it. When working with “big data”, both procedures will take longer to complete, but I would stick with PROC SUMMARY because it handles larger data volumes much better. Check out the procedure’s performance options to tune this step of the process.

The validation process is straightforward and similar to the previous PROC FREQ example. Use the CLASS statement of PROC SUMMARY to specify the columns that represent the unique grain of the data.

proc summary data=source nway missing threads;
   class transaction_date account_number trans_id;
   output out=source_primary_key_count(rename=(_freq_=Count));
run;

data err_duplicate_pk_values;
   set source_primary_key_count;
   where Count > 1;
run;

In this example, the output dataset err_duplicate_pk_values contains no records, because the PROC SUMMARY output has no observations with Count > 1. Therefore the combination of transaction_date, account_number, and trans_id represents the unique grain of the source data!

Validating Data During an ETL Process

The logic demonstrated in this blog post can be extended to support an actual ETL process using SAS macro programming. Key inputs and outputs can be “macro-tized” to create a flexible macro function that guides the completion of an ETL process. If the defined unique grain of the data fails to validate, the process should be aborted to prevent data quality issues downstream.
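A minimal sketch of such a macro is shown below, built on the PROC SUMMARY approach from this post. The macro name, parameters, and the work table _pk_count are assumptions for illustration, not part of the original code; the abort behavior is one possible policy:

```sas
/* Sketch of a reusable grain-validation macro.                        */
/* data= is the dataset to check; keys= is the space-separated list of */
/* columns that should form the unique grain.                          */
%macro validate_grain(data=, keys=);
   proc summary data=&data nway missing threads;
      class &keys;
      output out=_pk_count(rename=(_freq_=Count));
   run;

   /* Count duplicate crossings into a macro variable */
   proc sql noprint;
      select count(*) into :dup_count trimmed
      from _pk_count
      where Count > 1;
   quit;

   %if &dup_count > 0 %then %do;
      %put ERROR: &dup_count duplicate crossings found for grain (&keys) in &data.;
      %abort cancel;  /* stop the ETL session to prevent loading duplicates */
   %end;
%mend validate_grain;

/* Example call using the grain confirmed in this post */
%validate_grain(data=source, keys=transaction_date account_number trans_id);
```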