Tame your Big Data with SAS Coding Efficiencies
01/06/2020 by Ben Murphy Support
Working with big data comes with its own set of challenges. Large data sets with millions of rows and columns often take a very long time to process. The impact of this lengthy processing time is that often businesses have to wait long periods of time to retrieve the results. SAS programming language’s flexibility actually allows multiple ways to code a solution. However, often enough, the most obvious solution won’t be the most efficient one.
You’ve likely experienced code execution taking forever to run. In most cases, you set those jobs to run at night so as not to waste a day of work. But what if we could optimize performance and run the said jobs in a time period that’s more conducive to a day’s work?
Explore the following general coding efficiencies, learn how SAS stores data, and take a peek at the tactical technique you could use to work with the database.
There are generally accepted methods of programming efficiency which have been expanded on by other papers. The following programming techniques are mentioned in Lafler’s paper, but most often cited by others. These techniques are about reducing your overall programming time.
This is not an exhaustive list, but merely a few ideas of easy ways to be more efficient.
With SAS datasets, sometimes you have data items that need to have a long length to incorporate large values but most of the values might only need a fraction of that length. When it comes to storing that dataset, SAS makes room for the total length of each value, regardless of how much space the actual data requires. This can be very inefficient since you are using up storage for blank space in your data file. Using the compress option on the dataset or program can alleviate this problem.
An example of how compression works in SAS is shown in the figure to the right. The dataset shown has character lengths of 20, 50, and 10 for the data columns “name”, “address”, and “phone”, respectively. The Uncompressed section shows how the data file would look if each of the values were to take up their entire allocated length, no matter what the values were. In the Compressed section, you can see that each of the values only take up the space of their value (for example Ryan = 4-character length). The blue space in the diagram represents the storage space that is saved by using compression and can be used to store other values.
As an example to see how much space can be saved, we are going to use the compress option on the SASHELP.BIRD dataset.
This dataset only has four data items, but one column, text, has a length of 1200 so that it can include long values even though most values only use around 20-50 characters. The code shown to the left is the code that we executed. After execution, we ran a PROC COMPARE on each dataset. The birdNoCompress dataset had a file size of 768KB and the birdCompress dataset was only 192KB, the compressed dataset was about 75% smaller than the original!
Now there are a few things to note about using compression before using the (compress=YES) option of the data step for every program:
It can be set as on a per-dataset basis (see BIRD example above) or as a system option. The system option compresses every dataset that is created in your program.
You can only compress numeric or character values in a dataset, not both. “compress=YES” and “compress=CHAR” compresses character values only; “compress=BINARY” does numeric values.
Using compression decreases I/O and storage space when processing, which is good, but it also increases the program runtime, since the dataset must be uncompressed before being processed. For this reason, compression should be selectively used when working with larger datasets.
The REUSE= dataset or system option needs to be set to YES in order to use the freed-up space for new observations. If the option is not set (default is NO), then new observations are simply appended to the data file.
If your data is not stored in a database, then tuning the SAS datasets can be helpful. One method to improve the query time from SAS datasets is to add an index. Indexing gives a SAS dataset direct access to specified observations. Instead of reading every record, SAS can quickly find the requested observations. In some cases, the technique reduced the query time by 95%.
SAS Indexes come in two forms, simple and composite. A simple index is created on one key variable while a composite index is created on multiple variables. Indexes are typically used if you are filtering data frequently or joining columns.
Indexes can be created in PROC DATASETS, PROC SQL and the DATA statement. Here’s an example using the DATASETS procedure. An index is added to the SALES_REP data set.
SAS indexes return observations in sorted order. This eliminates the need to sort the data set in the subsequent data processing. Indexes may not be appropriate for all situations, so refer to the SAS documentation for more detailed information.
It is difficult to quantify the added value to an organization when code executes in a shorter time period. Whatever the case is — whether you’re testing code changes or updating a system process — if it runs in a few hours (or quicker!) rather than overnight, you not only save time but also money.
Using one or more of the coding efficiencies mentioned above, you can run jobs in a shorter time frame.