blog

SAS Coding: PROC SQL Versus DATA Step

01/31/2016 by Jaime D’Agord Modernization - Analytics

The SAS programming language has a matchup between two processes that always have programmers questioning which one they should use. When creating, calculating, or doing any data manipulation in SAS, the most popular ways of programming these actions is through the DATA step or SQL procedure.

Each one has its own route in how to handle data and those differences, among others, are broken down below.

History – How these processes got started in SAS

The first formal release of SAS was in 1972, which included the DATA step and a few other original procedures but not the SQL procedure. Meanwhile in the 1970s relational databases started taking off and the SQL language was developed by researchers at IBM as a way to program within a database.

Throughout the 1980s, SQL became so popular among all relational databases that the American National Standards Institute (ANSI) developed standards so that it would remain consistent across all platforms. With the early releases of SAS 6 in the 1990s, SAS introduced the SQL procedure into the base language so that the SQL language was available as well as the DATA step for their programming. This new enhancement to the language allowed traditional SAS programmers to continue using the DATA step while also letting programmers with a database background to easily pick up the SAS language.

How does SAS process data?

Both methods are ways to do to create and manipulate data, but how after decades is one not much more accepted than the other? Let’s take a look at how each one takes in and processes data:

DATA Step

The DATA step is a sequential process, it brings in, calculates, and outputs a single record at a time. It starts out by taking records from the data source in the SET statement and loading them into the input buffer. Once the buffer is loaded, records start to get processed in the program data vector (PDV) which is a temporary placement for the record so that any calculations can be computed. From here the records are then loaded sequentially into the output buffer. Once the buffer is filled, then the records are then populated to the output SAS data set described in the DATA statement.

SQL Procedure:

If the DATA step is a sequential process by the record, then it is best to think of the SQL procedure as a simultaneous process for all the records. When a proc SQL statement is executed, the first step is the SQL optimizer scans the query inside the statement. The optimizer then intelligently strategizes how to execute the code. Once it has come up with a plan, then any tables in the FROM statement are loaded into the data engine where they can then be accessed in memory. Temporary tables are also created in memory where the code can then place data as it is calculating and building the results. Once the final table is created in memory, it is then sent to the output table described in the CREATE TABLE statement.

How do you strategize when coding each process?

As the programmer, you are the coach with how both of these processes handle the data. Each process needs to know what data is coming in, the calculations needed to be done, and which records to output. However, each process has a big difference in the philosophy of how you code it.

With the DATA step, you are giving the program instructions on how you want the data to be processed. Aside from the WHERE statement, which will be covered later, each statement within the data step is processed sequentially in a similar way that the data is inputted. The DATA step can best be thought of as a continuous loop that processes each record and statement one at a time.

When programming with the SQL procedure, you are coding what you want as an end result. The SELECT statement declares the output fields, the FROM statement states the input data, and all other statements are manipulative processes. This is all due to the SQL optimizer. When you have a procedure that will develop the process for you, it is easy to just put down what you want and let the intelligence do the rest.

Which one performs faster in SAS?

In deciding which process to use for programming, it would be good to have an idea of how fast each one can process a SAS data set. In order to understand the efficiency of each, a sample dataset of NFL player statistics was used as a test case. Three variable calculations were also added into the code and data so that it wouldn’t just be taking one dataset and moving it to a new location. Here is the code for each process:

DATA Step:                                                            SQL procedure:

           

The processing of data here is straight forward, just data in, calculate, and data out. This dataset also had only 238 records for the current players. That isn’t enough to notice a difference so the dataset was stacked on itself 100, 1000, etc. times to get large enough sets to work with. After running each one of the tests, here are the results:

As you can notice from the table, the SQL procedure fared better with the smaller datasets and then the data step excelled with the larger ones. This can be attributed to how each process handles the data in memory. The DATA step handles each record sequentially so it is never using a lot of memory, however, it does take time to process one at a time. So with a smaller dataset, the DATA step is going to take more time sending each record through. With the SQL procedure, everything is loaded up into memory at once.

By doing this, the SQL procedure can process small datasets rather quickly since everything is available in memory. Conversely, when you move to larger datasets, your memory can get bogged down which then leads to the SQL procedure being a little bit slower compared to the DATA step which will never take up too much memory space.

Notes:

  • This test was done on SAS Enterprise Guide 6.1 running on a SAS 9.4 server.
  • This is strictly for using the SQL procedure on SAS datasets. If you were to connect directly to a database and pull tables from there, then using a SQL pass-through would most likely be the most efficient way to run code.

Filtering is not so obvious between the two

When using each of these processes, a programmer is often sub-setting the data along with doing various calculations. Both the DATA step and the SQL procedure have different ways in how they filter out data which can be deceptive since each clause occurs in different spots of the data processing.

In the DATA step, the two filtering statements are WHERE and IF.

          

Each of these statements will not include a record unless the condition is met. The difference is that the where clause will eliminate records before they go into the PDV, the IF statement removes them before they get sent to the output buffer.

This is so that you can do calculations in the code and still remove records based on those calculations using the IF statement. The WHERE statement is used more for performance purposes to eliminate unwanted records as soon as possible.

The SQL procedure works similarly in that you have the where clause that will filter out records.

However, this statement works more like the IF statement in the DATA step in that it is processed during the data processing step and not filtered out beforehand.

It is important to note the FROM clause here because this clause is the first one that is executed in the process. If you are joining tables in the FROM clause you can specify variables after the ON clause according to what variables are to be joined and also anything that you want filtered out.

The SQL optimizer can also take a basic WHERE clause and pass it down to the data engine as if it occurred in the FROM clause. Filtering on calculations cannot be passed down though. So for performance purposes through filtering, the SQL procedure is easier to work with since it will intelligently know when it can and cannot filter out before execution.

Location and Conditions – What’s the environment look like?

This is a matchup, so there has to be a winner right? Yet, as you can see through the various sections, each process has its own advantages. The DATA step can handle processing large data calculations faster but the SQL procedure has the SQL optimizer that looks at the results you want and then determines the best way to execute. The true answer to this question relies on your environment and the settings that you are programming in.

If you are a traditional SAS programmer who is more comfortable with the DATA step and pressed on time, then using that is going to be best for your situation. On the other side, if your admins have set you up with enough memory to never have to worry about slowing the system down, then the SQL procedure might be better suited for your program. There’s always going to be certain situations where one might be better than the other. In the end, both techniques create/manipulate data in their own ways and finish up at the same conclusion.