SAS Visual Analytics: Understanding Correlations
11/29/2016 by Chris St. Jeor Modernization - Analytics
SAS Visual Analytics includes a data analysis feature that takes two measures and calculates the correlation between them, which helps you identify possible relationships between the measures.
In this short tutorial, you learn how SAS calculates and labels correlations, where you can find them in the objects, and how to interpret the possible relationships between the measures.
Correlations in SAS Visual Analytics are calculated by using Pearson’s product-moment correlation coefficient. This calculation takes two measures and determines how strongly they are linearly related.
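To make the calculation concrete, here is a minimal Python sketch of the Pearson product-moment coefficient. It is not SAS code, and the ERA and win percentage values are made-up toy numbers used only for illustration, not values from the baseball dataset discussed later.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each measure
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a measure is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values (not real team statistics)
era = [3.2, 3.8, 4.1, 4.6, 5.0]
win_pct = [0.60, 0.55, 0.52, 0.47, 0.41]

r = pearson_r(era, win_pct)
print(f"r = {r:.3f}")
```

Because the toy win percentages fall as the toy ERA values rise, the result is negative and close to -1.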
The correlation value ranges from -1 to 1. Values below 0 indicate a negative relationship, which means that as one of the measures increases the other decreases. A correlation of 0 indicates no linear relationship at all.
Values above 0 indicate a positive relationship, which means that as one measure increases so does the other. SAS rates each correlation as Weak, Moderate, or Strong according to where its value falls within these ranges.
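The sign and strength reading described above could be sketched in Python as follows. The cutoffs `weak_max=0.3` and `moderate_max=0.6` are assumptions chosen for illustration, and `describe_correlation` is a hypothetical helper. This article does not state the exact boundaries SAS uses, so check the SAS documentation before relying on them.

```python
def describe_correlation(r, weak_max=0.3, moderate_max=0.6):
    """Return (direction, strength) for a correlation value r in [-1, 1].

    weak_max and moderate_max are ASSUMED cutoffs on |r|, not values
    taken from the article or confirmed against SAS.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")

    # Direction comes from the sign of r
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # Strength comes from the magnitude of r
    magnitude = abs(r)
    if magnitude <= weak_max:
        strength = "Weak"
    elif magnitude <= moderate_max:
        strength = "Moderate"
    else:
        strength = "Strong"
    return direction, strength

print(describe_correlation(-0.53))
print(describe_correlation(0.85))
```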
Correlations between two measures can be calculated in the correlation matrix or through a linear fit line in the heat map and scatter plot.
In a correlation matrix, the Roles tab offers two options under Show Correlations for displaying the correlations that you want. The within one set of measures option takes a set of measures and displays them in a matrix against themselves so that, in a triangle format, you see each measure’s correlation with every other measure.
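As a rough illustration of why the layout is a triangle, the following Python sketch correlates each measure with every other measure exactly once. The function names and the data values are hypothetical. Only the measure names are borrowed from this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def triangle_correlations(measures):
    """Correlate every measure with each measure listed before it.

    Skipping self-pairs and mirrored pairs leaves n * (n - 1) / 2 cells,
    which is the lower triangle of the full n-by-n matrix.
    """
    names = list(measures)
    pairs = {}
    for i, row_name in enumerate(names):
        for col_name in names[:i]:
            pairs[(row_name, col_name)] = pearson_r(
                measures[row_name], measures[col_name]
            )
    return pairs

# Hypothetical toy values, one entry per team season
measures = {
    "WinPct": [0.40, 0.45, 0.50, 0.55, 0.60],
    "Hits": [1350, 1380, 1420, 1400, 1460],
    "ERA": [4.9, 4.5, 4.2, 4.0, 3.6],
}

triangle = triangle_correlations(measures)
for (row, col), r in triangle.items():
    print(f"{row} vs {col}: {r:+.2f}")
```

With three measures the sketch produces three cells. Five measures would produce ten.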
In the following figure, we measure seasonal team baseball statistics against one another. This dataset combines all team seasons from 1921-2009 and totals up team statistics such as Hits, Home Runs, ERA, and so on.
After adding in WinPct (Win Percentage), Hits, ERA (Earned Run Average = Measure of earned runs given up per 9 innings), FieldPct (Fielding Percentage = Measure of successful defensive plays), and OnBasePct (On Base Percentage = Measure of times a batter gets on base per plate appearance) to the measures in the Roles tab, we get our matrix of correlations.
The legend bar at the bottom shows how the colors map to the strength of the correlations. If you hover over any of the boxes, you see a data point box that lists the two measures that were correlated, the correlation value, and how SAS categorizes that correlation.
In the following example, Hits and OnBasePct have a strong correlation, which makes sense because every hit that a batter gets directly influences their on-base percentage (OnBasePct).
Now let’s look at something that might be useful for our analysis. A high win percentage (WinPct) is the goal of every baseball team, since a team needs one of the top win percentages to make the playoffs each year. In the next figure, the between two sets of measures option is chosen, and WinPct is placed on the X-axis. The Y-axis is then filled with all of the measures that we want to compare, to see which statistic is most heavily correlated with WinPct.
This option trims the matrix so that you see only the set of correlations that you want to compare. You can add more measures to the X-axis, but the point is that you avoid the full matrix that the within one set of measures option produces.
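The two-set comparison can be sketched in Python as a rectangular grid: every X-axis measure is paired with every Y-axis measure, and nothing else is computed. The helper names and all data values below are hypothetical and serve only to illustrate ranking the Y-axis measures by the strength of their correlation with WinPct.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def cross_correlations(x_measures, y_measures):
    """Correlate each X-axis measure with each Y-axis measure."""
    return {
        (x_name, y_name): pearson_r(x_vals, y_vals)
        for x_name, x_vals in x_measures.items()
        for y_name, y_vals in y_measures.items()
    }

# Hypothetical toy values, one entry per team season
x_measures = {"WinPct": [0.40, 0.45, 0.50, 0.55, 0.60]}
y_measures = {
    "Hits": [1350, 1380, 1420, 1400, 1460],
    "ERA": [4.9, 4.5, 4.2, 4.0, 3.6],
    "FieldPct": [0.978, 0.981, 0.980, 0.983, 0.984],
}

grid = cross_correlations(x_measures, y_measures)

# Rank by magnitude so strong negative correlations are not pushed to the bottom
ranked = sorted(grid.items(), key=lambda item: abs(item[1]), reverse=True)
for (x_name, y_name), r in ranked:
    print(f"{x_name} vs {y_name}: {r:+.2f}")
```

Sorting on the absolute value matters here because a measure such as ERA is expected to correlate negatively with winning.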
In the heat map and scatter plot objects, there is an option to add in a fit line to your chart. If you add a linear fit line or get the linear fit line as the best fit, then the analysis tab at the bottom of the object also calculates the correlation value between the two measures being analyzed. In the scatter plot below, we look at WinPct and ERA.
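The link between a linear fit line and the correlation value can be shown with ordinary least squares, as in the Python sketch below. This is the textbook relationship, not a description of how SAS computes its analysis tab, and `linear_fit` and the data values are hypothetical.

```python
import math

def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r shares its numerator with the slope, so the two always have the same sign
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical toy values (not real team statistics)
era = [3.2, 3.8, 4.1, 4.6, 5.0]
win_pct = [0.60, 0.55, 0.52, 0.47, 0.41]

slope, intercept, r = linear_fit(era, win_pct)
print(f"WinPct = {slope:.3f} * ERA + {intercept:.3f}, r = {r:.3f}")
```

The slope describes how steep the fit line is, while the correlation describes how tightly the points cluster around it.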
With a large dataset and a strong correlation between two measures, you might assume that you have found a real relationship between them. That is not always the case. The phrase “correlation does not equal causation” is common in the field of statistics. It means that just because two measures have values that are related, which is what correlation measures, the concepts behind the measures do not necessarily have a direct relationship. There are many different forms of apparent relationship between data items. In his book Now You See It, Few breaks correlations down into one of four possibilities:
So in the previous two figures, we were looking at win percentage against other measures to see which ones were the most correlated. In the correlation matrix, each of the five measures has a moderate relationship with win percentage.
This makes sense because all of those measures have an influence on the outcome of the game. ERA had the strongest correlation at -0.53. This means that as a team’s pitchers give up fewer runs on average, we would expect the team to have a higher win percentage.
The correlation is consistent with a lower ERA leading to a higher win percentage, which we know to be accurate based on the rules of baseball. In this example, the correlation between the values of the measures did reflect a conceptual relationship between the two measures.