Troubleshooting SAS Platforms Using a Holistic Approach
08/08/2016 by Nick Welke Modernization - Infrastructure
"I don’t know, something is just not right." A system administrator might cringe at those words. They can mean a full-blown system meltdown is imminent. They may also mean only that a data table is missing.
Sometimes the SAS log provides an error code. This is good. You know what widget to repair. When a general error occurs, you may not know what to do next. For instance, the log might state only that a connection is not available. The log is, in essence, saying: something is just not right, dude.
What should your next step be?
Errors in the log show that an issue exists. Your job is to find it. Using a systematic method, you can find it more quickly. When studying networking, you will most likely find yourself reviewing the Open Systems Interconnection (OSI) model. The OSI model breaks communication apart into layers. Each layer builds on the one below it.
The idea is that unless each layer meets its requirements, there is no way for the next layer to be successful. When you think of communications, it is clearer that each layer relies on the previous layer. This is only a conceptual model. In the real world, you find tools that go across several layers or cannot be segmented into one layer.
We can extrapolate the OSI model into a more general system view. In this diagram, the model is simplified into five layers, plus an added Layer 8, also known as the User layer. The user’s knowledge level and requirements add complexity or confusion.
Image credit: Gvseostud (Own work) via Wikimedia Commons
Using this model as an example, you can take a systems approach to your troubleshooting efforts. To the above model, I added a typical SAS environment. This SAS 9.4 environment uses several tools, such as SAS Enterprise Guide and SAS Visual Analytics.
There is a reason I noted the layers. When you are troubleshooting and do not know what the issue is, you first have to find out where it is. For effective troubleshooting, start at the bottom layer and prove that it works, then move up one layer at a time.
For instance, you may do the following tests:
While some of these tests might seem silly to discuss, imagine if one of them turned out to be the cause of the failure. Checking it early can lead you down the discovery path a lot more quickly. If you keep the layers in mind, you can eliminate guesswork. Since you know what is right, you can isolate what is wrong faster.
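The bottom-up approach can be sketched in a few lines of Python. This is only an illustration, not a SAS tool: the function name `first_failing_layer` and the idea of passing each test as a small callable are assumptions made for the example. You supply one check per layer, ordered from the bottom up, and the sketch reports the first layer that cannot be proven to work.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (layer name, test function) pair. Both are hypothetical:
# you decide what each layer is called and how to test it.
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run the checks in order (bottom layer first).

    Returns the name of the first layer whose check fails,
    or None if every layer passes.
    """
    for name, check in checks:
        try:
            ok = check()
        except Exception:
            # A check that crashes has not proven the layer works.
            ok = False
        if not ok:
            # Stop here: layers above this one depend on it,
            # so their results would not be trustworthy yet.
            return name
    return None
```

Stopping at the first failure is deliberate. Errors reported by higher layers are often only symptoms, so there is little value in chasing them until the layer beneath them is known to be good.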
Here’s a recent example we experienced and how the ZenGuard team applied this method.
The customer complained about slow system performance and inconsistent connections. The logs showed a wide array of errors from the various applications. The issue was not always repeatable; it would come and go. This made the troubleshooting task more difficult. The initial tests using ping did not reveal any issues with connectivity. Once we set up a longer ping test, we started seeing breaks in the network connectivity. This led us to a network interface controller (NIC) that would only work with certain VM hosts.
Once we suspected hardware issues in the lower layers, we were able to prove our theory. In this case, the issue was near the bottom layer but was masking itself as issues in the upper layers. This is an example of how understanding each layer, and confirming that it is working before moving to the next, leads you to the real cause.
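A longer-running connectivity test of the kind described in this example could look something like the following Python sketch. It is not the script the team used. It swaps ping for a plain TCP connection attempt, because sending ICMP from Python usually needs elevated privileges, and the function names (`tcp_probe`, `sample`, `longest_outage`, `loss_rate`) are made up for the illustration. The point it shows is that an intermittent fault only becomes visible when you collect many samples over time.

```python
import socket
import time
from typing import Callable, List


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def sample(probe: Callable[[], bool], count: int, interval: float = 1.0) -> List[bool]:
    """Call probe() count times, waiting interval seconds between calls."""
    results = []
    for i in range(count):
        results.append(probe())
        if i < count - 1:
            time.sleep(interval)
    return results


def longest_outage(results: List[bool]) -> int:
    """Length of the longest run of consecutive failed probes."""
    longest = current = 0
    for ok in results:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def loss_rate(results: List[bool]) -> float:
    """Fraction of probes that failed (0.0 for an empty list)."""
    return results.count(False) / len(results) if results else 0.0
```

To use it, you would pass something like `lambda: tcp_probe(host, port)` to `sample`, with a host and port from your own environment and a count large enough to cover the window in which the problem comes and goes. A handful of probes can easily report a clean result while an hour of samples shows repeated short outages.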
A healthy dose of suspicion helps in your detective work as well!