Seven Best Practices Gathered Over a Quarter Century of SAS Deployments
06/02/2022 by Nick Welke Support
Zencos and its employees have worked with thousands of SAS customers over the last 30 years. Customers come from industries like manufacturing, retail, government, banking, insurance, gaming, energy, and telecommunications. What they all have in common is the need to keep their SAS software and hardware environment tuned and running smoothly. This extensive experience has led to a number of best practices that all SAS customers should follow.
While it’s true that logs, and particularly SAS logs, can be indispensable in troubleshooting an issue, they are only useful if they can be readily searched and curated. Many SAS logs are broken up into 24-hour chunks (a new log each day with the date in the filename), but many others grow indefinitely. These larger logs can be problematic for two reasons. First, they take up disk space with entries you most likely don’t need. Have you ever needed to know about all the user sessions started six months ago, let alone twenty-six months ago? Second, once you determine there is a problem, only a tiny percentage of the log entries are actually relevant to solving the issue. Of course, you can filter out the rest at that point, but why leave these extra steps for when you’re rushing to fix a problem? It’s much better to keep your logs recent and meaningful, with procedures that age off accumulated log files and start fresh instances of logs that don’t have an easy aging mechanism.
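As a rough illustration of the two procedures described above, here is a minimal Python sketch. The function names, the `*.log` pattern, and the 90-day retention are assumptions made for this example, not SAS or Zencos conventions. The rotation step copies and then truncates the file, which assumes the writing process opened the log in append mode and accepts that a few lines written during the copy could be lost.

```python
import gzip
import shutil
import time
from pathlib import Path


def age_off_logs(log_dir, max_age_days=90, pattern="*.log", now=None):
    """Delete files matching `pattern` whose last-modified time is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


def rotate_growing_log(log_path, stamp=None):
    """Archive a log that grows indefinitely to a dated .gz copy, then empty it in place."""
    log_path = Path(log_path)
    stamp = stamp or time.strftime("%Y-%m-%d")
    archive = log_path.with_name(f"{log_path.name}.{stamp}.gz")
    with open(log_path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Truncate instead of deleting so a process holding the file open keeps a valid handle.
    with open(log_path, "w"):
        pass
    return archive
```

Run from a scheduler, `age_off_logs` handles the dated daily logs and `rotate_growing_log` handles the ones that never roll over on their own. Calling `age_off_logs` again with `pattern="*.gz"` would age off the archives as well.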
Find out more about managing your logs.
We’ve all been there – that sinking feeling when a production system is not coming up after an update, Disaster Recovery (DR) exercise, or some unplanned event. It’s a daunting prospect to restore functionality to a system when you’re not sure what’s wrong or how long it will take. And usually, this happens when some critical need for the system is right around the corner – month-end processing or some high-visibility event. These are the moments when having a backup really pays for itself.
We’ve seen literally hundreds of systems that couldn’t come back up due to a drive failure, a patch that was pushed through the whole enterprise, or any of the multitude of circumstances that end in restoring from backups. Whether it’s a SAS metadata backup, a SAS deployment backup, a filesystem backup, or a system snapshot, a backup is the single best way to avoid turning a problem into a catastrophe. The best practice is to be sure to back up your system.
So backups are great, but a high percentage of the backups we’ve seen at various customer sites had some sort of error. Permissions, missing files or directories, short retention, incompatible encoding, and many more flavors of “We couldn’t restore it fully.” These possibilities are why a backup regime is incomplete until it has been tested as a successful restore point. It’s not enough to back up your data and software. It’s not enough to restore it all. You have to be able to use the restored environment and validate that everything is where it’s supposed to be. In most systems, you can restore to a secondary path or secondary machine and avoid impacting the production environment.
As with log management, this is best tested and refined when there is no urgency. There’s no worse time to find out that your backups are incomplete than when you have to rely on them to bring up a production system. Test your backups.
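One way to validate a restore to a secondary path is to compare checksums between the original tree and the restored copy. The sketch below is a minimal illustration in Python. The function names are hypothetical, and it checks file contents only, so the permissions, ownership, and encoding problems mentioned above would need separate checks.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_restore(source_root, restored_root):
    """Report files that are missing, unexpected, or different in the restored tree."""
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    return {
        "missing": sorted(set(src) - set(dst)),
        "unexpected": sorted(set(dst) - set(src)),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

An empty report means the contents match. Files that legitimately changed after the backup was taken will show up under `changed`, so the comparison is most meaningful against a quiesced source or a snapshot taken at backup time.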
Most computer systems are expected to last four to five years. Many last longer, but most depreciation models assume a five-year lifespan. Heavy processing loads, such as intensive floating-point or GPU computation (SAS and other statistical analysis, data modeling, machine learning, etc.), can reduce that lifespan. Similarly, heavy disk activity (reading and writing large amounts of data, something every SAS site we’ve ever worked with does) increases the probability of mechanical failure in spinning disks and of wear and heat-induced data errors in SSDs.
All this means that it’s important to monitor system performance from the very start. You can gauge how age and usage affect a given component’s performance with small, repeatable tests of that component. A simple piece of code can help monitor the whole system by testing all the major subsystems (CPU, memory, local I/O, and network I/O) consistently over the years. Like retirement, computer sunsetting doesn’t spontaneously happen. We can and should plan for it by having a sense of when systems are beginning to show their age and how we’ll migrate functionality and workloads to the next generation. Monitor and plan for system retirement.
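The kind of simple, repeatable test described above might look like the following Python sketch. It is only an illustration: the function names, workload sizes, and CSV layout are assumptions, the network I/O test is omitted because it depends on a site-specific peer, and the disk read-back will often be served from the operating system cache. What matters is running the same test the same way over time so the trend is comparable.

```python
import csv
import os
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path


def time_cpu(n=200_000):
    """Time a fixed floating-point loop."""
    start = time.perf_counter()
    total = 0.0
    for i in range(1, n + 1):
        total += (i ** 0.5) / i
    return time.perf_counter() - start


def time_memory(size_mb=32):
    """Time allocating a buffer and copying it once."""
    start = time.perf_counter()
    block = bytearray(size_mb * 1024 * 1024)
    copy = bytes(block)
    elapsed = time.perf_counter() - start
    del block, copy
    return elapsed


def time_disk(size_mb=16, directory=None):
    """Time writing, syncing, and reading back a temporary file."""
    payload = os.urandom(1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(size_mb):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())
        f.seek(0)
        while f.read(1024 * 1024):
            pass
    return time.perf_counter() - start


def record_run(csv_path, directory=None):
    """Append one timestamped row of timings so trends can be compared over the years."""
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "cpu_s": round(time_cpu(), 4),
        "memory_s": round(time_memory(), 4),
        "disk_s": round(time_disk(directory=directory), 4),
    }
    csv_path = Path(csv_path)
    write_header = not csv_path.exists()
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Scheduling `record_run` at a quiet hour and pointing `directory` at the filesystem you care about builds a history in which a slow upward drift in any column is an early sign that a component is aging.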
Aside from aging, the other mode of change for systems is growth – more users, more data, and more ways to process and utilize data. All our monitoring has to also allow for this growth and take into account how environments can grow in a healthy and effective way. Traditionally, computer scalability has been described as vertical or horizontal. Vertical scalability refers to adding more resources to a node or machine (more RAM, more CPUs, more disk). Horizontal scalability refers to adding more nodes or machines and distributing the workload to the added systems. Both approaches have pros and cons, but more importantly, they have their limits. In particular, there are physical limits on how much RAM or how many CPUs you can install into a server.
Likewise, there is a limit to how much benefit you get from adding more nodes, because the overhead of managing them grows with each addition. These considerations have been part of computer system architectures for decades. Still, there’s an additional layer that frees us (to an extent) from these concerns and brings in new considerations. Moving to the cloud makes both vertical and horizontal scaling a much more attractive undertaking and even allows scaling to the actual workload requirements in real time rather than buying and installing hardware to handle peak loads. We’ve seen many customers move at least some of their processing to cloud-based architectures, which has been a largely positive change. Plan for growth.
I often look for ways to describe esoteric software or hardware systems in non-computer terms because I’ve found that it helps me to better internalize what I’m talking about. This approach illuminates portions of the system that have a single point of failure. One component that absolutely cannot break, or a single person who holds essential business process knowledge, is a single point of failure that needs to be avoided. These single points of failure are equivalent to weak links in the chain of components that pull your business from a pile of data to refined and usable information. Weak links eventually break, but the computing field has done a lot to compensate for this. Computer systems now have a whole host of mature technologies to mitigate single points of failure. You can use clustered metadata and midtier nodes, load-balancing network switches, or RAID storage arrays to avoid issues. And like any chain, once you’ve strengthened the weakest link, the next weakest becomes the most likely to break. Process redundancies, backup restore tests, DR exercises, log management automation, and runbooks for human tasks are the next level of building resilience into your computer system. Find the weakest links and strengthen them.
All of the best practices above benefit from automation’s “force multiplier.” A single administrator can manage many systems by automating log management. A script can alert them when an error is found or report on anomalous activity. Backups are almost always automated to take place at the least intrusive hours, and they can also be mostly verified in an automated “test, restore, and compare” step.
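As a small example of the alerting script mentioned above, the Python sketch below counts problem lines in a log and decides whether to raise an alert. It assumes that problem lines begin with `ERROR` or `WARNING`, as they do in standard SAS program logs. Server logs with a different layout would need a different pattern, and the function names and threshold are hypothetical.

```python
import re
from collections import Counter

# Assumption: problem lines start with ERROR or WARNING (e.g. "ERROR: ..." or "ERROR 22-322: ...").
PROBLEM_LINE = re.compile(r"^(ERROR|WARNING)\b")


def scan_log(lines, max_samples=5):
    """Count ERROR and WARNING lines and keep the first few errors with their line numbers."""
    counts = Counter()
    first_errors = []
    for number, line in enumerate(lines, start=1):
        match = PROBLEM_LINE.match(line)
        if not match:
            continue
        level = match.group(1)
        counts[level] += 1
        if level == "ERROR" and len(first_errors) < max_samples:
            first_errors.append((number, line.rstrip()))
    return counts, first_errors


def should_alert(counts, max_errors=0):
    """Alert when the number of ERROR lines exceeds the allowed threshold."""
    return counts["ERROR"] > max_errors
```

Because `scan_log` accepts any iterable of lines, it can be fed an open file handle directly, and the result can be passed to whatever notification channel the site already uses.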
System monitoring cries out for automation, as most of us have better things to do than watch I/O or CPU metrics all day. Dynamic workload management is by definition an automated system, as are load balancers and many other components meant to improve system resilience. Where I want to focus, though, is on the repeatability of deploying consistent tools, aliases, and directory structures to ensure that adding someone to your team can happen with minimal overhead. If the environment they’re joining is laid out similarly to the others they work on, and the same commands return the same types of information, new team members can become effective immediately, even if they’ve never logged into that system before. Enabling this commonality has been the biggest benefit of adding consistent aliases and directory structures across our customer systems. Automate where possible.
Technology is advancing at a rapid rate, but at the end of the day, there are still things you need to do to keep both software and hardware up to date and running smoothly. Understanding the seven best practices outlined above is a good start. If you need assistance, find a partner with the right experience. Zencos has been working with SAS environments for over 30 years and can handle all your needs. Check us out at www.zencos.com.