Use Fairshare in LSF and Get Better Performance
06/03/2021 by Brian Kowalczyk
I started the day intending to write a blog about getting better performance in a SAS environment. Several blogs have already been written on topics such as tuning system options, structuring databases, and coding efficiencies, as well as diagnosing slow performance. Below are links to a few excellent reading sources covering these topics:
So rather than rehashing several blogs that have already covered this topic, let us look at how LSF Fairshare can get us better performance for our critical business needs.
Fairshare scheduling has been around in various forms since the early years of computing. In its earliest form, fairshare scheduling wasn’t even software.
In the early days of computing, computers had very limited processing power and were expensive. There were no options to burst processing into a cloud. It was typical for a department to share a single computer between several individuals. Since demand for computing resources often exceeded what was available, time slots were evenly distributed to individuals to ensure that computing resources were shared in a fair manner. When a developer’s time slot ended, the developer often had to wait a day or more before having access to the computer again. This gave developers plenty of preparation time to make sure that they had well-organized, efficient code.
As technological advances allowed several users to connect to large mainframe computers through dumb terminals, physical sharing became obsolete. The National Science Foundation was funding supercomputer centers at universities and networking these centers with other universities. Digital data collection and data analysis were expanding, and the need to share these interconnected resources fairly grew rapidly. Fairshare scheduling software was born.
Fairshare scheduling software distributes processing equally among its users. If there were four users on a system, each of them would get 25% of the processing power. When a user runs multiple processes, they all run, but each process gets only a fraction of that user’s 25%. So, if the user had two processes running, each would get 12.5% of the processing power.
Fairshare scheduling can also be applied at the group level. When four groups are given equal shares, the processes from all the users in a group together get 25% of the processing power. So, when one user in the group is running four processes and another user in the group is running a single process (a total of five processes for the group), each process would get 5% (25% / 5 processes) of the processing power.
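The arithmetic in the two examples above can be checked with a few lines of Python. This is only a sketch of the equal-split idea described here; the function name `per_process_share` is made up for illustration and is not part of LSF.

```python
def per_process_share(equal_parties: int, processes: int) -> float:
    """Percent of total processing power each process receives when
    `equal_parties` users (or groups) hold equal shares and one of them
    runs `processes` processes at once."""
    if equal_parties <= 0 or processes <= 0:
        raise ValueError("both counts must be positive")
    party_share = 100.0 / equal_parties  # e.g. 25% for four users or groups
    return party_share / processes       # split evenly across that party's processes


# One of four users running two processes
print(per_process_share(4, 2))  # 12.5
# One of four groups running five processes in total
print(per_process_share(4, 5))  # 5.0
```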
Fairshare scheduling can also be applied as a combination of groups and individual users. Perhaps user A and user B are doing advanced analytics requiring higher processing power. Fairshare is flexible enough to say user A gets 25%, user B gets 25%, group C gets 25%, and group D gets 25%.
However, the shares do not need to be equal. Group C could be given twice as many shares as group D. This flexibility is what allows us to get computing priorities set up to match business priorities. In the Phoenix Project novel, we realize how critical it is to have these aligned. Most people in IT have read the Phoenix Project (one of my favorite novels about DevOps), but if you have not, pick up a copy here:
Additionally, fairshare is not a constraint on the maximum resources that can be used by a group or individual. It is a sharing mechanism for when processing power is being constrained.
As an example, when we have a set with four groups with equal shares, but only two of the groups are running processes, then each of the groups would be able to use 50% of the processing power. When one of the groups does not have enough processing going on to use 50% of the processing power, then the other group would be able to go beyond 50% of the processing power to use any processing power unused by the other group.
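The sharing behavior in this example can be modeled in a short Python sketch. It is a simplified illustration of proportional sharing with leftover capacity handed back to the groups that still want it. It is not how LSF computes priorities internally, and the names `allocate`, `shares`, and `demand` are assumptions made for this example.

```python
def allocate(shares: dict[str, float], demand: dict[str, float],
             capacity: float = 100.0) -> dict[str, float]:
    """Split `capacity` among groups in proportion to their shares,
    never giving a group more than it demands, and redistributing
    anything a group leaves unused to the groups that still want more."""
    alloc = {g: 0.0 for g in shares}
    active = {g for g in shares if demand.get(g, 0.0) > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total = sum(shares[g] for g in active)
        # Groups whose remaining need fits inside their proportional offer
        satisfied = {g for g in active
                     if demand[g] - alloc[g] <= remaining * shares[g] / total}
        if not satisfied:
            # Everyone wants more than their offer: split what is left by shares
            for g in active:
                alloc[g] += remaining * shares[g] / total
            break
        for g in satisfied:
            need = demand[g] - alloc[g]
            alloc[g] += need
            remaining -= need
        active -= satisfied
    return alloc


equal = {"A": 1, "B": 1, "C": 1, "D": 1}
# Only A and B are running work, and both want everything
print(allocate(equal, {"A": 100, "B": 100}))
# {'A': 50.0, 'B': 50.0, 'C': 0.0, 'D': 0.0}
# B only needs 30%, so A picks up the unused capacity
print(allocate(equal, {"A": 100, "B": 30}))
# {'A': 70.0, 'B': 30.0, 'C': 0.0, 'D': 0.0}
```

Unequal shares work the same way in this model: giving one group twice the shares doubles its proportional offer whenever both groups are competing.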
Some organizations think that they have a large LSF grid environment and when resources start getting constrained, they can scale out to additional machines. Furthermore, this environment can handle the growth of the company so that critical parts of the business can continue to function. Since the environment is not constrained for long periods of time, these organizations do not see a need for using fairshare. However, this mentality can be hiding inefficiencies.
What happens when a team that was using a small standalone SAS install is added to a large SAS/Grid environment shared with others? When the load on SAS/Grid is low, that team will have much more processing capacity than it had in the previous standalone environment. This may make the team less concerned about coding efficiency. That lack of concern will put a heavier load on the SAS/Grid environment than the team ever put on its smaller standalone environment.
While it may make sense to add that team to the larger SAS/Grid instead of being on a standalone install, it does not make sense for them to be using processing power that should be used for more critical business needs.
This is where fairshare can be used to make sure more critical business needs are given proper resources. By using fairshare, the more critical business needs can be assured a certain amount of processing power regardless of how much the migrated team attempts to use.
So, how do we implement fairshare in LSF? In LSF, fairshare can be implemented at either the queue level or the host level. These configurations are mutually exclusive, so pick one rather than trying to configure both. To have a user’s priority in one queue depend on activity in other queues, configure host-level fairshare. To have a user’s priority determined only by activity in a specific queue, configure queue-level fairshare.
To configure at the queue level, modify the LSB_CONFDIR/cluster_name/configdir/lsb.queues file.
To configure at the host level, modify the LSB_CONFDIR/cluster_name/configdir/lsb.hosts file. The syntax is straightforward:
More information on the FAIRSHARE= parameter can be found at:
Be sure to assign a few shares to the default or others keyword in the user_or_group position. Otherwise, only the individuals and groups specifically mentioned will be able to use the host or queue.
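As a hedged illustration of that advice, the Python sketch below builds a queue-level share assignment and refuses to produce one that lacks a default or others entry. The output follows the `FAIRSHARE = USER_SHARES[[user_or_group, number_shares] ...]` form as I understand it from the LSF documentation for lsb.queues; the helper name `fairshare_line`, the group names, and the share counts are all invented for this example, so check the result against the documentation for your LSF version.

```python
def fairshare_line(shares: dict[str, int]) -> str:
    """Render a queue-level FAIRSHARE line for lsb.queues from a mapping
    of user or group name to number of shares (illustrative helper)."""
    if "default" not in shares and "others" not in shares:
        raise ValueError(
            "add a 'default' or 'others' entry so unlisted users are not shut out"
        )
    entries = " ".join(f"[{name}, {count}]" for name, count in shares.items())
    return f"FAIRSHARE = USER_SHARES[{entries}]"


# Hypothetical groups: groupC gets twice the shares of groupD
print(fairshare_line({"groupC": 60, "groupD": 30, "default": 10}))
# FAIRSHARE = USER_SHARES[[groupC, 60] [groupD, 30] [default, 10]]
```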
The number of shares assigned to each user is only meaningful when you compare it to the shares assigned to other users or to the total number of shares. The total number of shares is just the sum of all the shares in a share assignment. The sum can add up to any number. I prefer to have the shares add up to a hundred, which makes it easy to read each value as a percentage. When a group has 20 shares and the shares sum to a hundred, you can quickly tell that the group has 20% of the shares.
After making the changes to the configuration, be sure to run either badmin reconfig or badmin mbdrestart for the changes to take effect.
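To see why only the ratio matters, here is a small Python sketch that converts any share assignment into percentages of the total. The function name and the group names are assumptions for illustration only.

```python
def share_percentages(shares: dict[str, int]) -> dict[str, float]:
    """Convert raw share counts into each entry's percentage of the total."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {name: 100.0 * count / total for name, count in shares.items()}


# Shares that add up to 100 read directly as percentages
print(share_percentages({"groupA": 20, "groupB": 50, "default": 30}))
# {'groupA': 20.0, 'groupB': 50.0, 'default': 30.0}

# The same ratios with a different total give the same split
print(share_percentages({"groupA": 2, "groupB": 5, "default": 3}))
# {'groupA': 20.0, 'groupB': 50.0, 'default': 30.0}
```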
Computing resources continue to get cheaper and more powerful, but that doesn’t mean you can ignore efficient usage. Understanding how to use tools like Fairshare ensures that critical business needs get the processing and priority they require. And because your computing resources are being used more efficiently, the need to upgrade will be farther down the road.