Optimizing AML Alert Generation: Getting Your Sample Size Right

04/15/2019 by Calvin Crase Financial Crimes

In the anti-money laundering (AML) and fraud detection world, many of us run programs (or scenarios) that detect potentially illicit financial behavior. A scenario compares customer behavior against predetermined threshold values. When the behavior meets or exceeds a threshold, the scenario generates an alert on the entity or person involved.

The question left to answer is how an institution determines the appropriate threshold. Answering that question is our goal today, and it matters: roughly 95% of the alerts generated are false positives, so minimizing the time analysts spend working them saves real time and real money.

For example, let’s suppose we have what we call a structuring scenario. This is where someone deposits small (though maybe not so small for folks like you and me) amounts of money several times into the same account to avoid a currency transaction report (CTR) filing. In our structuring scenario, we create a variable called AGGREGATE_AMOUNT that totals the money this person deposits across all their smaller transactions, and we compare that aggregate amount against a threshold value. The threshold value in our example is therefore a dollar amount.
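To make the comparison concrete, here is a minimal sketch of the aggregate-and-compare step. The account IDs, deposit amounts, and function names are made up for illustration, and a real scenario would also restrict the deposits to a lookback window and to cash transactions, which this sketch leaves out.

```python
from collections import defaultdict

# Hypothetical deposit records: (account_id, deposit_amount_in_dollars).
deposits = [
    ("A-100", 4_000.00),
    ("A-100", 3_500.00),
    ("A-100", 3_000.00),
    ("B-200", 250.00),
    ("B-200", 600.00),
    ("C-300", 9_500.00),
]


def aggregate_amounts(records):
    """Sum deposits per account (the AGGREGATE_AMOUNT variable)."""
    totals = defaultdict(float)
    for account_id, amount in records:
        totals[account_id] += amount
    return dict(totals)


def generate_alerts(records, threshold):
    """Return the accounts whose AGGREGATE_AMOUNT meets or exceeds the threshold."""
    totals = aggregate_amounts(records)
    return sorted(acct for acct, total in totals.items() if total >= threshold)


alerts = generate_alerts(deposits, threshold=10_000)
print(alerts)  # ['A-100']
```

Account A-100 never makes a single deposit near $10,000, but its three deposits total $10,500, so it is the only account that alerts at this threshold.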

We want to make sure we are setting that dollar amount optimally. Should it be $10,000? Should we raise it higher? Or should we set it lower, at $9,000? $8,000? Let’s walk through a scenario where we test whether we are focused on the right targets.

Implementing a Below-The-Line Testing Strategy

Below-the-line testing is a method of relaxing the threshold value your scenario currently runs with in order to generate additional alerts, so you can make sure you aren’t missing some true positives. Reviewing the results can also show whether your threshold is unnecessarily low and is generating false positives. In our example, suppose we re-run the scenario with a threshold value of $1,000 instead of $10,000. This means that any person who conducts any number of transactions aggregating to $1,000 or more will now generate an alert.

With the lower threshold, new alerts would be generated that would not otherwise have been generated. With these new alerts, we want to determine if $1,000 is closer to reflecting the risks your business faces, or if $10,000 or something else entirely is more appropriate. Reviewing these new alerts helps evaluate whether the original threshold was appropriate, or the new threshold would be an improvement.
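As a rough sketch, the below-the-line population is simply the set of alerts that appear at the relaxed threshold but not at the production threshold. The aggregate amounts and the function names here are invented for illustration.

```python
# Hypothetical AGGREGATE_AMOUNT per account, in dollars.
aggregate_amount = {
    "A-100": 10_500.00,
    "B-200": 850.00,
    "C-300": 9_500.00,
    "D-400": 1_200.00,
    "E-500": 15_000.00,
}


def alerted(amounts, threshold):
    """Accounts whose aggregate amount meets or exceeds the threshold."""
    return {acct for acct, total in amounts.items() if total >= threshold}


def below_the_line(amounts, production_threshold, test_threshold):
    """Accounts that alert only because the threshold was relaxed."""
    if test_threshold >= production_threshold:
        raise ValueError("test threshold must be below the production threshold")
    relaxed = alerted(amounts, test_threshold)
    production = alerted(amounts, production_threshold)
    return sorted(relaxed - production)


new_alerts = below_the_line(aggregate_amount, 10_000, 1_000)
print(new_alerts)  # ['C-300', 'D-400']
```

These new alerts are the ones to sample and review. The accounts that already alerted at $10,000 are worked through the normal process.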

How to Get to an Optimal Sample Size and Clustering as Stratified Sampling

Truman hearing the good news. Picture: Wikipedia

Our purpose now is to figure out how many of the alerts generated at this new threshold are going to be something we’ll call ‘productive’. We’ll call a productive alert an alert that you really think might represent illicit activity. A non-productive alert would be something where if you were to look at this person’s bank account and investigate them you would say that they probably got a nice check in the mail from grandma for their birthday. They don’t need to be put in jail for that.

There are usually too many alerts to review them all. So, we’ll need to review a subset of the alerts, or a sample, as a way of evaluating the group as a whole. We need to choose these subsets so that we can use them to meaningfully judge the rest of the alerts. There is a great deal of variation across all the alerts, and if we don’t partition our alerts into groups, then our sample may not reflect the population. We will implement stratified sampling.

It’s important that your sample is representative of the group. When Gallup polled the country to evaluate the Truman/Dewey presidential race, a commonly cited explanation for the miss is that the polling failed to take into consideration that telephones were generally limited to the more well-off and Truman was less popular among affluent voters. This skewed the sampling results and led Gallup to an incorrect conclusion. This famously incorrect prediction is captured in the image above.

To avoid making a mistake similar to the newspapers’, we partition our population into groups based on where they fall in the distribution. But we have a problem. We don’t have a nice binary indicator that says, ‘this person has a phone’ or ‘this person doesn’t have a phone’ like the folks surveying during the 1948 presidential election might have leveraged. We’re using something that we call a ‘continuous’ variable, namely dollar amount. This variable is distributed over a range of values, not just two. Our distribution is typically going to be right skewed, with fewer people depositing lots of money and lots of people making smaller deposits. But we still need to make sure we split them into groups. Otherwise, when we sample, we’re just going to get a bunch of people over on the left side of the distribution and maybe not get other people (because there are fewer of them) on the right side. But we don’t want to omit the people on the right side, because hey, they might tilt the election.

So, with that in mind, our goal here is to minimize variation within each group with respect to our threshold variable, i.e. dollar amount. For example, some people may have transactions around the $8,000 range and a lot of people may have transactions in the $100 range. They can be thought to represent two different groups or clusters that we’d like to sample within.

Instead of having an analyst work through this and choose based on their own intuitions about what seems like the appropriate place to demarcate a group, we use a clustering procedure. Different programming languages have different procedures or functions that will operate slightly differently. The crucial takeaway here is that we need to group individuals, and this clustering procedure lets us separate them into delineated groups even when there appears to be no obvious way to do it. That is, it chooses the dollar-amount boundaries so that within-group variation is minimized for the groups it chooses. So, to press the metaphor above even further: we now have households with phones and households without phones, and we can sample within these groups more accurately.
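The article does not name a specific clustering procedure, so the following is only one possible sketch: a plain one-dimensional k-means (Lloyd’s algorithm), which assigns each amount to its nearest group center and thereby reduces within-group variation. The function name, the choice of three groups, and the sample amounts are all assumptions for illustration. In practice you would likely reach for your statistics package’s clustering routine instead.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a single variable; returns (labels, centers)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins the group with the nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its group.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers


# Hypothetical aggregate amounts (dollars) for below-the-line alerts.
amounts = [90, 110, 120, 95, 105, 2_000, 2_200, 1_900, 8_000, 8_300, 7_900]
labels, centers = kmeans_1d(amounts, k=3)

# Collect the amounts that fall in each stratum.
strata = {}
for amount, label in zip(amounts, labels):
    strata.setdefault(label, []).append(amount)

for label in sorted(strata):
    group = strata[label]
    print(f"stratum {label}: n={len(group)}, range=${min(group):,} to ${max(group):,}")
```

Each resulting stratum can then be sampled separately, so the sparse high-dollar group is not crowded out by the many small-dollar alerts.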

With our clusters established, we want to sample from within our groups. Our first question is, how large should our sample be? To determine the correct sample size, we can use the normal approximation of the binomial distribution. We can do this because for each alert we’re asking ‘is this alert productive?’ or ‘is this alert not productive?’ The binary nature of this question allows us to leverage the binomial distribution, and for a large enough sample the binomial distribution looks a lot like the normal distribution. The sampling algorithm is straightforward and easy to understand. It is based on a predetermined productivity rate, precision, and confidence level that can be set based on industry-accepted criteria or tailored to specific business use cases. Industry statisticians and regulators typically advise 95% confidence, 5% precision (margin of error), and an expected productivity rate of 5%.
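Under the normal approximation, the usual sample size formula for a proportion is n = z² · p(1 − p) / e², where p is the expected productivity rate, e is the precision, and z is the normal quantile for the chosen confidence level. The sketch below applies that formula. The function name is made up, and the optional finite population correction is an addition that the article does not mention, included only because a group of alerts can be small.

```python
import math
from statistics import NormalDist


def sample_size(productivity=0.05, precision=0.05, confidence=0.95, population=None):
    """Sample size for estimating a proportion via the normal approximation."""
    # Two-sided z-value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z**2 * productivity * (1 - productivity) / precision**2
    if population is not None:
        # Optional finite population correction (an assumption, not from the article).
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


print(sample_size())  # 73
```

With the commonly advised settings of 95% confidence, 5% precision, and 5% productivity, the formula calls for reviewing 73 alerts. Whether that number applies to each group or to the population as a whole is a design decision for your program.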

Working the Alerts

Our final step is what can be termed ‘working the alerts’. We can give the sample alerts to the analysts who typically would investigate an alert to determine if each one represents an actual risk or a false positive. This process is visually represented in the following figure.

  • Red triangles represent alerts that after investigation are considered an actual risk.
  • Blue circles represent false positives or unproductive alerts.
  • Green squares are observations that we have not sampled.

So, if the threshold for the scenario above in image 2 was set to $250,000 prior to beginning your below-the-line testing, then it may be prudent to reconsider and lower this threshold to something like $200,000. Conversely, if it was set to $150,000, then it may make sense to raise the threshold to avoid investigating false positives in the future. As mentioned above, making sure that these scenarios are properly tuned is important not only for compliance and regulatory purposes. It also saves employees a lot of time.
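One way to turn the worked sample into a threshold decision is to tabulate how many sampled alerts in each group turned out to be productive. The sketch below does that with invented group labels and analyst dispositions. It is an illustration of the bookkeeping, not a prescribed decision rule.

```python
from collections import Counter

# Hypothetical worked sample: (group_label, analyst_found_it_productive).
worked = [
    ("low", False), ("low", False), ("low", False), ("low", False),
    ("mid", False), ("mid", False), ("mid", True), ("mid", False),
    ("high", True), ("high", True), ("high", False), ("high", True),
]


def productivity_by_group(results):
    """Share of sampled alerts in each group that analysts marked productive."""
    sampled = Counter(group for group, _ in results)
    productive = Counter(group for group, is_productive in results if is_productive)
    return {group: productive[group] / sampled[group] for group in sampled}


rates = productivity_by_group(worked)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} productive")
```

If a group sitting just below the current threshold shows a meaningful share of productive alerts, that is evidence for lowering the threshold to include it. If the groups just above the threshold are almost entirely unproductive, that is evidence for raising it.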

The Best Decision for Your Financial Institution

When there are too many false positives, analysts are wasting time and money. Yet regulators need evidence of a repeatable methodology in your transaction-monitoring procedures and the processes in place for setting the threshold for scenarios.

This process is useful because conducting below-the-line testing is a requirement. However, providing a statistically defensible and repeatable procedure can be difficult and time consuming if not done properly. The approach detailed above is designed to meet the expectations of regulators and is flexible enough for your business needs.