The Usual Suspects: Money Laundering and Entity Resolution
06/16/2020 by Erik Meyer
“The greatest trick the devil ever pulled was convincing the world he didn’t exist,” is a quote from the movie The Usual Suspects. The greatest trick any money launderer can pull? To conceal their identity so you never know they exist.
Fighting money laundering and financial crime is about uncovering suspicious activity, and this can be difficult when the ‘suspects’ try to conceal their true identity. Do you really know who your customers are? This knowledge is essential, and one important part of an anti-money laundering or fraud solution is a solid entity resolution (ER) process. How can you group seemingly unrelated parties together and uncover the (un)usual suspects?
Learn how to use regular SAS code in any project to assess the data quality, standardize your data and resolve the entities (e.g., customers). However, you will take advantage of SAS Quality Knowledge Base (QKB) for some of the steps in the ER process (any function/procedure that starts with ‘DQ’). And QKB does require a separate license. An entity resolution process should be part of a data management and data quality solution, in addition to a broader Customer Due Diligence (CDD) framework.
Before you start to build or develop any data solution, you need to get to know the data. Column profiling is a useful method that provides statistical measurements associated with the frequency distribution of data values and patterns, such as count, unique count, null count, and pattern count. This can give insightful information about your data and reveal the usual (data quality) suspects, such as outliers and missing values. There are several different SAS procedures you can use to evaluate the content of variables. Below, proc sql is used to look at a field you will later use in the ER process:
proc sql; select party_id_type, count(party_id) as count_party_id, nmiss(party_id) as missing_party_id from work.inndata; group by party_id quit;
In SAS Studio, the default output result looks like this:
Are the results as expected? Use your domain knowledge (or get insight from the right people) to evaluate. Maybe you need to improve or transform the data before continuing the process?
For a numeric variable, you can also use proc means to get some summary statistics:
proc means data=inndata n nmiss min mean median max std maxdec=0; var amount; run;
Default output looks like this:
Again, assess the data quality! Do the values look correct? Do you have outliers? Are missing values expected?
Another procedure that can be useful is proc freq that produces one- way to n-way frequency and cross-tabulation tables:
proc freq data=inndata; tables state_code / missing nocum nopercent; run;
Default output looks like this:
In addition to missing values, non-existing state codes might cause problems.
Lastly, look at the DQPATTERN function; this function returns a pattern analysis from an input character value.
select count(*) as freq, dqpattern(date_of_birth,‘character’) as dob_pattern
group by dob_pattern;
Default output looks like this:
You can see that most date values have a pattern of only numeric characters, but 33 records have non-alphanumeric characters (a date separator) that need to be removed for consistency.
SAS Anti-Money Laundering Solution runs scenarios that analyze entities and transactions to detect suspicious activity and generate alerts. All related alerts are aggregated to an entity level (e.g., party or household) so that analysts and investigators can get a complete picture of an entity and all its associated alerts before a decision is made about the suspicious activity/behavior.
An entity resolution process will consolidate identical and near-identical party records into a single party. All related transactions will hence be assigned to the correct party and included in the scenarios that detect potential suspicious activities. Without an ER process, those transactions could be split between two or more parties and fall below the threshold levels that trigger alerts. This again might lead to productive suspicious activity reports (SARs) not being filed.
A high-level overview of the steps in an entity resolution process can be found below. At Zencos we develop and customize unique solutions for each of our clients depending on their needs.
Step one will often take place after you have profiled your data and know what issues need to be dealt with. Different data sources need to be reconciled, different fields might need to be combined and individual fields might need to be transformed and enriched. Examples of this might be combining a sender or receiver address to a consolidated party address field or changing a numeric date field to a character field so it can be used with the ‘DQMATCH’ procedure (discussed below).
If your sources aren’t organized and reliable, no solution will be able to ensure BSA compliance under an avalanche of false positives. There are solutions to avoid those false positives that overwhelm your AML efforts. You need to get your data under control so the results from your AML efforts are clear, meaningful, and actionable by your investigators.
The second step in the entity resolution process is to standardize the fields you will use. This not only improves match results, but should be done regardless if you want to increase the readability and consistency of your data. The ‘DQSTANDARDIZE’ function in SAS returns a character value that is standardized in the appropriate case, spelling, and format of certain words and abbreviations, with insignificant blank spaces and punctuation removed.
In the code below you select a specific ‘standardization definition’ (the second argument in the function) to standardize the appearance of data values for a selected field. Here are some examples of inputs and outputs of the standardization function:
data output; set input; party_name = dqstandardize(party_name,'name'); street_address = dqstandardize(party_address,'address'); street_city = dqstandardize(street_city,'city'); street_state = dqstandardize(street_city,'state/province'); phone_number = dqstandardize(phone_number,'phone'); run;
After standardizing your data, you can move on to the next step: matching the input records. The QKB contains match definitions used as a facility for performing context-specific fuzzy matching for common types of data, such as names, addresses and phone numbers. A match definition reads an input string and creates a corresponding “matchcode,” which is a fuzzy representation of the input string, and reflects the relative similarity of data values.
The ‘DQMATCH’ procedure generates a cluster number for input values that have identical matchcodes. A specified sensitivity level determines the amount of information in the matchcodes. The amount of information in the matchcode decides the number of clusters and the number of entries in each cluster. Higher sensitivity levels produce fewer clusters with fewer entries per cluster. Lower sensitivity levels are used to capture values that use different spellings to convey (or hide) the same information. The ER process becomes an iterative process of adjusting sensitivity levels and selecting different candidates for match codes and assessing false positives and false negatives with each iteration.
The code below shows an example of how this can be set up (in real life you would use more conditions to improve the clustering):
proc dqmatch data=work.inndata cluster=cluster_ID no_cluster_blanks out=work.matched_parties; criteria condition=1 var=party_name matchdef='name' sensitivity=85 matchcode=mc3; criteria condition=1 var=street_address matchdef='address' sensitivity=75 matchcode=mc4; criteria condition=1 var=street_city matchdef='city' sensitivity=85 matchcode=mc5; criteria condition=1 var=street_state matchdef='state/province' sensitivity=95 matchcode=mc6; criteria condition=2 var=party_name matchdef='name' sensitivity=85 matchcode=mc7; criteria condition=2 var=date_of_birth matchdef='Date (YMD)' sensitivity=80 matchcode=mc8; criteria condition=2 var=street_city matchdef='city' sensitivity=85 matchcode=mc9; run;
Here you are clustering on two different conditions (a logical OR is applied between) with several criteria statements for each condition (a logical AND is applied between). In the example above, you are first clustering on name, address, city and state; and then again on name, date of birth and city. The result might look something like below (constructed result). The ‘Cluster_ID’ column shows that the transactions have been matched – they have the same ID:
The ‘DQMATCH’ syntax has many optional arguments that can be used to achieve a result that fits data and requirements for organizations of any size. It is important to use experienced consultants to achieve the best results. This is true for the remaining steps as well. There are many pitfalls in the entity resolution process; understanding your data and how to handle different variations in an AML/fraud setting is essential.
The incoming transactions have now been clustered, but note that existing party records were not considered. This can be done, but it depends on your data. It is essential that you do not re-cluster existing parties to avoid orphaned records in your party dimension. An alternative is to use the matchcode generated for incoming records to join/merge with existing matchcodes in the party dimension table. This way you will find existing parties and get the party numbers so they can be added to new transactions for the same/associated parties. New parties (no match found in existing parties) can then be assigned a new party number.
The last step in the entity resolution process is to solve the ‘master record’ (or golden record) for all incoming clustered records. Deciding what should become the master record can be done in several ways based upon your needs and preferences. For example, you may choose to select the value that occurs the most for each field (see example below) or just the most recent record.
In addition to creating a master record, consolidation of party entities into a household dimension is quite common. Note that household can be freely defined in the SAS AML solution – it does not need to be parties that share the same address. The household dimension can be used to group parties together that share some common element, such as the same phone number or the same ID number. This can be useful either when a single person manipulates a name and address in such a way that the entity resolution process for parties does not create a match or when several people are involved using a common identifier.
The combination of party and household entity resolution allows businesses to monitor for parties that are using slight variations of name and address spellings to avoid detection as well as parties that are using completely different information but the same ID or phone number.
Uncovering money laundering, financial crimes and related suspects requires organizations to correctly identify their customers. This can be tricky when the parties try to manipulate the information they leave behind, such as names and addresses. A robust entity resolution process built as part of a larger data management framework is an important step in revealing the true identity of the customers involved in suspicious activity.
Zencos has years of experience building solutions uncovering financial crimes using SAS. We help organizations of any size find both the usual suspects … and the ones you never thought of.