Rolling your own OFAC search
When you do anything with USD payments, one of the first things your CCO will tell you to do is screen the names in payments against OFAC lists. Normally, your compliance team will just configure this as a rule in whatever third-party monitoring tool you're using and be done with it, but what if you want to roll your own?
Scope
Functional
The OFAC search website is a baseline that I wanted to replicate.
We take as input a name and an entity type (individual / legal entity, etc.), choose a confidence score from 0 to 100 (100 being an exact match; usually this is set around 80-85), and search for it across the entries in the OFAC lists: the SDN (Specially Designated Nationals) list and the non-SDN Consolidated List.

Non-functional
I want this to be reasonably fast, ideally under 200ms per search. I once tested a brute-force implementation of OFAC search in Node.js, and a single search of both the SDN & Consolidated lists took around 1.5-2s.
I also want this to live entirely within this site, which itself is written in Elixir. Now, Elixir is pretty comparable to Node.js in performance, so to go faster we'd have to go native. You can define bindings to native code in Elixir using NIFs, so I decided to offload the hottest paths to Rust.
How does OFAC search even work?
The FAQ provides some details:
Sanctions List Search then uses two matching logic algorithms, and two matching logic techniques to calculate the score. (...)
The first technique involves using the Jaro-Winkler algorithm to compare the entire name string entered against full name strings of potential match entries on OFAC's sanctions lists.
The second technique involves splitting the name string entered into multiple name parts (for example, John Doe would be split into two name parts). Each name part is then compared to name parts on all of OFAC's sanctions lists using the Jaro-Winkler and Soundex algorithms. The search calculates a score for each name part entered, and a composite score for all name parts entered.
Sanctions List Search uses both techniques each time the search is run and returns the higher of the two scores in the Score column.
Ok, so we have a combination of fuzzy and phonetic matching. It mentions both the Jaro-Winkler and Soundex algorithms.
Unfortunately, diving deeper into the FAQs reveals that there was an additional algorithm change in 2021 that introduced a "new algorithm to the tool's fuzzy logic search functionality", and the exact details are unknown.
V1 - simple Jaro-Winkler
On the other hand, it's rare to come across a ticket that provides as much detail as the OFAC FAQ does, so I can't complain. Let's start with the first algorithm it mentions: Jaro-Winkler.
Preprocessing
Let's say we need to screen the extreme case, the esteemed "Dr. José García III" in all his glory. We need to clean up both the OFAC entries and the search input itself. To do that, I used the following pipeline:
- lowercase - "dr. josé garcía iii"
- unicode normalization - Unicode lets you represent the same character in multiple ways; for example, é can be one codepoint or two (e + accent mark). I applied NFKD normalization, which expands everything into its simplest pieces, so two strings that look identical can actually match in code regardless of how they were encoded. "dr. josé garcía iii" gets his é decomposed: é → e + ◌́
- diacritics removal - strip the accent marks (and all those ą and ę), leaving the doctor as "dr. jose garcia iii"
- transliteration - converts non-Latin scripts into an ASCII equivalent, using the any_ascii crate. Dr. García is not affected by this, but we're prepared for Владимир → Vladimir
- punctuation removal - "dr jose garcia iii"
- title removal - "jose garcia iii", no longer a dr.
- suffix removal - "jose garcia", no longer the 3rd
- whitespace cleanup - a no-op here, but any stray spaces are trimmed and duplicates merged into one
So we reduced the esteemed "Dr. José García III" to the less esteemed "jose garcia", but we'll make up for the lost titles with ease of comparison.
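The real pipeline lives in Elixir/Rust, but the steps above can be sketched in stdlib Python. The title and suffix lists here are short, made-up examples; a real screening pipeline would use a curated set, and the transliteration step (any_ascii in the Rust version) is omitted since it needs an external library.

```python
import re
import unicodedata

# Hypothetical title/suffix lists, for illustration only
TITLES = {"dr", "mr", "mrs", "ms", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def preprocess(name: str) -> str:
    s = name.lower()                                  # lowercase
    s = unicodedata.normalize("NFKD", s)              # expand é -> e + ◌́
    s = "".join(c for c in s if not unicodedata.combining(c))  # strip diacritics
    # (transliteration of non-Latin scripts, e.g. Владимир -> Vladimir,
    # is handled by the any_ascii crate in the Rust version; skipped here)
    s = re.sub(r"[^\w\s]", " ", s)                    # punctuation removal
    tokens = [t for t in s.split() if t not in TITLES and t not in SUFFIXES]
    return " ".join(tokens)                           # whitespace cleanup

print(preprocess("Dr. José García III"))  # -> "jose garcia"
```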
Jaro Winkler
The first fuzzy algorithm, Jaro-Winkler, compares 2 strings and gives us a score between 0 and 1 indicating how similar they are: 0 being completely different and 1 identical.
The base of that is the Jaro score - it looks at three things:
- How many characters match
- How many characters are out of order
- The lengths of both strings
The Winkler part of JW is a later modification that adds a "prefix bonus": strings that share the same starting characters get a score boost. This makes the algorithm "front-biased": "test" vs "xest" scores lower than "test" vs "tesx", even though both differ by just one character.
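To make the mechanics concrete, here's a plain-Python reference implementation of Jaro-Winkler (the production path uses the strsim crate instead); the 0.1 prefix weight and 4-character prefix cap are the standard Winkler parameters:

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1   # how far apart chars may match
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):          # find matching characters
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, j = 0, 0                         # count out-of-order matched chars
    for i in range(len1):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2                             # each transposition counted twice
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    score = jaro(s1, s2)
    prefix = 0                          # shared prefix, capped at 4 chars
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)

# the prefix bonus makes it front-biased:
print(jaro_winkler("test", "tesx") > jaro_winkler("test", "xest"))  # -> True
```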
Elixir has its own Jaro similarity function (String.jaro_distance/2), but for now we'll go native and use jaro_winkler from the strsim crate. So we take the list, run pre-processing on it, then run the JW comparison, and return any matches scoring above the threshold. Here's a demo of that:
DEMO: OFAC v1 — Plain Jaro-Winkler
Pretty good! You can find Vladimir Putin easily, and it will also find "Vladim Ptin". Thanks to all our pre-processing steps it also catches the Cyrillic spellings: "Владимир Путин" and "Владимiр Путин".
How about "Putin Vladimir"? Well, in this case we get a big fat 0. The OFAC search website handles it just fine:

Even "Putin" by itself produces multiple 100-point matches:

As stated in the FAQ, the OFAC website also does something with name parts, not just full names, so let's try to replicate that.
V2 - Token Jaro-Winkler
OFAC search compares each token in the name separately, and then produces a "composite score" out of them. To replicate that, we can tokenize both sides of the target / query, and for each query token find the best JW match among the target tokens.
This should fix both name-part reordering (tokens are matched individually, regardless of order, so "Putin Vladimir" works as well as "Vladimir Putin") and partial queries ("putin" matches the "putin" token in "vladimir putin" directly).
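A minimal sketch of this token-level matching, combined with a simple composite (the max of the full-string score and the average of per-token best matches). To keep the sketch dependency-free, difflib's ratio stands in for Jaro-Winkler here; swap in a real JW function for the actual behavior:

```python
from difflib import SequenceMatcher

# Stand-in similarity: difflib's ratio instead of Jaro-Winkler,
# used only so this sketch runs with the stdlib alone.
def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def composite_score(query: str, target: str) -> float:
    full = sim(query, target)                      # v1 full-string path
    q_tokens, t_tokens = query.split(), target.split()
    if not q_tokens or not t_tokens:
        return full
    # for each query token, take its best match among the target tokens
    best = [max(sim(q, t) for t in t_tokens) for q in q_tokens]
    token_avg = sum(best) / len(best)
    return max(full, token_avg)                    # higher path wins

print(composite_score("putin vladimir", "vladimir putin"))  # -> 1.0
print(composite_score("putin", "vladimir putin"))           # -> 1.0
```

Because each query token is scored against its best counterpart, reordered and partial names hit 1.0 on the token path even when the full-string path scores poorly.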
Now all that's left is the "composite score". We'll do the simplest thing here: the final score will be max(full_string_jw, average_of_best_token_matches), so the full-string path from v1 is still there as a fallback. As naive as it gets. Here's the demo: