Fast Name Matching

RichardGrimmer

I'm after suggestions for algorithms, products or anything else in the arena of high-performance name matching.

Background: we allow our clients to upload a list of names to our systems, and we then screen those names against a list we hold on our side, looking for matches. Ballpark figures are ~500K uploaded records compared against >5M on our side. We currently have a couple of ways of doing this, from a SQL DB and Entity Framework (no giggling at the back) to a full-on in-memory dictionary with all possible parts of every name broken out and indexed (I said no giggling).

I'm looking for high-performance, high-accuracy matching, and it would be nice if I could also do things like distance matching (probably using Levenshtein), but the core matching needs to be as fast as possible. I'm considering an in-memory SQL instance (not ideal), Apache Solr (not sure, never used it), etc., and I was wondering whether any of you have experience and could make some recommendations.
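For reference, the distance matching mentioned above could look something like the following. This is only a minimal sketch of Levenshtein distance in Python (the thread itself is about C#); the function name `levenshtein` and the `max_distance` cut-off parameter are illustrative assumptions, not part of any existing system.

```python
from typing import Optional


def levenshtein(a: str, b: str, max_distance: Optional[int] = None) -> int:
    """Edit distance between a and b.

    If max_distance is given, return max_distance + 1 as soon as the true
    distance is known to exceed it (hypothetical early-exit parameter).
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    if max_distance is not None and len(a) - len(b) > max_distance:
        return max_distance + 1  # length gap alone already exceeds the limit

    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # Row minima never decrease, so this row bounds the final result.
        if max_distance is not None and min(current) > max_distance:
            return max_distance + 1
        previous = current
    return previous[-1]
```

With a small `max_distance`, most non-matching pairs are rejected on the length check or after a few rows, which matters when the candidate list runs into millions of names.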

C# has already designed away most of the tedium of C++.

Lost User

Is Curaçao the same name as Curacao? Should the matching be case-sensitive? For an ASCII comparison, I'd compare ordinals, with the list partitioned by string length.
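One way to picture the questions above, as a rough Python sketch: it assumes the answer is "yes, treat Curaçao and Curacao as the same, and ignore case", which is a policy choice and not something stated in the thread. The helper names `normalise`, `build_index` and `is_match` are made up for illustration.

```python
import unicodedata
from collections import defaultdict


def normalise(name: str) -> str:
    """Strip accents and case so that 'Curaçao' and 'CURACAO' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold().strip()


def build_index(names):
    """Partition the normalised names into buckets keyed by string length."""
    buckets = defaultdict(set)
    for name in names:
        key = normalise(name)
        buckets[len(key)].add(key)
    return buckets


def is_match(index, candidate: str) -> bool:
    """Exact lookup that only inspects the bucket of the matching length."""
    key = normalise(candidate)
    return key in index.get(len(key), ())
```

After normalisation the strings are plain and comparable code point by code point, which is the Python counterpart of an ordinal comparison. The length buckets matter more for a linear scan or a distance search than for the set lookup shown here.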

Bastard Programmer from Hell. If you can't read my code, try converting it here. (X-Clacks-Overhead: GNU Terry Pratchett)

RichardGrimmer

Hi Eddie, thanks for the quick reply. On the various types of match, we have some rules defined around things like aliases, diminutives etc., so we could express those in a rules engine or, if I have to code this up myself, in the C#. Completely agree that if I do end up doing it myself, ordinals etc. are the way to go. I'm not really after advice on the actual matching, more the sort of infrastructure I can put around it to make it as fast as possible. To be honest, if there's something "off the shelf", that would be ideal lol.

Lost User

Depends on budget and hardware. If you can use multiple computers, you would send your "name_to_check" to those PCs and have each of them check a part of the list. The same principle can work on a single machine, using queues and threads, but the fastest would be to use a *few* dedicated computers :-D

Edit: italic part added for clarification.
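The single-machine version of that idea might be sketched as below, in Python for brevity. The names `make_shards` and `screen` are hypothetical, and the round-robin split is just one assumed way to divide the list. Note that in CPython, threads will not speed up CPU-bound lookups like these; the sketch shows the shape of the partitioning only, and each shard could equally live in a separate process or on a separate machine.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(names, shard_count):
    """Split the reference list round-robin into shard_count sets."""
    shards = [set() for _ in range(shard_count)]
    for position, name in enumerate(names):
        shards[position % shard_count].add(name)
    return shards


def screen(uploaded, shards):
    """Check every uploaded name against every shard, one worker per shard."""
    def check_shard(shard):
        return [name for name in uploaded if name in shard]

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(check_shard, shards)

    hits = set()
    for partial in partial_results:
        hits.update(partial)
    return hits
```

Each worker only ever touches its own part of the list, so the shards need no locking and the partial results can simply be merged at the end.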

Keld Olykke

A year ago I had a similar problem. We had to save requests on a server, and each request had an ID (a Guid/UUID). The solution turned out to be simple and worked fine: we wrote each hex digit of the GUID as a folder, which gave a folder hierarchy 32 levels deep with 16 branches at each level. One would think that would perform terribly, but it was actually quite fast on Linux with a solid-state drive. Another benefit was that maintenance, e.g. backup, was easy with this solution. Maybe you could try something similar.
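To make the folder layout concrete, here is a small Python sketch of how a GUID could map to such a path. The function name `guid_to_path` and the `/data` root are assumptions for illustration; the original solution's actual code and paths are not shown in the thread.

```python
import uuid
from pathlib import PurePosixPath


def guid_to_path(root: str, guid: uuid.UUID) -> PurePosixPath:
    """Turn each of the 32 hex digits of the GUID into one folder level."""
    # guid.hex is the 32-character lowercase hex form without dashes.
    return PurePosixPath(root).joinpath(*guid.hex)


example = guid_to_path("/data", uuid.UUID("12345678-1234-5678-1234-567812345678"))
```

Looking up a request is then a single path check with no scanning, since the file system's directory tree acts as a 16-way trie over the ID.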

Nathan Minier

I know that I'm late to the party, but if you're just checking for the existence of the names, why don't you run it against an in-memory hash array? Bootstrap it from your database and sort it; then you should just have to hash each uploaded record and run a binary search for it. Rainbow tables can be used for good, as well as evil!
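A rough Python sketch of that idea follows. The choice of SHA-1 truncated to 64 bits is an assumption made here for compactness, and the names `name_hash`, `build_hash_array` and `contains` are hypothetical. Because truncated hashes can collide, a hit would still want confirming against the real record.

```python
import bisect
import hashlib


def name_hash(name: str) -> int:
    """64-bit integer taken from the first 8 bytes of the SHA-1 digest."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_hash_array(names):
    """Hash every reference name once and keep the hashes sorted."""
    return sorted({name_hash(name) for name in names})


def contains(hash_array, name: str) -> bool:
    """Binary-search the sorted hashes for the hash of the uploaded name."""
    target = name_hash(name)
    position = bisect.bisect_left(hash_array, target)
    return position < len(hash_array) and hash_array[position] == target
```

A sorted array of fixed-width integers is far smaller than a dictionary of strings, and each lookup costs about log2(5M), so roughly 23 comparisons. This only covers exact matches; distance matching would need a different structure.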