Fast Name Matching

Tags: database, performance, csharp, c++, asp-net
    RichardGrimmer wrote (#1):

    I'm after suggestions for algorithms, products, or anything else in the arena of high-performance name matching.

    Background: we allow our clients to upload a list of names to our systems, and we then screen those names against a list we hold on our side, looking for matches. Ballpark figures are ~500K uploaded records being compared against >5M on our side.

    We currently have a couple of ways of doing this, from a SQL DB and Entity Framework (no giggling at the back) to a full-on in-memory dictionary with all possible parts of every name broken out and indexed (I said no giggling).

    I'm looking for high-performance, high-accuracy matching. It would be nice if I could also do things like distance matching (probably using Levenshtein), but the core matching needs to be as fast as possible. I'm considering an in-memory SQL instance (not ideal), Apache Solr (not sure, never used it), etc., and I was wondering whether any of you have experience and could make some recommendations.
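For readers unfamiliar with the two ideas mentioned above, here is a minimal sketch of an in-memory index keyed on name parts, combined with a Levenshtein check on the shortlisted candidates. It is written in Python for brevity and is not the poster's actual implementation; the function names, the whitespace tokenisation and the `max_distance` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character from a
                               current[j - 1] + 1,       # add a character to a
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def build_token_index(names):
    """Map each lower-cased name part to the set of full names containing it."""
    index = defaultdict(set)
    for name in names:
        for token in name.lower().split():
            index[token].add(name)
    return index


def screen(candidate, index, max_distance=1):
    """Shortlist names sharing any part with the candidate, then filter by edit distance."""
    shortlist = set()
    for token in candidate.lower().split():
        shortlist |= index.get(token, set())
    return sorted(n for n in shortlist
                  if levenshtein(candidate.lower(), n.lower()) <= max_distance)
```

The point of the index is that the relatively expensive distance calculation only runs against names that already share a part with the candidate, not against all of the >5M reference records.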

    C# has already designed away most of the tedium of C++.

      Lost User wrote (#2):

      Is Curaçao the same name as Curacao? Should the matching be case-sensitive? For an ASCII comparison, I'd compare ordinals, with the list partitioned by string length.
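One way to make both questions concrete is sketched below, assuming the answer to each is "treat them as the same name": fold away diacritics and case first, then bucket the folded names by length so a lookup only touches names of the same length. The helper names are hypothetical, and Python set membership stands in for an ordinal (code point by code point) comparison.

```python
import unicodedata
from collections import defaultdict


def fold(name: str) -> str:
    """Strip diacritics and case so that 'Curaçao' and 'curacao' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def partition_by_length(names):
    """Group folded names into sets keyed by their length."""
    buckets = defaultdict(set)
    for name in names:
        key = fold(name)
        buckets[len(key)].add(key)
    return buckets


def is_listed(candidate: str, buckets) -> bool:
    """Exact match on the folded form, looking only in the same-length bucket."""
    key = fold(candidate)
    return key in buckets.get(len(key), set())
```

If the two spellings should count as different names, drop the `fold` step and compare the raw strings instead.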

      Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^]

        RichardGrimmer wrote (#3):

        Hi Eddie, thanks for the quick reply. On the various types of match, we have some rules defined around things like aliases, diminutives, etc., so we could express those in a rules engine or, if I have to code this up myself, in C#. I completely agree that if I do end up doing it myself, ordinals etc. are the way to go. I'm not really after advice on the actual matching, though, more the sort of infrastructure I can put around it to make it as fast as possible. To be honest, something "off the shelf" would be ideal lol.

          Lost User wrote (#4):

          Depends on budget and hardware. If you can use multiple computers, you would send your "name_to_check" to those PCs and have each of them check a part of the list. The same principle can work on a single machine, using queues and threads, but the fastest would be to use a *few* dedicated computers :-D

          Edit: italic part added for clarification.
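The single-machine variant of this idea could be sketched roughly as follows: split the reference list into parts, let each worker check the uploaded names against its own part, and merge the results. This is an illustration only; the function names and `shard_count` are assumptions, and in CPython threads will not speed up CPU-bound work, so the code shows the structure of the approach, not its performance. The same structure maps onto separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(reference_names, shard_count):
    """Split the reference list into shard_count roughly equal sets."""
    shards = [set() for _ in range(shard_count)]
    for i, name in enumerate(reference_names):
        shards[i % shard_count].add(name)
    return shards


def check_shard(shard, uploaded_names):
    """One worker's job: which uploaded names appear in this part of the list?"""
    return {name for name in uploaded_names if name in shard}


def parallel_screen(reference_names, uploaded_names, shard_count=4):
    """Check every uploaded name against every shard and merge the partial results."""
    shards = make_shards(reference_names, shard_count)
    matches = set()
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        for partial in pool.map(lambda shard: check_shard(shard, uploaded_names), shards):
            matches |= partial
    return matches
```

Here `uploaded_names` needs to be a list or other re-iterable collection, because every worker walks it once.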

            Keld Olykke wrote (#5):

            A year ago I had a similar problem. We had to save requests on a server, and each request had a GUID/UUID as its id. The solution turned out to be simple and worked fine: we wrote each hex digit of the GUID as a folder, which gave us a folder hierarchy 32 levels deep with 16 branches at each level. You might expect terrible performance, but it was actually quite fast on Linux with a solid-state drive. Another benefit was that maintenance, e.g. backup, was easy with this solution. Maybe you could try something similar.
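The folder layout described here can be sketched in a few lines. The function name and the root directory are hypothetical, and the sketch only builds the path; it does not touch the file system.

```python
import uuid
from pathlib import PurePosixPath


def guid_to_path(root: str, guid: uuid.UUID) -> PurePosixPath:
    """Build a path with one folder per hex digit of the GUID (32 levels)."""
    return PurePosixPath(root, *guid.hex)
```

Applied to names, the equivalent would presumably be one folder per character, which is effectively a trie stored on disk. That reading is an interpretation; the post does not spell it out.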

              Nathan Minier wrote (#6):

              I know that I'm late to the party, but if you're just checking for the existence of the names, why don't you run it against an in-memory hash array? Bootstrap it from your database and sort it; then you should just have to hash the uploaded records and run a binary search for each one. Rainbow tables can be used for good, as well as evil!
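A minimal sketch of that suggestion, assuming a 64-bit hash per name: hash the reference names once, sort the hashes, and binary-search the hash of each uploaded name. The function names are hypothetical, and BLAKE2b is chosen only because it gives the same value on every run; any stable hash would do.

```python
import hashlib
from bisect import bisect_left


def name_hash(name: str) -> int:
    """Stable 64-bit hash of a name, as an integer."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_hash_array(reference_names):
    """Hash every reference name once and sort the hashes."""
    return sorted(name_hash(n) for n in reference_names)


def probably_listed(candidate: str, hashes) -> bool:
    """Binary search the sorted hashes for the candidate's hash."""
    h = name_hash(candidate)
    i = bisect_left(hashes, h)
    return i < len(hashes) and hashes[i] == h
```

A hit only says that the hashes match, so a real system would confirm it against the stored name to rule out a collision. This approach covers exact matches only; distance matching needs something else.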
