Finding Matches in Two Large Lists of Strings

ggraham412 (#1) wrote:
Hi, I have two long lists (>50,000 names each) that must be periodically checked for possible matches between them. I've coded up several fast algorithms for computing edit distances (e.g., Levenshtein), but the size of the lists still makes the check very time-costly. Is there a fast function F(S) that can be computed offline on single strings and used to cluster them, so that two strings far apart in F are also far apart in string distance, letting me shrink the set that must be checked exactly? For example, if my criterion for matching is Leven(s1, s2) < N, then I know that |Length(s1) - Length(s2)| < N, so if I find a length difference >= N I won't even bother running Levenshtein. I Googled once and came up with suggestions to use Hilbert curves or Z-ordering, and it made my head hurt. But there's gotta be something better than just length... Thanks a million!

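A minimal sketch of the length-difference prefilter the post describes, assuming Python; the names find_matches and by_length are illustrative, and levenshtein is a standard two-row dynamic-programming implementation:

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def find_matches(list_a, list_b, n):
    """Report pairs (a, b) with levenshtein(a, b) < n, skipping pairs whose
    length difference already rules out a match (|len(a) - len(b)| >= n)."""
    by_length = defaultdict(list)
    for b in list_b:
        by_length[len(b)].append(b)
    matches = []
    for a in list_a:
        # Only lengths within n-1 of len(a) can possibly satisfy the criterion.
        for length in range(len(a) - n + 1, len(a) + n):
            for b in by_length.get(length, ()):
                if levenshtein(a, b) < n:
                    matches.append((a, b))
    return matches
```

Bucketing by length means each name is compared only against the 2N-1 neighboring length buckets instead of the whole second list, which is exactly the "cluster by F(S)" idea with F = Length.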

Alan Balkany (#2) wrote:

      There's a much simpler solution: Use a hash table. Insert all strings from the larger list into a hash table with at least twice as many slots as strings. Then check to see if each string from the smaller list is in the hash table. Hash tables trade space for speed and are very fast. If you want more speed, just increase the number of slots in the table.

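A minimal sketch of this approach in Python, where the built-in set plays the role of the hash table (it sizes and resizes itself, so the slot-count advice is handled automatically). Note that this finds exact matches only:

```python
def exact_matches(larger_list, smaller_list):
    """Build a hash table (a Python set) from the larger list, then probe it
    with each name from the smaller list; each lookup is O(1) on average."""
    table = set(larger_list)
    return [name for name in smaller_list if name in table]
```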

supercat9 (#3) wrote:

I'm not sure there's a good method if one is seeking to find all strings in one set which are within some Levenshtein distance of any string in another set, unless the only distances of interest are small (preferably one, maybe two). If the strings in question are names, you could hash each one into a 26-bit integer identifying which letters it contains. If desired, some letters could be ignored, or other letters folded together, so as to yield, e.g., a 16-bit integer. If two names are within a small edit distance (e.g., 2), their hash values can differ in only a few bits, since a single edit changes at most two bits of the hash. One could produce a list of all near-collisions among the 65,536 different integer hash values. There would be many such collisions, but one might still save some time versus a brute-force comparison of everything.

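A minimal sketch of the letter-presence hash in Python. The filter uses the conservative bound that one edit changes at most two mask bits (a substitution can remove one letter and introduce another), so names within edit distance N differ in at most 2*N mask bits; the names letter_mask and could_match are illustrative:

```python
def letter_mask(name: str) -> int:
    """26-bit mask: bit i is set when chr(ord('a') + i) occurs in the name."""
    mask = 0
    for ch in name.lower():
        if 'a' <= ch <= 'z':
            mask |= 1 << (ord(ch) - ord('a'))
    return mask

def could_match(mask1: int, mask2: int, n: int) -> bool:
    """Prefilter: reject a pair only when the masks differ in more bits than
    n edits could possibly change (2 bits per edit)."""
    return bin(mask1 ^ mask2).count("1") <= 2 * n
```

The masks can be computed offline and stored alongside the names; Levenshtein then runs only on pairs that survive could_match.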

Moreno Airoldi (#4) wrote:

          As others said, I really think a hash table is the fastest solution to your problem. Not sure if there is some more specific approach for this case. :)

          2+2=5 for very large amounts of 2 (always loved that one hehe!)


PIEBALDconsult (#5) wrote:

            It's unclear. Do you have two lists of names (List<string>)? Or two strings that contain multiple names? If the former, I'd make one into a HashSet and then see if the members of the other appear in it. If the latter, I'd Split them and proceed as above.

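A minimal sketch of the second case, assuming each list arrives as one whitespace-delimited string (kept in Python for consistency with the sketches above, rather than the C# types the post mentions):

```python
def matches_from_delimited(names_a: str, names_b: str) -> set:
    """Split each delimited string into individual names (split() with no
    argument splits on any run of whitespace), then intersect the two sets
    for exact matches."""
    return set(names_a.split()) & set(names_b.split())
```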

Tadeusz Westawic (#6) wrote:

Way, way too long ago I worked for a state DMV, and they used something called the "Soundex code" to hash surnames into 5-character alphanumeric strings. Similar-sounding names cluster around the same hash values. You can find the algorithm on Wikipedia.

Tadeusz Westawic: An ounce of Clever is worth a pound of Experience.

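A minimal sketch of classic American Soundex in Python (the textbook code is one letter plus three digits; the 5-character DMV variant mentioned above presumably padded or extended this), usable as a blocking key: group names by code and run edit distance only within each group.

```python
def soundex(name: str) -> str:
    """American Soundex: first letter + three digits. Vowels and y reset a
    run of equal codes; h and w do not break such a run."""
    mapping = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            mapping[ch] = digit
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":                 # h/w are skipped without ending the run
            continue
        code = mapping.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code                    # vowels (empty code) reset the run
    return (first + "".join(digits) + "000")[:4]
```

For example, soundex("Robert") and soundex("Rupert") both return "R163", so the two names land in the same bucket and would be compared; names with different codes are never compared at all.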