Hi, I have two long (>50000) lists of names that must be periodically checked for possible matches between them. I've coded up several fast algorithms for computing edit distances (Eg- Levenshtein) but the size of the lists still makes it very time costly. Is there any fast function F(S) that people compute offline on single strings that you could cluster them by, so that two strings far apart in F are also far apart in string distance, and one could reduce the size of the set that must be checked exactly? For example, if my criterion for matching is Leven(s1, s2) < N, then I know that |Length(s1) - Length(s2)| < N, and if I find a length difference >= N I won't even bother running Levenshtein. I Googled once and came up with suggestions to use Hilbert curves or Z-ordering, and it made my head hurt. But there's gotta be something better than just length... Thanks a million!
ggraham412
Posts
-
Finding Matches in Two Large Lists of Strings -
Hard to believe this was in the Wall Street JournalWould you bank somewhere if you knew bank employees were uploading your mortgage application to Google so they could work on it at home over the weekend? Those "pricks" in the IT department are often concerned with implementing policies that protect customers' personal information, avoiding bad press for the company, and doing it in a way that is both efficient and verifiable to internal auditors. Do the "user's needs" you allude to above include a regular paycheck along with unfettered net usage for geniuses at work? Sorry, but as a developer in a small IT department, I get to see both sides. And the WSJ is incredibly irresponsible not simply because it puts ideas in the heads of non-experts, but because it brings said ideas down to the level of "common advice" and "everyone's doing it". I think many people who had a toe in the water before will be encouraged to go all in after reading it. Given that most non-technical users can't even configure LimeWire correctly so that it doesn't share every single file on their system (as widely reported recently) why would anyone publish a non-technical advertisement for using proxies, file sharing services, etc?
-
City Politics --- think of the children!Yeah, A local lumberyard near my house wanted to expand, but they needed a zoning change to do it. Adjacent homeowners complained, but we were completely disorganized and made a hash out of the arguments, which ranged from the reasonable "We're OK with it if you modify the proposal to minimize impact, etc." to the hysterical "You're hurting our children and ruining our neighborhood!" After the homeowners had their say at a town meeting, the lumberyard owners stood up to say that if they didn't get all of the requested zoning variances, they were leaving town. Guess what? They got all of the variances with no changes. So much for equal protection of the law. And our tax bill went up.
-
Best C++ Book to get?Those are all great books and have a place in my library, so I'm going to mention an oddball choice: C+C++: Programming With Objects in C and C++ by Allen Holub (McGraw-Hill, 1992). If you happen to have a background in C, this is a great choice because it actually goes into a lot of the details of what is actually happening in a C++ program from a C perspective. A little dated perhaps, but take a look.