Spell Checker - array list speed improvements
-
I recently needed a spell checker for one of my applications, and after trying to find one that existed already, DLL, ActiveX etc, and using OLE and Word to do things, I decided to make one myself. I started, but after experimenting, decided I needed a dictionary first to test results accurately, so for the last four months I've been compiling one (almost 75,000 words - UK English). I'm using a CObArray to store the words, and this was fine to begin with until I got to around 15,000 words, and then things slowed down drastically. What I'm basically doing is finding each word in a text box, then going through the dictionary list (the CObArray) and see if it exists, otherwise try and find words that start with the same characters from the left. It's so slow though, it's completely unusable, and the intelligence is ridiculous (I also need to do things like vowel swapping, etc.) I'd just like to know what the best way speed wise is of using an array or list - I've never had any problems before, but that was will arrays < 20,000. I'm using MFC (CStrings and all that), so that's probably half the problem, but even trying chars and strncmp etc. didn't really make that much of a difference. I was going to post the project to CodeProject when I'd finished, but since it's slow, doesn't work, and the code probably needs a complete re-write and I haven't got the expertise in that area, is it worth just submitting the code and project anyway, and everyone has a look and tries to improve it? Regards, Peter Peason
G'day Peter, I know next to nothing about spell checker implementation and you may already know about this stuff, but here goes... I suspect that some spell checkers may use a Soundex[1] algorithm (or variation thereof) to encode their master word list. When checking the spelling on an individiual word, you'd generate a soundex code for the check word, and look for words with matching soundex codes in your master word list. This soundex comparison search may turn up lots of 'hits', but you could then refine matters bu searching the soundex hit list with a more exact word comparison. [1] Soundex encodes the 'sound' of words by sripping out the vowels and representing similar sounding consonants with numerics (i.e 'd' and 't' are represented by the numeral 3) . A search on Google for 'soundex algotithm' will turn up plenty of pages detailing the original Soundex algorithm and several refined versions. Hope this helps, Steve -------------------------------------- Steve Driessens
-
I recently needed a spell checker for one of my applications, and after trying to find one that existed already, DLL, ActiveX etc, and using OLE and Word to do things, I decided to make one myself. I started, but after experimenting, decided I needed a dictionary first to test results accurately, so for the last four months I've been compiling one (almost 75,000 words - UK English). I'm using a CObArray to store the words, and this was fine to begin with until I got to around 15,000 words, and then things slowed down drastically. What I'm basically doing is finding each word in a text box, then going through the dictionary list (the CObArray) and see if it exists, otherwise try and find words that start with the same characters from the left. It's so slow though, it's completely unusable, and the intelligence is ridiculous (I also need to do things like vowel swapping, etc.) I'd just like to know what the best way speed wise is of using an array or list - I've never had any problems before, but that was will arrays < 20,000. I'm using MFC (CStrings and all that), so that's probably half the problem, but even trying chars and strncmp etc. didn't really make that much of a difference. I was going to post the project to CodeProject when I'd finished, but since it's slow, doesn't work, and the code probably needs a complete re-write and I haven't got the expertise in that area, is it worth just submitting the code and project anyway, and everyone has a look and tries to improve it? Regards, Peter Peason
-
I recently needed a spell checker for one of my applications, and after trying to find one that existed already, DLL, ActiveX etc, and using OLE and Word to do things, I decided to make one myself. I started, but after experimenting, decided I needed a dictionary first to test results accurately, so for the last four months I've been compiling one (almost 75,000 words - UK English). I'm using a CObArray to store the words, and this was fine to begin with until I got to around 15,000 words, and then things slowed down drastically. What I'm basically doing is finding each word in a text box, then going through the dictionary list (the CObArray) and see if it exists, otherwise try and find words that start with the same characters from the left. It's so slow though, it's completely unusable, and the intelligence is ridiculous (I also need to do things like vowel swapping, etc.) I'd just like to know what the best way speed wise is of using an array or list - I've never had any problems before, but that was will arrays < 20,000. I'm using MFC (CStrings and all that), so that's probably half the problem, but even trying chars and strncmp etc. didn't really make that much of a difference. I was going to post the project to CodeProject when I'd finished, but since it's slow, doesn't work, and the code probably needs a complete re-write and I haven't got the expertise in that area, is it worth just submitting the code and project anyway, and everyone has a look and tries to improve it? Regards, Peter Peason
Is your array of words sorted? If not, then nothing you do will help the speed, because every time you search the list, you'll have to do a brute-force search. If you do have the words ordered, put them in a binary tree instead of a flat array. That greatly speeds up searches, and it's easy to add new words, whereas it's _very_ expensive to add something in the middle of an ordered array. I wrote a spell checker for myself back in college to teach myself some C++, but it just did lookups, not suggestions or letter-matching like yours. I used a simple hash table, but a hash table isn't ordered, so that wouldn't be a good solution for you. As an aside, if you have access to a Unix system, grab its /usr/dict/words file - it's a word list. :) I'm sure there are also word lists out on the net, if you look (or search Google). --Mike-- http://home.inreach.com/mdunn/ "That probably would've sounded more commanding if I wasn't wearing my yummy sushi pajamas." --Buffy
-
Is your array of words sorted? If not, then nothing you do will help the speed, because every time you search the list, you'll have to do a brute-force search. If you do have the words ordered, put them in a binary tree instead of a flat array. That greatly speeds up searches, and it's easy to add new words, whereas it's _very_ expensive to add something in the middle of an ordered array. I wrote a spell checker for myself back in college to teach myself some C++, but it just did lookups, not suggestions or letter-matching like yours. I used a simple hash table, but a hash table isn't ordered, so that wouldn't be a good solution for you. As an aside, if you have access to a Unix system, grab its /usr/dict/words file - it's a word list. :) I'm sure there are also word lists out on the net, if you look (or search Google). --Mike-- http://home.inreach.com/mdunn/ "That probably would've sounded more commanding if I wasn't wearing my yummy sushi pajamas." --Buffy
You beat me by seconds to the "use a tree!" comment. Ah well... Posting the spell checker to CodeProject would be way cool. If you make it clear it's a work in progress you will be surprised at the number of readers who will grab the code and run with it to get it up to spec. cheers, Chris Maunder
-
Is your array of words sorted? If not, then nothing you do will help the speed, because every time you search the list, you'll have to do a brute-force search. If you do have the words ordered, put them in a binary tree instead of a flat array. That greatly speeds up searches, and it's easy to add new words, whereas it's _very_ expensive to add something in the middle of an ordered array. I wrote a spell checker for myself back in college to teach myself some C++, but it just did lookups, not suggestions or letter-matching like yours. I used a simple hash table, but a hash table isn't ordered, so that wouldn't be a good solution for you. As an aside, if you have access to a Unix system, grab its /usr/dict/words file - it's a word list. :) I'm sure there are also word lists out on the net, if you look (or search Google). --Mike-- http://home.inreach.com/mdunn/ "That probably would've sounded more commanding if I wasn't wearing my yummy sushi pajamas." --Buffy
Yes, it is sorted, but the suggestion code I'm using is crap, and it needs completely re-writing. I've submitted an article to CodeProject, there's a fair amount of documentation there, and I've also included the dictionary file. I got most of the dictionary from Unix versions of open source spell checker libraries actually, but I've checked through most of it and added quite a few words. Cheers, Peter
-
Yes, it is sorted, but the suggestion code I'm using is crap, and it needs completely re-writing. I've submitted an article to CodeProject, there's a fair amount of documentation there, and I've also included the dictionary file. I got most of the dictionary from Unix versions of open source spell checker libraries actually, but I've checked through most of it and added quite a few words. Cheers, Peter
>>I got most of the dictionary from Unix versions of open source spell checker libraries actually Are you sure you're not violating the license for the opensource project?
-
>>I got most of the dictionary from Unix versions of open source spell checker libraries actually Are you sure you're not violating the license for the opensource project?
Not that I know of, but It's worth a check. Regards, Peter
-
>>I got most of the dictionary from Unix versions of open source spell checker libraries actually Are you sure you're not violating the license for the opensource project?
Not that I know of, but It's worth a check. Regards, Peter
-
>>I got most of the dictionary from Unix versions of open source spell checker libraries actually Are you sure you're not violating the license for the opensource project?
Right. I got the majority of the words from here: http://dosware.nfo.sk/txtms02.htm#words but I took quite a few words out that were specific or that I couldn't find elsewhere. The readme file it came with contains this: Dallas, TX 6-Sep-91 This is a list of over 100,000 English words transcribed orthographically. I obtained it from The Interociter bulletin board in Dallas (214/258-1832). The original read.me file said that the list came from Public Brand Software. The original list contained 146,440 words, but I discovered that there were thousands of duplicate words. I resorted the list and removed the duplicates using the Unix utility uniq. The total number of words is now 109,582. I have repackaged the list into four files (the original was five): File Bytes Words Range --------- ------ ----- ----- words1.lst 315376 29839 A-D words2.lst 242484 23101 E-K words3.lst 325716 30439 L-R words4.lst 270759 26203 S-Z ---------------- Total 1154335 109582 This word list includes inflected forms, such as plural nouns and the -s, -ed and -ing forms of verbs. Thus the number of lexical stems represented in the list is considerably smaller than the total number of words. Evan Antworth Academic Computing Department Summer Institute of Linguistics 7500 W. Camp Wisdom Road Dallas, TX 75236 U.S.A. Internet: evan@sil.org UUCP: ...!uunet!convex!txsil!evan phone: 214/709-2418 fax: 214/709-3387 ---- Do you recon it's okay to use/post a dictionary I've compiled mainly from this source or not? It doesn't say anything about not using them, it even says it was taken from another source in the first place. Regards, Peter
-
G'day Peter, I know next to nothing about spell checker implementation and you may already know about this stuff, but here goes... I suspect that some spell checkers may use a Soundex[1] algorithm (or variation thereof) to encode their master word list. When checking the spelling on an individiual word, you'd generate a soundex code for the check word, and look for words with matching soundex codes in your master word list. This soundex comparison search may turn up lots of 'hits', but you could then refine matters bu searching the soundex hit list with a more exact word comparison. [1] Soundex encodes the 'sound' of words by sripping out the vowels and representing similar sounding consonants with numerics (i.e 'd' and 't' are represented by the numeral 3) . A search on Google for 'soundex algotithm' will turn up plenty of pages detailing the original Soundex algorithm and several refined versions. Hope this helps, Steve -------------------------------------- Steve Driessens
Hi, Thanks for the information - I notice that Matt Gullett also posted an article on using this type of algorithm in spell checkers, so I'll have a look around. Regards, Peter