Spell Checker - array list speed improvements

Steve Driessens

G'day Peter, I know next to nothing about spell checker implementation and you may already know about this stuff, but here goes... I suspect that some spell checkers may use a Soundex[1] algorithm (or variation thereof) to encode their master word list. When checking the spelling on an individiual word, you'd generate a soundex code for the check word, and look for words with matching soundex codes in your master word list. This soundex comparison search may turn up lots of 'hits', but you could then refine matters bu searching the soundex hit list with a more exact word comparison. [1] Soundex encodes the 'sound' of words by sripping out the vowels and representing similar sounding consonants with numerics (i.e 'd' and 't' are represented by the numeral 3) . A search on Google for 'soundex algotithm' will turn up plenty of pages detailing the original Soundex algorithm and several refined versions. Hope this helps, Steve -------------------------------------- Steve Driessens

Wilka

Programming Pearls 2nd (ISBN 0201657880) has a section that will probably be useful for this (column 13), you should probably take a look if you can.

Michael Dunn

Is your array of words sorted? If not, then nothing you do will help the speed, because every time you search the list, you'll have to do a brute-force search. If you do have the words ordered, put them in a binary tree instead of a flat array. That greatly speeds up searches, and it's easy to add new words, whereas it's _very_ expensive to add something in the middle of an ordered array. I wrote a spell checker for myself back in college to teach myself some C++, but it just did lookups, not suggestions or letter-matching like yours. I used a simple hash table, but a hash table isn't ordered, so that wouldn't be a good solution for you. As an aside, if you have access to a Unix system, grab its /usr/dict/words file - it's a word list. :) I'm sure there are also word lists out on the net, if you look (or search Google). --Mike-- http://home.inreach.com/mdunn/ "That probably would've sounded more commanding if I wasn't wearing my yummy sushi pajamas." --Buffy

Chris Maunder

You beat me by seconds to the "use a tree!" comment. Ah well... Posting the spell checker to CodeProject would be way cool. If you make it clear it's a work in progress you will be surprised at the number of readers who will grab the code and run with it to get it up to spec. cheers, Chris Maunder

Peter Pearson

Yes, it is sorted, but the suggestion code I'm using is crap, and it needs completely re-writing. I've submitted an article to CodeProject, there's a fair amount of documentation there, and I've also included the dictionary file. I got most of the dictionary from Unix versions of open source spell checker libraries actually, but I've checked through most of it and added quite a few words. Cheers, Peter

David Cunningham

>>I got most of the dictionary from Unix versions of open source spell checker libraries actually Are you sure you're not violating the license for the opensource project?

Peter Pearson

Not that I know of, but It's worth a check. Regards, Peter

Peter Pearson

Not that I know of, but It's worth a check. Regards, Peter

Peter Pearson

Right. I got the majority of the words from here: http://dosware.nfo.sk/txtms02.htm#words but I took quite a few words out that were specific or that I couldn't find elsewhere. The readme file it came with contains this: Dallas, TX 6-Sep-91 This is a list of over 100,000 English words transcribed orthographically. I obtained it from The Interociter bulletin board in Dallas (214/258-1832). The original read.me file said that the list came from Public Brand Software. The original list contained 146,440 words, but I discovered that there were thousands of duplicate words. I resorted the list and removed the duplicates using the Unix utility uniq. The total number of words is now 109,582. I have repackaged the list into four files (the original was five): File Bytes Words Range --------- ------ ----- ----- words1.lst 315376 29839 A-D words2.lst 242484 23101 E-K words3.lst 325716 30439 L-R words4.lst 270759 26203 S-Z ---------------- Total 1154335 109582 This word list includes inflected forms, such as plural nouns and the -s, -ed and -ing forms of verbs. Thus the number of lexical stems represented in the list is considerably smaller than the total number of words. Evan Antworth Academic Computing Department Summer Institute of Linguistics 7500 W. Camp Wisdom Road Dallas, TX 75236 U.S.A. Internet: evan@sil.org UUCP: ...!uunet!convex!txsil!evan phone: 214/709-2418 fax: 214/709-3387 ---- Do you recon it's okay to use/post a dictionary I've compiled mainly from this source or not? It doesn't say anything about not using them, it even says it was taken from another source in the first place. Regards, Peter

Peter Pearson

Hi, Thanks for the information - I notice that Matt Gullett also posted an article on using this type of algorithm in spell checkers, so I'll have a look around. Regards, Peter