User 7901217
Posts
-
How to deal with annoying idiots -
Long-term project on text analysis -
The term we use for your step 3 is "stemming". Your step 3a is important but almost always domain-specific. In English, "trunk" and "boot" may be synonyms in auto insurance claims processing but not in a retail or zoology context. Synonym identification can often be driven by cluster analysis. For example, "noise" and "vibration" may or may not cluster together in warranty claim analysis, depending on whether the claim affects the body or the drive train.
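To make the stemming and domain-specific synonym points concrete, here is a rough sketch. Everything in it is an assumption for illustration: the `naive_stem` suffix stripper is a crude stand-in for a real stemmer (such as a Porter stemmer from an existing library), and the `SYNONYMS_BY_DOMAIN` table and its domain names are hypothetical.

```python
import re

# Hypothetical, domain-specific synonym tables; the entries are illustrative only.
SYNONYMS_BY_DOMAIN = {
    "auto_insurance": {"boot": "trunk"},  # both mean the luggage compartment
    "retail": {},  # "boot" is footwear and "trunk" is luggage, so no merge
}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, domain):
    """Lowercase, tokenize, stem, then map stems through the domain's synonym table."""
    synonyms = SYNONYMS_BY_DOMAIN.get(domain, {})
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(token) for token in tokens]
    return [synonyms.get(stem, stem) for stem in stems]


print(normalize("Dented boots and trunks", "auto_insurance"))
# ['dent', 'trunk', 'and', 'trunk']
print(normalize("Dented boots and trunks", "retail"))
# ['dent', 'boot', 'and', 'trunk']
```

The same text normalizes differently per domain, which is why the synonym step cannot be a single global table.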
-
Long-term project on text analysis -
If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing: the East Coast rather than the West Coast approach. May I suggest my colleague's book? Also my employer's video series on the topic.

Programmers are "makers" by nature, and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code.

I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly, but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.
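For a sense of how unsophisticated the word-counting approach can be, here is a minimal sketch that leans on the standard library's `collections.Counter` rather than hand-rolled counting. The `STOP_WORDS` set and the sample claim texts are made up for illustration; a real project would take both from existing tools and data.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a maintained one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "at", "is", "to", "from"}


def term_counts(documents):
    """Count non-stop-word tokens across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts


# Made-up warranty claim snippets, for illustration only.
claims = [
    "Noise from the drive train at highway speed",
    "Vibration and noise in the body panel",
]

for term, count in term_counts(claims).most_common(3):
    print(term, count)
```

Counts like these are also the usual raw input for the kind of cluster analysis that suggests candidate synonyms.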