Long-term project on text analysis

The Lounge
Tags: database, sysadmin, business, tools, tutorial
26 Posts, 11 Posters
Fernando A Gomez F (#1)

    Hello all. This morning I had a meeting with the CEO and he charged me with creating a tool that allows us to perform some text analysis on a given set of data (e.g. web content). The main goal is to let our customers parse web or social content (e.g. web sites or Twitter), observe particular keywords or key phrases, and determine which other terms are associated with each one. For example, if I look at "Windows 8", I'll probably find "WinRT", "Surface" and "Metro" nearby in the text. Given a set of data, the tool should show how many times each associated phrase is found, then perform some statistical analysis and generate reports. So far, what I have in mind is as follows:

    1. Crawl the text source, identifying paragraphs and sentences.
    2. Tokenize each sentence and discard words found in a stopword list (articles, conjunctions, etc.).
    3. For each token, identify similar terms (not sure if "lemmatize" is the right word) according to the word's root (e.g. "Programming" should find "Program", "Programme", "Programmed", "Programmer").
       3a. Optionally, a synonym dictionary could be incorporated so that "House" can be related to "Home", for example.
    4. For each token (including those related in step 3), try to identify nearby tokens and associate them with a weight determined by the distance between the two.
    5. Finally, for each token, generate statistical reports and store them in the database.

    That's what I have so far. I'll research how other tools do similar work (for instance, I know FAST Search Server has similar capabilities) before I jump into a working and business plan. But I also wanted CPians' valuable opinions: whether you think this is the right approach, whether there is a better (or standardized) way of doing it, whether I should forget about it altogether, etc.
Also, recommendations on what I should learn before doing anything (I'm guessing I'll have to dig into data mining books and perhaps re-read my A.I. books), and of course any books or articles you might recommend. All of this would be very welcome and appreciated, and will even get you free beer next time you come and visit :) . So, thanks in advance guys! Best regards.
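A minimal sketch of steps 2 and 4 above (tokenize, drop stopwords, count co-occurrences of nearby tokens) in plain Python; the tiny stopword list and the window size are placeholder assumptions, not part of the original plan:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "with"}  # placeholder list

def tokenize(sentence):
    """Lowercase word tokens with stopwords removed (step 2)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def cooccurrences(sentences, window=3):
    """Count pairs of tokens appearing within `window` positions of
    each other: a crude, unweighted version of step 4."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((left, right)))] += 1
    return counts

counts = cooccurrences([
    "Windows 8 ships with WinRT and Metro",
    "The Surface runs Windows 8",
])
print(counts[("8", "windows")])  # "windows" and "8" co-occur in both sentences
```

Step 5 would then just persist `counts` to the database; the weighting by distance can replace the flat `+= 1`.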

    • In reply to Fernando A Gomez F's original post (#1)

      wizardzz (#2)

      I hope you're good at regular expressions! j/k ;)

      • In reply to Fernando A Gomez F's original post (#1)

        DaveAuld (#3)

        Ask Google, they know how to find a cat now. :-D

        Dave Find Me On: Web|Facebook|Twitter|LinkedIn


        Folding Stats: Team CodeProject

        • In reply to Fernando A Gomez F's original post (#1)

          DaveAuld (#4)

          Joking aside, you should maybe look through some of Google's white papers; they might give you some inspiration on how to tackle the problem: http://research.google.com/pubs/papers.html


          • wizardzz wrote:

            I hope you're good at regular expressions! j/k ;)

            Fernando A Gomez F (#5)

            Heh, yeh, figured that! :)

            • DaveAuld wrote:

              Ask google, they know how to find a cat now. :-D

              Fernando A Gomez F (#6)

              Yes, I'm Googling bunches of information in that regard. Still, it's a lot of info at once.

              • DaveAuld wrote:

                Joking aside, you should maybe look through some of Google's white papers: http://research.google.com/pubs/papers.html

                Fernando A Gomez F (#7)

                Nice, thanks a lot! That's a good place to start!

                • In reply to Fernando A Gomez F's original post (#1)

                  Paul M Watt (#8)

                  As other CPians have said, "look to Google". What you are describing sounds like PageRank, Google's original search algorithm, which has had many proprietary tweaks since then. You can find a high-level overview, plus some very detailed equations describing how to rank the closest links, here: http://en.wikipedia.org/wiki/PageRank Regards, Paul

                  All of my software is powered by a single Watt.
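For reference, the core of the PageRank algorithm mentioned above is a short power iteration over a link graph. A minimal sketch in plain Python; the damping factor of 0.85 and the toy graph are illustrative assumptions, and dangling-node handling is simplified away:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> outbound links."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, outbound in links.items():
            if not outbound:
                continue  # dangling nodes contribute nothing in this sketch
            share = damping * rank[node] / len(outbound)
            for target in outbound:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: B and C both link to A; A links to B; nobody links to C.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # "A" collects the most rank
```

The same machinery transfers to the OP's token graph: tokens as nodes, distance-weighted co-occurrence as edges.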

                  • In reply to Fernando A Gomez F's original post (#1)

                    Karl Sanford (#9)

                    Ultimately, this is a natural language processing problem. The steps you have listed are pretty much the basic steps of every NLP system. I'm confused by your step 4 metric, though: raw distance isn't necessarily important; you'd want to factor in the parts of speech in some way. Possibly a distance metric over a POS parse tree? A library that helped me quite a bit with getting the basics out of the way is a Python library called NLTK.

                    Be The Noise

                      • Paul M Watt wrote:

                        What you are describing sounds like PageRank, Google's original search algorithm: http://en.wikipedia.org/wiki/PageRank

                        Fernando A Gomez F (#10)

                      Yes, thanks for the link! Already reading it!

                        • Karl Sanford wrote:

                        Ultimately, this is a natural language processing problem. The steps you have listed are pretty much the very basic steps in all NLP systems. Although I'm confused on your step 4 metric, as distance isn't necessarily important, you would want to factor in the parts-of-speech in some way. Possibly a distance metric from a POS parse tree? A library that helped me quite a bit with getting some of the basics out of the way is a Python library called NLTK[^].


                          Fernando A Gomez F (#11)

                          Thanks Karl! About the distance: when a given token is found (say, a noun), the distance to a second token is how many tokens (or sentences, or paragraphs) lie between them. I'm thinking of it in terms of distance between graph nodes. As for the library, I'll take a look at it, thanks!
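That distance notion can be turned into a weight directly, e.g. a positional gap of k tokens contributing 1/(k+1). A small sketch in plain Python; the decay function and window size are illustrative assumptions, not anything Fernando specified:

```python
from collections import defaultdict

def distance_weights(tokens, focus, max_gap=5):
    """Accumulate weights for tokens near `focus`: a positional gap of k
    contributes 1/(k+1), so adjacent tokens count most, distant ones least."""
    weights = defaultdict(float)
    positions = [i for i, t in enumerate(tokens) if t == focus]
    for pos in positions:
        for gap in range(1, max_gap + 1):
            for neighbor_pos in (pos - gap, pos + gap):
                if 0 <= neighbor_pos < len(tokens) and tokens[neighbor_pos] != focus:
                    weights[tokens[neighbor_pos]] += 1.0 / (gap + 1)
    return dict(weights)

w = distance_weights(["windows", "8", "ships", "winrt", "metro"], "windows")
print(w["8"] > w["metro"])  # nearer tokens get larger weights
```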

                          • In reply to Fernando A Gomez F's original post (#1)

                            S Douglas (#12)

                            If this is a company website selling goods, then Google "data mining shopping basket analysis". Lots of info on that topic and how to get it done.


                          Common sense is admitting there is cause and effect and that you can exert some control over what you understand.

                            • In reply to Fernando A Gomez F's original post (#1)

                              Garth J Lancaster (#13)

                              Sounds like fun, Fernando. I think the approach is OK; you'll just have to be a bit 'agile' until you figure out what should go in the stopword list, for example, and be prepared to refine it. <g>

                              • In reply to Fernando A Gomez F's original post (#1)

                                User 7901217 (#14)

                                If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing. East coast rather than West coast approach. May I suggest my colleague's book? Also my employer's video series on the topic.

                                Programmers are "makers" by nature, and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code. I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly, but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.
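The "unsophisticated word-counting" approach above often amounts to a plain frequency table over normalized tokens; a minimal sketch in plain Python (the regex and the sample sentence are illustrative):

```python
import re
from collections import Counter

def word_counts(text, top=3):
    """Frequency of lowercase word tokens: the crude baseline that
    often suffices before reaching for full NLP machinery."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

top_words = word_counts("Windows 8 and WinRT: Windows tablets run Windows 8")
print(top_words)  # "windows" dominates the sample sentence
```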

                              • F Fernando A Gomez F

                                Hello all. This morning I had a meeting with the CEO and he charged me with the task of creating a tool that allow us to perform some text analysis on a given set of data (i.e. web content). The main goal of this tool is to allow our customers to parse some web content or social content (i.e. web sites or twitter) and observe particular keywords or key phrases, and determine for each one what other terms are being associated with. For example, if I look at "Windows 8", I'll probably find "WinRT", "Surface" and "Metro" words nearby in the text. The tool should be able to show, given a set of data, how many times each of the associated phrases are found, and then perform some statistical analysis and generate reports. So far, what I have in mind is something as follows: 1.- Crawl the text source, identifying paragraphs and sentences. 2.- Tokenize each sentence and discard those words found in a stopword list (i.e. articles, conjunctions, etc). 3.- For each token, identify similar terms (not sure if lemmatize is the right term) according to the word's root (i.e. "Programming" should find "Program", "Programme", "Programmed", "Programmer"). 3a.- Optionally, a synonym dictionary could be incorporated so that "House" can be related to "Home", for example. 4.- For each token (including those related on step 3), try to identify nearby tokens and associate them with a particular weight, determined by the distance between both. 5.- Finally, for each token, generate statistical reports and store 'em in the database. So, this I have so far. I will try to get some information on how other tools do similar work (for instance, I know FAST Search Server has similar capabilities) before I jump on a working and business plan. But I also wanted to know CPians' valuable opinions: whether you think this is a right approach or not, if there is a better (or standardized) way of doing it, if I should forget about it altogether, etc. 
Also, recommendations on what should I learn before doing anything (for instance, I'm guessing I'll have to dig on data mining books and perhaps re-read my A.I. books), and of course books or articles you might want to recommend. All of this would be very welcome and appreciated, and will even get you free beer next time you come and visit :) . So, thanks in advance guys! Best regards.
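For concreteness, steps 1-4 of the pipeline proposed above (tokenize, drop stopwords, weight nearby terms by distance) can be sketched in a few lines of Python. The stopword set and the 1/distance weighting here are illustrative assumptions, not a fixed design:

```python
import re
from collections import defaultdict

# Placeholder stopword list -- a real one would be much larger
# and language-specific (the OP also needs a Spanish list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "is", "are", "to", "with"}

def tokenize(sentence):
    """Steps 1-2: lowercase word tokens with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def cooccurrence_weights(tokens, window=4):
    """Step 4: weight each pair of nearby tokens by 1/distance
    within a sliding window, so closer pairs weigh more."""
    weights = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            weights[pair] += 1.0 / (j - i)
    return weights

tokens = tokenize("Windows 8 ships with WinRT and the Metro interface")
w = cooccurrence_weights(tokens)
```

The resulting weights table is exactly what step 5 would aggregate per keyword before storing report data in the database.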

                                A Offline
                                aule browser
                                wrote on last edited by
                                #15

Icon lives on as Object Icon and Unicon (the latest with UTF-8 support); these are text-oriented, expression-based languages with some AI flavor (fail/succeed semantics) and co-expressions. If Red replaces Rebol3 and remains PEG-equivalent, that is another parsing-expression-based language. So is MIT Curl (now from SCSK at curl.com), though it tends toward PCRE, with some neat regex tools. Then there is the whole Logtalk-on-XSB-Prolog stack for RDF ... Robert, Fredericton, NB, Canada, where text is in French and English

                                • F Fernando A Gomez F


                                  U Offline
                                  User 7901217
                                  wrote on last edited by
                                  #16

                                  The term we use for your step 3 is "stemming". Your 3a is important but almost always domain-specific. In English, "trunk" and "boot" may be synonyms in auto insurance claims processing but not in a retail or zoology context. Synonym identification can often be driven by cluster analysis, e.g. "noise" and "vibration" may or may not cluster together in warranty claim analysis, depending on whether the claim affects the body or the drive train.
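A minimal illustration of the stemming idea described above: real systems use the Porter or Snowball algorithms (available in libraries such as NLTK), but even a naive suffix-stripping sketch shows how surface forms collapse to a shared stem. The suffix list below is purely illustrative:

```python
def naive_stem(word):
    """Very rough stemmer: strip common English suffixes.
    A production system would use Porter/Snowball stemming or true
    lemmatization; this suffix list exists only to show the idea."""
    for suffix in ("ming", "mer", "med", "ing", "ers", "er", "ed", "es", "s"):
        # Keep at least a 4-letter stem so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# "programming", "programmer" and "programmed" all collapse to the
# same stem, so a query on one surface form matches the others.
```

Domain-specific synonyms (the "trunk"/"boot" case) would then be a separate lookup table applied after stemming, since no stemmer can discover them.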

                                  • S S Douglas

                                    If this a company website selling goods, then google "data mining shopping basket analysis". Lots of info on that topic and how to get it done.


                                    Common sense is admitting there is cause and effect and that you can exert some control over what you understand.
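For reference, "shopping basket analysis" boils down to counting co-occurring items and computing support and confidence for rules like {bread} -> {milk}. A toy sketch (the baskets and item names are made up for illustration):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data, one set of items per basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
# Rule {bread} -> {milk}:
#   support    = P(bread and milk together)
#   confidence = P(milk | bread)
support = pair_counts[("bread", "milk")] / n
confidence = pair_counts[("bread", "milk")] / item_counts["bread"]
```

The same pair-counting machinery transfers directly to text: replace baskets with sentences and items with tokens, and you get keyword co-occurrence statistics.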

                                    F Offline
                                    Fernando A Gomez F
                                    wrote on last edited by
                                    #17

Hi Douglas, it is not for a company website selling goods, but rather a tool that allows customers to track, in social media or on web sites, how specific information is being perceived. I've already been reading loads of info on data mining, but I'll take your recommendation for further reference! Thanks again.

                                    • G Garth J Lancaster

Sounds like fun, Fernando. I think the approach is OK - you'll just have to be a bit 'agile' until you figure out what the stopword list should be, for example, and be prepared to refine it as you go 'g'

                                      F Offline
                                      Fernando A Gomez F
                                      wrote on last edited by
                                      #18

Yes, I see I need to build my thesaurus, or buy one. I've already emailed the Royal Spanish Academy to ask whether they sell their dictionary with a thesaurus in electronic form. As for the stopword list, yes, it's gonna be painful to build one up. :-) Thanks!

                                      • U User 7901217

If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing. East coast rather than West coast approach. May I suggest my colleague's book? Also my employer's video series on the topic.

Programmers are "makers" by nature and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code.

I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.
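The "unsophisticated word-counting approach" mentioned above really can be this simple: no parsing, no NLP, just term frequencies over cleaned text. A sketch using only Python's standard library (the stopword set is an illustrative assumption):

```python
import re
from collections import Counter

# Illustrative stopword set; a real deployment would use a larger,
# language-appropriate list.
STOP = {"the", "a", "and", "of", "to", "in", "is"}

def top_terms(text, n=5):
    """Plain word counting: lowercase, drop stopwords, rank by frequency."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    return Counter(words).most_common(n)

result = top_terms("the surface runs windows and windows runs metro")
```

For many business questions ("which terms appear near our brand, and how often?") this kind of counting gets most of the value at a fraction of the cost of full natural language processing.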

                                        F Offline
                                        Fernando A Gomez F
                                        wrote on last edited by
                                        #19

Hi, thanks a lot for the book, I will get it for sure! As for using existing tools: we do use them. We have developed a small pipeline for FAST Search Server; FAST does most of the work we require and fits our SharePoint-aligned business offerings. This works well with big international firms, such as banks. But we've been leaving out a sizable market segment: marketing and research firms, and small and medium companies, which in most cases simply can't afford FAST or SharePoint. We're already partnering with an American firm that has really neat software for sentiment analysis, so that's going to be an option as well. We're in an evaluation stage, trying to come up with the best business plan for attacking this market segment. We have to study as many options as possible, and one of them would be to create our own tool. I do agree with you, however, and think the best thing would be to use existing tools; I hope it won't come down to building from zero... Thanks a lot for your comments, the book and the videos! Best regards.

                                        • A aule browser


                                          F Offline
                                          Fernando A Gomez F
                                          wrote on last edited by
                                          #20

Wow, I'd never heard of any of those tools! I'll take a look at them, thanks a lot!
