Long-term project on text analysis

Tags: database, sysadmin, business, tools, tutorial
26 Posts 11 Posters 0 Views 1 Watching
  • D DaveAuld

    Joking aside, you should maybe look through some of Google's white papers; they might give you some inspiration on how to tackle the problem: http://research.google.com/pubs/papers.html

    Dave Find Me On: Web|Facebook|Twitter|LinkedIn


    Folding Stats: Team CodeProject

    Fernando A Gomez F
    #7

    Nice, thanks a lot! That's a good place to start!

    • F Fernando A Gomez F

      Hello all. This morning I had a meeting with the CEO, and he charged me with the task of creating a tool that allows us to perform text analysis on a given set of data (e.g. web content). The main goal of this tool is to let our customers parse web or social content (i.e. web sites or Twitter), observe particular keywords or key phrases, and determine, for each one, what other terms are associated with it. For example, if I look at "Windows 8", I'll probably find "WinRT", "Surface" and "Metro" nearby in the text. Given a set of data, the tool should show how many times each of the associated phrases is found, and then perform some statistical analysis and generate reports. So far, what I have in mind is as follows:

      1. Crawl the text source, identifying paragraphs and sentences.
      2. Tokenize each sentence and discard the words found in a stopword list (articles, conjunctions, etc.).
      3. For each token, identify similar terms (not sure if "lemmatize" is the right word) according to the word's root (e.g. "Programming" should find "Program", "Programme", "Programmed", "Programmer").
      3a. Optionally, incorporate a synonym dictionary so that "House" can be related to "Home", for example.
      4. For each token (including those related in step 3), try to identify nearby tokens and associate them with a particular weight, determined by the distance between the two.
      5. Finally, for each token, generate statistical reports and store them in the database.

      This is what I have so far. I will try to get some information on how other tools do similar work (for instance, I know FAST Search Server has similar capabilities) before I jump into a working and business plan. But I also wanted CPians' valuable opinions: whether you think this is the right approach or not, whether there is a better (or standardized) way of doing it, whether I should forget about it altogether, etc. Also, recommendations on what I should learn before doing anything (for instance, I'm guessing I'll have to dig into data mining books and perhaps re-read my A.I. books), and of course books or articles you might want to recommend. All of this would be very welcome and appreciated, and will even get you free beer next time you come and visit :). So, thanks in advance guys! Best regards.
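      [Editor's note: a minimal sketch of steps 2 and 3 of the plan above, using the Python NLTK library that is suggested later in this thread. The function name, the Snowball stemmer, and the English wordlists are illustrative assumptions, not anything prescribed in the thread.]

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# One-time downloads of the tokenizer model and stopword list used below.
nltk.download("punkt")
nltk.download("stopwords")

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english"))

def content_tokens(text):
    """Sentence-split, tokenize, drop stopwords, and stem (steps 2-3)."""
    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            w = word.lower()
            if w.isalpha() and w not in stop:
                yield stemmer.stem(w)  # "programming" -> "program"

print(list(content_tokens("Programming the crawler for the Windows 8 launch.")))
# -> ['program', 'crawler', 'window', 'launch']  (non-alphabetic "8" is dropped)
```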

      Paul M Watt
      #8

      As other CPians have said, "look to Google". What you are describing sounds like PageRank, Google's original search algorithm; it has had many proprietary tweaks since then. You can find the high-level overview, and some very detailed equations that describe how to rank the closest links, at http://en.wikipedia.org/wiki/PageRank Regards, Paul

      All of my software is powered by a single Watt.
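      [Editor's note: since PageRank keeps coming up, here is a toy power-iteration version, just to make the equations on that Wikipedia page concrete. The three-node graph and the 0.85 damping factor are the usual textbook defaults, not values from this thread.]

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Every node gets a base share, plus damped rank flowing in
        # from each node that links to it, split over that node's out-links.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```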

      • F Fernando A Gomez F


        Karl Sanford
        #9

        Ultimately, this is a natural language processing problem. The steps you have listed are pretty much the very basic steps in all NLP systems. I'm confused by your step-4 metric, though: distance isn't necessarily important; you would want to factor in the parts of speech in some way. Possibly a distance metric over a POS parse tree? A library that helped me quite a bit with getting some of the basics out of the way is a Python library called NLTK.

        Be The Noise
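        [Editor's note: a quick illustration of Karl's part-of-speech suggestion with NLTK: tag the tokens, then keep only the noun-ish ones before counting associations. The example sentence and the noun-only filter are illustrative choices.]

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Windows 8 ships with the new Metro interface.")
tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('Windows', 'NNP'), ('8', 'CD'), ..., ('interface', 'NN')]

# One cheap way to use the tags: restrict association counting to
# content-bearing tokens (nouns and proper nouns) rather than raw distance.
nouns = [w for w, tag in tagged if tag.startswith("NN")]
print(nouns)  # roughly ['Windows', 'Metro', 'interface'] (tagger-dependent)
```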

        • P Paul M Watt


          Fernando A Gomez F
          #10

          Yes, thanks for the link! Already reading it!

          • K Karl Sanford


            Fernando A Gomez F
            #11

            Thanks Karl! About the distance: when a given token is found (let's say, a noun), the distance to a second token is how many tokens (or sentences, or paragraphs) lie between them. I'm thinking of it in terms of distance between (graph) nodes. As for the library, I'll take a look at it, thanks!
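            [Editor's note: a minimal sketch of that distance idea: count co-occurrences within a window, weighting each pair by the inverse of the token distance. The window size and the 1/d decay are assumptions; any monotonically decreasing weight would fit the description above.]

```python
from collections import defaultdict

def cooccurrence(tokens, window=5):
    """Map (term, neighbour) pairs to accumulated distance-based weights."""
    weights = defaultdict(float)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            d = j - i                       # token distance
            weights[(t, tokens[j])] += 1.0 / d
    return weights

toks = ["windows", "8", "ships", "metro", "interface"]
for pair, w in cooccurrence(toks).items():
    print(pair, round(w, 2))
```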

            • F Fernando A Gomez F


              S Douglas
              #12

              If this is a company website selling goods, then google "data mining shopping basket analysis". There's lots of info on that topic and how to get it done.


              Common sense is admitting there is cause and effect and that you can exert some control over what you understand.

              • F Fernando A Gomez F


                Garth J Lancaster
                #13

                Sounds like fun, Fernando. I think the approach is OK - you'll just have to be a bit 'agile' until you figure out, for example, what the stopword list should be, and be prepared to refine it. 'g'

                • F Fernando A Gomez F


                  User 7901217
                  #14

                  If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing. East coast rather than West coast approach. May I suggest my colleague's book? Also my employer's video series on the topic. Programmers are "makers" by nature and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code. I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.

                  • F Fernando A Gomez F


                    aule browser
                    #15

                    Icon is now Object Icon and Unicon (the latest in UTF-8) - these are text-oriented, expression-based languages with some AI features (fail/succeed) and co-expressions. If Red replaces Rebol3 and remains PEG-equivalent, that is a parsing-expression-based language. So is MIT Curl (now from SCSK at curl.com), though that tends toward PCRE - but with neat regex tools. Then there is the whole Logtalk-on-XSB-Prolog route for RDF... Robert, Fredericton, NB, Canada - where text is in French and English

                    • F Fernando A Gomez F


                      User 7901217
                      #16

                      The term we use for your step 3 is "stemming". Your 3a is important but almost always domain-specific. In English, "trunk" and "boot" may be synonyms in auto insurance claims processing but not in a retail or zoology context. Synonym identification can often be driven by cluster analysis, e.g. "noise" and "vibration" may or may not cluster together in warranty claim analysis, depending on whether the claim affects the body or the drive train.
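                      [Editor's note: one way to encode that domain dependence is simply to key the synonym table by domain, as in this sketch; the dictionaries here are invented, mirroring the trunk/boot example above.]

```python
# Domain-keyed synonym tables: the same surface form normalizes to a
# different "authority" term depending on context.
SYNONYMS = {
    "auto_claims": {"boot": "trunk"},   # UK "boot" == US "trunk" of a car
    "zoology": {},                      # an elephant's trunk stays a trunk
}

def normalize(word, domain):
    """Map word to its domain-specific authority term, if any."""
    return SYNONYMS.get(domain, {}).get(word, word)

print(normalize("boot", "auto_claims"))  # -> trunk
print(normalize("boot", "zoology"))      # -> boot
```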

                      • S S Douglas


                        Fernando A Gomez F
                        #17

                        Hi Douglas. It is not for a company website selling goods, but rather a tool that lets customers track, in social media or web sites, how specific information is being perceived. I've been reading loads of info on data mining already, but I'll take your recommendation for further reference! Thanks again.

                        • G Garth J Lancaster


                          Fernando A Gomez F
                          #18

                          Yes, I see I need to build my thesaurus, or buy one. I've already emailed the Royal Spanish Academy to see whether they sell their dictionary with a thesaurus electronically. As for the stopword list, yes, it's gonna be painful to build one up. :-) Thanks!

                          • U User 7901217


                            Fernando A Gomez F
                            #19

                            Hi, thanks a lot for the book, I will get it for sure! As for using existing tools: we do use them. We have developed a small pipeline for FAST Search Server. FAST does most of the work we require and fits our SharePoint-aligned business offerings. This works well with big international firms, such as banks. But we've left out a strong market segment: marketing and research firms, and small and medium companies, which in most cases simply can't afford FAST or SharePoint. We're already partnering with an American firm that has really neat software for sentiment analysis, so that's going to be an option as well. We're in an evaluation stage, trying to come up with the best business plan so we can attack this market. We have to study as many options as possible, and one of those would be to create our own tool. I do agree with you, however, and think the best thing would be to use existing tools. I hope it won't come down to building from zero... Thanks a lot for your comments, the book and the videos! Best regards.

                            • A aule browser


                              Fernando A Gomez F
                              #20

                              Wow, I'd never heard of any of those tools! I'll take a look at them, thanks a lot!

                              • U User 7901217


                                Fernando A Gomez F
                                #21

                                Ah, now I get the term. So you're telling me I should be aware of which domain the word sits in? That makes sense; I'll take it into consideration, thanks! For the synonyms, I was thinking of acquiring a thesaurus, but I guess I'll have to build my own domain-specific ones... Best regards!

                                • U User 7901217


                                  SeattleC
                                  #22

                                  Yeah, I'm pretty sure this is a solved (or at least worked-on) problem, even though I have no particular idea how to solve it. It's related to automatically generating indexes for book text, to natural language understanding, and to ontology building. I think there are people solving the problem directly, too.

                                  • F Fernando A Gomez F


                                    smcnulty2000
                                    #23

                                    Goodness. There's a lot to say about this.

                                    Stopwords and real words: there are lists of words all over the place. If you are working in English, you can examine the following sites:

                                    http://en.wikipedia.org/wiki/Moby_Project
                                    http://www.paulnoll.com/Books/Clear-English/English-3000-common-words.html
                                    WordNet: http://wordnet.princeton.edu/
                                    http://en.wikipedia.org/wiki/Brown_Corpus

                                    If you want the whole subject list of Wikipedia, you can try a site like this:

                                    http://en.wikipedia.org/wiki/Wikipedia:Database_download
                                    http://dumps.wikimedia.org/enwiktionary/20120125/

                                    The last set was given to me by someone at CP for a similar but personal project. My thought on the wiki dump was that you could at least get a listing of real, genuine subjects, so you know that if a term appears (even if it is multi-word) it can be extracted and analyzed separately.

                                    Synonyms: the above sites include some ability to see synonyms, especially the WordNet list, which requires a little parsing but has columns for knowing how some words fit together. As I recall it has concept numbers that associate the words.

                                    In general, I find it best to use an "authority" word (or number) to lead all the related words back to. So if you had basin, creek, inland sea, lagoon, lakelet, loch, mere, millpond, mouth, pond, pool, reservoir, sluice, spring, tarn, cistern, lake or receptacle, they would all lead to "LAKE". But the trick is knowing which way someone meant a word like "spring". If you figure that out you might get famous.

                                    You might also want to look around to see if anyone has done a write-up on De Jong's FRUMP program from 1982. The kinds of items you will be analyzing might fall into a similar structure to how he did it: as I recall, his system would categorize news stories into different buckets.
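                                    [Editor's note: here is what the WordNet lookup looks like through NLTK. Taking each synset's first lemma as the "authority" term is our own convention for illustration, and the ambiguity of "spring" shows up directly as multiple synsets.]

```python
import nltk

nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def authority_terms(word):
    """One candidate authority term per WordNet sense of `word`."""
    return {s.name(): s.lemma_names()[0] for s in wn.synsets(word)}

# "spring" is exactly the hard case above: the season, the coil, the
# water source, the jump... each sense yields its own authority term.
print(authority_terms("spring"))
```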

                                    • S smcnulty2000


                                      Fernando A Gomez F
                                      #24

                                      Hi, thanks a lot for your complete input! Currently we're working with the Spanish language, so we're already in contact with Spain's Royal Academy of the Language, trying to acquire their several dictionaries. They have the traditional run-of-the-mill dictionary online, but we're trying to get synonyms and antonyms and such. Although we never considered Wikipedia, I think we'll give it a try, though the Spanish version is not as good as the English one.

                                      The authority word seems a good idea. I've been thinking of creating graphs for a given word, so you could navigate from a word to its synonyms (nodes in direct contact with the word) or related words (nodes in contact with the synonym nodes), and so on. The further one node is from another, the less related they are.

                                      As for storage, I'm not sure whether a database would be a good option. At least for the first release, we're considering using our own file system; if that proves either too difficult or time-consuming, we'll move to a DB engine. I'll take a look at the one you've mentioned. Also, I've seen folks recommending NoSQL when dealing with as much information as would come from web crawling or media monitoring.

                                      All in all, I think that at this point the project seems doable. Let's see if we get the funding. Thanks again for your help! Best regards.

                                      • F Fernando A Gomez F


                                        smcnulty2000
                                        #25

                                        I wish you luck on this. Progress in this area is good for everyone. MongoDB is a NoSQL database, so that advice is in agreement.

                                        _____________________________ A logician deducts the truth. A detective inducts the truth. A journalist abducts the truth. Give a man a mug, he drinks for a day. Teach a man to mug...
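                                        [Editor's note: for the record, storing the per-pair association weights in MongoDB takes a few lines with the pymongo driver. The database and collection names, and the document shape, are illustrative.]

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
col = client["textmining"]["associations"]

def record(term, neighbour, weight):
    # Upsert: one document per (term, neighbour) pair, accumulating
    # the distance-based weight and a raw co-occurrence count.
    col.update_one(
        {"term": term, "neighbour": neighbour},
        {"$inc": {"weight": weight, "count": 1}},
        upsert=True,
    )

record("windows 8", "metro", 0.5)
```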

                                        • S smcnulty2000


                                          Fernando A Gomez F
                                          #26

                                          Thanks! I'll let you know what we come up with. Plus, that MongoDB looks promising!
