Long-term project on text analysis

Tags: database, sysadmin, business, tools, tutorial
26 Posts 11 Posters 0 Views 1 Watching
  • D DaveAuld

    Joking aside, you should maybe look through some of Google's white papers; they might give you some inspiration on how to tackle the problem: http://research.google.com/pubs/papers.html

    Dave Find Me On: Web|Facebook|Twitter|LinkedIn


    Folding Stats: Team CodeProject

    Fernando A Gomez F
    #7

    Nice, thanks a lot! That's a good place to start!

    • F Fernando A Gomez F

      Hello all. This morning I had a meeting with the CEO, and he charged me with the task of creating a tool that allows us to perform text analysis on a given set of data (e.g. web content). The main goal of this tool is to let our customers parse web or social content (i.e. web sites or Twitter), observe particular keywords or key phrases, and determine, for each one, what other terms are associated with it. For example, if I look at "Windows 8", I'll probably find "WinRT", "Surface" and "Metro" nearby in the text. Given a set of data, the tool should show how many times each of the associated phrases is found, and then perform some statistical analysis and generate reports. So far, what I have in mind is as follows:

      1. Crawl the text source, identifying paragraphs and sentences.
      2. Tokenize each sentence and discard the words found in a stopword list (articles, conjunctions, etc.).
      3. For each token, identify similar terms (not sure if "lemmatize" is the right word) according to the word's root (e.g. "Programming" should find "Program", "Programme", "Programmed", "Programmer").
      3a. Optionally, incorporate a synonym dictionary so that "House" can be related to "Home", for example.
      4. For each token (including those related in step 3), try to identify nearby tokens and associate them with a particular weight, determined by the distance between the two.
      5. Finally, for each token, generate statistical reports and store them in the database.

      This is what I have so far. I will try to get some information on how other tools do similar work (for instance, I know FAST Search Server has similar capabilities) before I jump into a working and business plan. But I also wanted CPians' valuable opinions: whether you think this is the right approach or not, whether there is a better (or standardized) way of doing it, whether I should forget about it altogether, etc. Also, recommendations on what I should learn before doing anything (for instance, I'm guessing I'll have to dig into data mining books and perhaps re-read my A.I. books), and of course books or articles you might want to recommend. All of this would be very welcome and appreciated, and will even get you free beer next time you come and visit :). So, thanks in advance guys! Best regards.
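      [Editor's note: a minimal sketch of steps 2 and 3 of the plan above, using the Python NLTK library that is suggested later in this thread. The function name, the Snowball stemmer, and the English wordlists are illustrative assumptions, not anything prescribed in the thread.]

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# One-time downloads of the tokenizer model and stopword list used below.
nltk.download("punkt")
nltk.download("stopwords")

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english"))

def content_tokens(text):
    """Sentence-split, tokenize, drop stopwords, and stem (steps 2-3)."""
    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            w = word.lower()
            if w.isalpha() and w not in stop:
                yield stemmer.stem(w)  # "programming" -> "program"

print(list(content_tokens("Programming the crawler for the Windows 8 launch.")))
# -> ['program', 'crawler', 'window', 'launch']  (non-alphabetic "8" is dropped)
```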

      Paul M Watt
      #8

      As other CPians have said, "look to Google". What you are describing sounds like PageRank, Google's original search algorithm; it has had many proprietary tweaks since then. You can find the high-level overview, and some very detailed equations that describe how to rank the closest links, at http://en.wikipedia.org/wiki/PageRank Regards, Paul

      All of my software is powered by a single Watt.
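      [Editor's note: since PageRank keeps coming up, here is a toy power-iteration version, just to make the equations on that Wikipedia page concrete. The three-node graph and the 0.85 damping factor are the usual textbook defaults, not values from this thread.]

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Every node gets a base share, plus damped rank flowing in
        # from each node that links to it, split over that node's out-links.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```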

      • F Fernando A Gomez F


        Karl Sanford
        #9

        Ultimately, this is a natural language processing problem. The steps you have listed are pretty much the very basic steps in all NLP systems. I'm confused by your step-4 metric, though: distance isn't necessarily important; you would want to factor in the parts of speech in some way. Possibly a distance metric over a POS parse tree? A library that helped me quite a bit with getting some of the basics out of the way is a Python library called NLTK.

        Be The Noise
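        [Editor's note: a quick illustration of Karl's part-of-speech suggestion with NLTK: tag the tokens, then keep only the noun-ish ones before counting associations. The example sentence and the noun-only filter are illustrative choices.]

```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Windows 8 ships with the new Metro interface.")
tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('Windows', 'NNP'), ('8', 'CD'), ..., ('interface', 'NN')]

# One cheap way to use the tags: restrict association counting to
# content-bearing tokens (nouns and proper nouns) rather than raw distance.
nouns = [w for w, tag in tagged if tag.startswith("NN")]
print(nouns)  # roughly ['Windows', 'Metro', 'interface'] (tagger-dependent)
```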

        • P Paul M Watt


          Fernando A Gomez F
          #10

          Yes, thanks for the link! Already reading it!

          • K Karl Sanford


            Fernando A Gomez F
            #11

            Thanks Karl! About the distance: when a given token is found (let's say, a noun), the distance to a second token is how many tokens (or sentences, or paragraphs) lie between them. I'm thinking of it in terms of distance between (graph) nodes. As for the library, I'll take a look at it, thanks!
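            [Editor's note: a minimal sketch of that distance idea: count co-occurrences within a window, weighting each pair by the inverse of the token distance. The window size and the 1/d decay are assumptions; any monotonically decreasing weight would fit the description above.]

```python
from collections import defaultdict

def cooccurrence(tokens, window=5):
    """Map (term, neighbour) pairs to accumulated distance-based weights."""
    weights = defaultdict(float)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            d = j - i                       # token distance
            weights[(t, tokens[j])] += 1.0 / d
    return weights

toks = ["windows", "8", "ships", "metro", "interface"]
for pair, w in cooccurrence(toks).items():
    print(pair, round(w, 2))
```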

            • F Fernando A Gomez F


              S Douglas
              #12

              If this is a company website selling goods, then google "data mining shopping basket analysis". There's lots of info on that topic and how to get it done.


              Common sense is admitting there is cause and effect and that you can exert some control over what you understand.

              • F Fernando A Gomez F


                Garth J Lancaster
                #13

                Sounds like fun, Fernando. I think the approach is OK - you'll just have to be a bit 'agile' until you figure out, for example, what the stopword list should be, and be prepared to refine it. 'g'

                • F Fernando A Gomez F


                  User 7901217
                  #14

                  If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing. East coast rather than West coast approach. May I suggest my colleague's book? Also my employer's video series on the topic. Programmers are "makers" by nature and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code. I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.

                  • F Fernando A Gomez F


                    aule browser
                    #15

                    Icon is now Object Icon and Unicon (the latest in UTF-8) - these are text-oriented, expression-based languages with some AI features (fail/succeed) and co-expressions. If Red replaces Rebol3 and remains PEG-equivalent, that is a parsing-expression-based language. So is MIT Curl (now from SCSK at curl.com), though that tends toward PCRE - but with neat regex tools. Then there is the whole Logtalk-on-XSB-Prolog route for RDF... Robert, Fredericton, NB, Canada - where text is in French and English

                    • F Fernando A Gomez F


                      User 7901217
                      #16

                      The term we use for your step 3 is "stemming". Your 3a is important but almost always domain-specific. In English, "trunk" and "boot" may be synonyms in auto insurance claims processing but not in a retail or zoology context. Synonym identification can often be driven by cluster analysis, e.g. "noise" and "vibration" may or may not cluster together in warranty claim analysis, depending on whether the claim affects the body or the drive train.
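                      [Editor's note: one way to encode that domain dependence is simply to key the synonym table by domain, as in this sketch; the dictionaries here are invented, mirroring the trunk/boot example above.]

```python
# Domain-keyed synonym tables: the same surface form normalizes to a
# different "authority" term depending on context.
SYNONYMS = {
    "auto_claims": {"boot": "trunk"},   # UK "boot" == US "trunk" of a car
    "zoology": {},                      # an elephant's trunk stays a trunk
}

def normalize(word, domain):
    """Map word to its domain-specific authority term, if any."""
    return SYNONYMS.get(domain, {}).get(word, word)

print(normalize("boot", "auto_claims"))  # -> trunk
print(normalize("boot", "zoology"))      # -> boot
```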

                      • S S Douglas


                        Fernando A Gomez F
                        #17

                        Hi Douglas. It is not for a company website selling goods, but rather a tool that lets customers track, in social media or web sites, how specific information is being perceived. I've been reading loads of info on data mining already, but I'll take your recommendation for further reference! Thanks again.

                        • G Garth J Lancaster


                          Fernando A Gomez F
                          #18

                          Yes, I see I need to build my thesaurus, or buy one. I've already emailed the Royal Spanish Academy to see whether they sell their dictionary with a thesaurus electronically. As for the stopword list, yes, it's gonna be painful to build one up. :-) Thanks!

                          • U User 7901217


                            Fernando A Gomez F
                            #19

                            Hi, thanks a lot for the book, I will get it for sure! As for using existing tools: we do use them. We have developed a small pipeline for FAST Search Server. FAST does most of the work we require and fits our SharePoint-aligned business offerings. This works well with big international firms, such as banks. But we've left out a strong market segment: marketing and research firms, and small and medium companies, which in most cases simply can't afford FAST or SharePoint. We're already partnering with an American firm that has really neat software for sentiment analysis, so that's going to be an option as well. We're in an evaluation stage, trying to come up with the best business plan so we can attack this market. We have to study as many options as possible, and one of those would be to create our own tool. I do agree with you, however, and think the best thing would be to use existing tools. I hope it won't come down to building from zero... Thanks a lot for your comments, the book and the videos! Best regards.

                            • A aule browser


                              Fernando A Gomez F
                              #20

                              Wow, I'd never heard of any of those tools! I'll take a look at them, thanks a lot!

                              • U User 7901217


                                Fernando A Gomez F
                                #21

                                Ah, now I get the term. So you're telling me I should be aware of which domain the word sits in? That makes sense; I'll take it into consideration, thanks! For the synonyms, I was thinking of acquiring a thesaurus, but I guess I'll have to build my own domain-specific ones... Best regards!

                                • U User 7901217


                                  SeattleC
                                  #22

                                  Yeah, I'm pretty sure this is a solved (or at least worked-on) problem, even though I have no particular idea how to solve it. It's related to automatically generating indexes for book text, to natural language understanding, and to ontology building. I think there are people solving the problem directly, too.

                                  • F Fernando A Gomez F


                                    smcnulty2000
                                    #23

                                    Goodness. There's a lot to say about this.

                                    Stopwords and real words: there are lists of words all over the place. If you are working in English, you can examine the following sites:

                                    http://en.wikipedia.org/wiki/Moby_Project
                                    http://www.paulnoll.com/Books/Clear-English/English-3000-common-words.html
                                    WordNet: http://wordnet.princeton.edu/
                                    http://en.wikipedia.org/wiki/Brown_Corpus

                                    If you want the whole subject list of Wikipedia, you can try a site like this:

                                    http://en.wikipedia.org/wiki/Wikipedia:Database_download
                                    http://dumps.wikimedia.org/enwiktionary/20120125/

                                    The last set was given to me by someone at CP for a similar but personal project. My thought on the wiki dump was that you could at least get a listing of real, genuine subjects, so you know that if a term appears (even if it is multi-word) it can be extracted and analyzed separately.

                                    Synonyms: the above sites include some ability to see synonyms, especially the WordNet list, which requires a little parsing but has columns for knowing how some words fit together. As I recall it has concept numbers that associate the words.

                                    In general, I find it best to use an "authority" word (or number) to lead all the related words back to. So if you had basin, creek, inland sea, lagoon, lakelet, loch, mere, millpond, mouth, pond, pool, reservoir, sluice, spring, tarn, cistern, lake or receptacle, they would all lead to "LAKE". But the trick is knowing which way someone meant a word like "spring". If you figure that out you might get famous.

                                    You might also want to look around to see if anyone has done a write-up on De Jong's FRUMP program from 1982. The kinds of items you will be analyzing might fall into a similar structure to how he did it: as I recall, his system would categorize news stories into different buckets.
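                                    [Editor's note: here is what the WordNet lookup looks like through NLTK. Taking each synset's first lemma as the "authority" term is our own convention for illustration, and the ambiguity of "spring" shows up directly as multiple synsets.]

```python
import nltk

nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def authority_terms(word):
    """One candidate authority term per WordNet sense of `word`."""
    return {s.name(): s.lemma_names()[0] for s in wn.synsets(word)}

# "spring" is exactly the hard case above: the season, the coil, the
# water source, the jump... each sense yields its own authority term.
print(authority_terms("spring"))
```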

                                    • S smcnulty2000


                                      Fernando A Gomez F
                                      #24

                                      Hi, thanks a lot for your complete input! Currently we're working with the Spanish language, so we're already in contact with Spain's Royal Academy of the Language, trying to acquire their several dictionaries. They have the traditional run-of-the-mill dictionary online, but we're trying to get synonyms and antonyms and such. Although we never considered Wikipedia, I think we'll give it a try, though the Spanish version is not as good as the English one.

                                      The authority word seems a good idea. I've been thinking of creating graphs for a given word, so you could navigate from a word to its synonyms (nodes in direct contact with the word) or related words (nodes in contact with the synonym nodes), and so on. The further one node is from another, the less related they are.

                                      As for storage, I'm not sure whether a database would be a good option. At least for the first release, we're considering using our own file system; if that proves either too difficult or time-consuming, we'll move to a DB engine. I'll take a look at the one you've mentioned. Also, I've seen folks recommending NoSQL when dealing with as much information as would come from web crawling or media monitoring.

                                      All in all, I think that at this point the project seems doable. Let's see if we get the funding. Thanks again for your help! Best regards.

                                      • F Fernando A Gomez F


                                        smcnulty2000
                                        #25

                                        I wish you luck on this. Progress in this area is good for everyone. MongoDB is a NoSQL database, so that advice is in agreement.

                                        _____________________________ A logician deducts the truth. A detective inducts the truth. A journalist abducts the truth. Give a man a mug, he drinks for a day. Teach a man to mug...
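                                        [Editor's note: for the record, storing the per-pair association weights in MongoDB takes a few lines with the pymongo driver. The database and collection names, and the document shape, are illustrative.]

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
col = client["textmining"]["associations"]

def record(term, neighbour, weight):
    # Upsert: one document per (term, neighbour) pair, accumulating
    # the distance-based weight and a raw co-occurrence count.
    col.update_one(
        {"term": term, "neighbour": neighbour},
        {"$inc": {"weight": weight, "count": 1}},
        upsert=True,
    )

record("windows 8", "metro", 0.5)
```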

                                        • S smcnulty2000


                                          Fernando A Gomez F
                                          #26

                                          Thanks! I'll let you know what we come up with. Plus, that MongoDB looks promising!
