Long-term project on text analysis

The Lounge
Tags: database, sysadmin, business, tools, tutorial
26 Posts, 11 Posters
Fernando A Gomez F (#1)

    Hello all. This morning I had a meeting with the CEO and he charged me with creating a tool that allows us to perform some text analysis on a given set of data (e.g. web content). The main goal is to let our customers parse web or social content (e.g. web sites or Twitter), observe particular keywords or key phrases, and determine which other terms are associated with each one. For example, if I look at "Windows 8", I'll probably find "WinRT", "Surface" and "Metro" nearby in the text. Given a set of data, the tool should show how many times each associated phrase is found, then perform some statistical analysis and generate reports. So far, what I have in mind is as follows:

    1. Crawl the text source, identifying paragraphs and sentences.
    2. Tokenize each sentence and discard words found in a stopword list (articles, conjunctions, etc.).
    3. For each token, identify similar terms (not sure if "lemmatize" is the right word) according to the word's root (e.g. "Programming" should find "Program", "Programme", "Programmed", "Programmer").
       3a. Optionally, a synonym dictionary could be incorporated so that "House" can be related to "Home", for example.
    4. For each token (including those related in step 3), try to identify nearby tokens and associate them with a weight determined by the distance between the two.
    5. Finally, for each token, generate statistical reports and store them in the database.

    That's what I have so far. I'll research how other tools do similar work (for instance, I know FAST Search Server has similar capabilities) before I jump into a working and business plan. But I also wanted CPians' valuable opinions: whether you think this is the right approach, whether there is a better (or standardized) way of doing it, whether I should forget about it altogether, etc.
Also, recommendations on what I should learn before doing anything (I'm guessing I'll have to dig into data mining books and perhaps re-read my A.I. books), and of course any books or articles you might recommend. All of this would be very welcome and appreciated, and will even get you free beer next time you come and visit :) . So, thanks in advance guys! Best regards.
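A minimal sketch of steps 2 and 4 above (tokenize, drop stopwords, count co-occurrences of nearby tokens) in plain Python; the tiny stopword list and the window size are placeholder assumptions, not part of the original plan:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "with"}  # placeholder list

def tokenize(sentence):
    """Lowercase word tokens with stopwords removed (step 2)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def cooccurrences(sentences, window=3):
    """Count pairs of tokens appearing within `window` positions of
    each other: a crude, unweighted version of step 4."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((left, right)))] += 1
    return counts

counts = cooccurrences([
    "Windows 8 ships with WinRT and Metro",
    "The Surface runs Windows 8",
])
print(counts[("8", "windows")])  # "windows" and "8" co-occur in both sentences
```

Step 5 would then just persist `counts` to the database; the weighting by distance can replace the flat `+= 1`.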

    • In reply to Fernando A Gomez F's original post (#1)

      wizardzz (#2)

      I hope you're good at regular expressions! j/k ;)

      • In reply to Fernando A Gomez F's original post (#1)

        DaveAuld (#3)

        Ask Google, they know how to find a cat now. :-D

        Dave Find Me On: Web|Facebook|Twitter|LinkedIn


        Folding Stats: Team CodeProject

        • In reply to Fernando A Gomez F's original post (#1)

          DaveAuld (#4)

          Joking aside, you should maybe look through some of Google's white papers; they might give you some inspiration on how to tackle the problem: http://research.google.com/pubs/papers.html


          • wizardzz wrote:

            I hope you're good at regular expressions! j/k ;)

            Fernando A Gomez F (#5)

            Heh, yeh, figured that! :)

            • DaveAuld wrote:

              Ask google, they know how to find a cat now. :-D

              Fernando A Gomez F (#6)

              Yes, I'm Googling bunches of information in that regard. Still, it's a lot of info at once.

              • DaveAuld wrote:

                Joking aside, you should maybe look through some of Google's white papers: http://research.google.com/pubs/papers.html

                Fernando A Gomez F (#7)

                Nice, thanks a lot! That's a good place to start!

                • In reply to Fernando A Gomez F's original post (#1)

                  Paul M Watt (#8)

                  As other CPians have said, "look to Google". What you are describing sounds like PageRank, Google's original search algorithm, which has had many proprietary tweaks since then. You can find a high-level overview, plus some very detailed equations describing how to rank the closest links, here: http://en.wikipedia.org/wiki/PageRank Regards, Paul

                  All of my software is powered by a single Watt.
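For reference, the core of the PageRank algorithm mentioned above is a short power iteration over a link graph. A minimal sketch in plain Python; the damping factor of 0.85 and the toy graph are illustrative assumptions, and dangling-node handling is simplified away:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> outbound links."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, outbound in links.items():
            if not outbound:
                continue  # dangling nodes contribute nothing in this sketch
            share = damping * rank[node] / len(outbound)
            for target in outbound:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy graph: B and C both link to A; A links to B; nobody links to C.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # "A" collects the most rank
```

The same machinery transfers to the OP's token graph: tokens as nodes, distance-weighted co-occurrence as edges.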

                  • In reply to Fernando A Gomez F's original post (#1)

                    Karl Sanford (#9)

                    Ultimately, this is a natural language processing problem. The steps you have listed are pretty much the basic steps of every NLP system. I'm confused by your step 4 metric, though: raw distance isn't necessarily important; you'd want to factor in the parts of speech in some way. Possibly a distance metric over a POS parse tree? A library that helped me quite a bit with getting the basics out of the way is a Python library called NLTK.

                    Be The Noise

                      • Paul M Watt wrote:

                        What you are describing sounds like PageRank, Google's original search algorithm: http://en.wikipedia.org/wiki/PageRank

                        Fernando A Gomez F (#10)

                      Yes, thanks for the link! Already reading it!

                        • Karl Sanford wrote:

                        Ultimately, this is a natural language processing problem. The steps you have listed are pretty much the very basic steps in all NLP systems. Although I'm confused on your step 4 metric, as distance isn't necessarily important, you would want to factor in the parts-of-speech in some way. Possibly a distance metric from a POS parse tree? A library that helped me quite a bit with getting some of the basics out of the way is a Python library called NLTK[^].


                          Fernando A Gomez F (#11)

                          Thanks Karl! About the distance: when a given token is found (say, a noun), the distance to a second token is how many tokens (or sentences, or paragraphs) lie between them. I'm thinking of it in terms of distance between graph nodes. As for the library, I'll take a look at it, thanks!
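That distance notion can be turned into a weight directly, e.g. a positional gap of k tokens contributing 1/(k+1). A small sketch in plain Python; the decay function and window size are illustrative assumptions, not anything Fernando specified:

```python
from collections import defaultdict

def distance_weights(tokens, focus, max_gap=5):
    """Accumulate weights for tokens near `focus`: a positional gap of k
    contributes 1/(k+1), so adjacent tokens count most, distant ones least."""
    weights = defaultdict(float)
    positions = [i for i, t in enumerate(tokens) if t == focus]
    for pos in positions:
        for gap in range(1, max_gap + 1):
            for neighbor_pos in (pos - gap, pos + gap):
                if 0 <= neighbor_pos < len(tokens) and tokens[neighbor_pos] != focus:
                    weights[tokens[neighbor_pos]] += 1.0 / (gap + 1)
    return dict(weights)

w = distance_weights(["windows", "8", "ships", "winrt", "metro"], "windows")
print(w["8"] > w["metro"])  # nearer tokens get larger weights
```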

                          • In reply to Fernando A Gomez F's original post (#1)

                            S Douglas (#12)

                            If this is a company website selling goods, then Google "data mining shopping basket analysis". Lots of info on that topic and how to get it done.


                          Common sense is admitting there is cause and effect and that you can exert some control over what you understand.

                            • In reply to Fernando A Gomez F's original post (#1)

                              Garth J Lancaster (#13)

                              Sounds like fun, Fernando. I think the approach is OK; you'll just have to be a bit 'agile' until you figure out what should go in the stopword list, for example, and be prepared to refine it. <g>

                              • In reply to Fernando A Gomez F's original post (#1)

                                User 7901217 (#14)

                                If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing. East coast rather than West coast approach. May I suggest my colleague's book? Also my employer's video series on the topic.

                                Programmers are "makers" by nature, and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code. I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly, but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.
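The "unsophisticated word-counting" approach above often amounts to a plain frequency table over normalized tokens; a minimal sketch in plain Python (the regex and the sample sentence are illustrative):

```python
import re
from collections import Counter

def word_counts(text, top=3):
    """Frequency of lowercase word tokens: the crude baseline that
    often suffices before reaching for full NLP machinery."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

top_words = word_counts("Windows 8 and WinRT: Windows tablets run Windows 8")
print(top_words)  # "windows" dominates the sample sentence
```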

                              • F Fernando A Gomez F

                                Hello all. This morning I had a meeting with the CEO and he charged me with the task of creating a tool that allow us to perform some text analysis on a given set of data (i.e. web content). The main goal of this tool is to allow our customers to parse some web content or social content (i.e. web sites or twitter) and observe particular keywords or key phrases, and determine for each one what other terms are being associated with. For example, if I look at "Windows 8", I'll probably find "WinRT", "Surface" and "Metro" words nearby in the text. The tool should be able to show, given a set of data, how many times each of the associated phrases are found, and then perform some statistical analysis and generate reports. So far, what I have in mind is something as follows: 1.- Crawl the text source, identifying paragraphs and sentences. 2.- Tokenize each sentence and discard those words found in a stopword list (i.e. articles, conjunctions, etc). 3.- For each token, identify similar terms (not sure if lemmatize is the right term) according to the word's root (i.e. "Programming" should find "Program", "Programme", "Programmed", "Programmer"). 3a.- Optionally, a synonym dictionary could be incorporated so that "House" can be related to "Home", for example. 4.- For each token (including those related on step 3), try to identify nearby tokens and associate them with a particular weight, determined by the distance between both. 5.- Finally, for each token, generate statistical reports and store 'em in the database. So, this I have so far. I will try to get some information on how other tools do similar work (for instance, I know FAST Search Server has similar capabilities) before I jump on a working and business plan. But I also wanted to know CPians' valuable opinions: whether you think this is a right approach or not, if there is a better (or standardized) way of doing it, if I should forget about it altogether, etc. 
Also, recommendations on what should I learn before doing anything (for instance, I'm guessing I'll have to dig on data mining books and perhaps re-read my A.I. books), and of course books or articles you might want to recommend. All of this would be very welcome and appreciated, and will even get you free beer next time you come and visit :) . So, thanks in advance guys! Best regards.
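For concreteness, steps 1-4 of the pipeline proposed above (tokenize, drop stopwords, weight nearby terms by distance) can be sketched in a few lines of Python. The stopword set and the 1/distance weighting here are illustrative assumptions, not a fixed design:

```python
import re
from collections import defaultdict

# Placeholder stopword list -- a real one would be much larger
# and language-specific (the OP also needs a Spanish list).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "is", "are", "to", "with"}

def tokenize(sentence):
    """Steps 1-2: lowercase word tokens with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def cooccurrence_weights(tokens, window=4):
    """Step 4: weight each pair of nearby tokens by 1/distance
    within a sliding window, so closer pairs weigh more."""
    weights = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            weights[pair] += 1.0 / (j - i)
    return weights

tokens = tokenize("Windows 8 ships with WinRT and the Metro interface")
w = cooccurrence_weights(tokens)
```

The resulting weights table is exactly what step 5 would aggregate per keyword before storing report data in the database.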

                                A Offline
                                aule browser
                                wrote on last edited by
                                #15

Icon lives on as Object Icon and Unicon (the latest with UTF-8 support); these are text-oriented, expression-based languages with some AI flavor (fail/succeed semantics) and co-expressions. If Red replaces Rebol3 and remains PEG-equivalent, that is another parsing-expression-based language. So is MIT Curl (now from SCSK at curl.com), though it tends toward PCRE, with some neat regex tools. Then there is the whole Logtalk-on-XSB-Prolog stack for RDF ... Robert, Fredericton, NB, Canada, where text is in French and English

                                • F Fernando A Gomez F


                                  U Offline
                                  User 7901217
                                  wrote on last edited by
                                  #16

                                  The term we use for your step 3 is "stemming". Your 3a is important but almost always domain-specific. In English, "trunk" and "boot" may be synonyms in auto insurance claims processing but not in a retail or zoology context. Synonym identification can often be driven by cluster analysis, e.g. "noise" and "vibration" may or may not cluster together in warranty claim analysis, depending on whether the claim affects the body or the drive train.
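A minimal illustration of the stemming idea described above: real systems use the Porter or Snowball algorithms (available in libraries such as NLTK), but even a naive suffix-stripping sketch shows how surface forms collapse to a shared stem. The suffix list below is purely illustrative:

```python
def naive_stem(word):
    """Very rough stemmer: strip common English suffixes.
    A production system would use Porter/Snowball stemming or true
    lemmatization; this suffix list exists only to show the idea."""
    for suffix in ("ming", "mer", "med", "ing", "ers", "er", "ed", "es", "s"):
        # Keep at least a 4-letter stem so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# "programming", "programmer" and "programmed" all collapse to the
# same stem, so a query on one surface form matches the others.
```

Domain-specific synonyms (the "trunk"/"boot" case) would then be a separate lookup table applied after stemming, since no stemmer can discover them.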

                                  • S S Douglas

                                    If this a company website selling goods, then google "data mining shopping basket analysis". Lots of info on that topic and how to get it done.


                                    Common sense is admitting there is cause and effect and that you can exert some control over what you understand.
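For reference, "shopping basket analysis" boils down to counting co-occurring items and computing support and confidence for rules like {bread} -> {milk}. A toy sketch (the baskets and item names are made up for illustration):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data, one set of items per basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
# Rule {bread} -> {milk}:
#   support    = P(bread and milk together)
#   confidence = P(milk | bread)
support = pair_counts[("bread", "milk")] / n
confidence = pair_counts[("bread", "milk")] / item_counts["bread"]
```

The same pair-counting machinery transfers directly to text: replace baskets with sentences and items with tokens, and you get keyword co-occurrence statistics.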

                                    F Offline
                                    Fernando A Gomez F
                                    wrote on last edited by
                                    #17

Hi Douglas, it is not for a company website selling goods, but rather a tool that allows customers to track, in social media or on web sites, how specific information is being perceived. I've already been reading loads of info on data mining, but I'll take your recommendation for further reference! Thanks again.

                                    • G Garth J Lancaster

Sounds like fun, Fernando. I think the approach is OK - you'll just have to be a bit 'agile' until you figure out what the stopword list should be, for example, and be prepared to refine it as you go 'g'

                                      F Offline
                                      Fernando A Gomez F
                                      wrote on last edited by
                                      #18

Yes, I see I need to build my thesaurus, or buy one. I've already emailed the Royal Spanish Academy to ask whether they sell their dictionary with a thesaurus in electronic form. As for the stopword list, yes, it's gonna be painful to build one up. :-) Thanks!

                                      • U User 7901217

If this is a business process and not an exercise, leveraging existing tools is going to be far more effective than being good at regular expressions. Industry experience says this kind of thing is best done by a surprisingly unsophisticated word-counting approach rather than AI-style natural language processing. East coast rather than West coast approach. May I suggest my colleague's book? Also my employer's video series on the topic.

Programmers are "makers" by nature and our first impulse given any problem (real or imagined) is to build something from molecules. But most business problems are better solved by not creating a redundant mass of high-maintenance code.

I'm an old programmer, and the older I get the less I know. But one thing I've learned along the way that keeps paying dividends is to begin my projects with a trip to the library. There may be personal satisfaction in reinventing something badly but anyone over the age of three should be beyond the "I do it MYSELF" stage. There is much more profit in standing on the shoulders of giants.
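The "unsophisticated word-counting approach" mentioned above really can be this simple: no parsing, no NLP, just term frequencies over cleaned text. A sketch using only Python's standard library (the stopword set is an illustrative assumption):

```python
import re
from collections import Counter

# Illustrative stopword set; a real deployment would use a larger,
# language-appropriate list.
STOP = {"the", "a", "and", "of", "to", "in", "is"}

def top_terms(text, n=5):
    """Plain word counting: lowercase, drop stopwords, rank by frequency."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    return Counter(words).most_common(n)

result = top_terms("the surface runs windows and windows runs metro")
```

For many business questions ("which terms appear near our brand, and how often?") this kind of counting gets most of the value at a fraction of the cost of full natural language processing.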

                                        F Offline
                                        Fernando A Gomez F
                                        wrote on last edited by
                                        #19

Hi, thanks a lot for the book, I will get it for sure! As for using existing tools: we do use them. We have developed a small pipeline for FAST Search Server; FAST does most of the work we require and fits our SharePoint-aligned business offerings. This works well with big international firms, such as banks. But we've been leaving out a sizable market segment: marketing and research firms, and small and medium companies, which in most cases simply can't afford FAST or SharePoint. We're already partnering with an American firm that has really neat software for sentiment analysis, so that's going to be an option as well. We're in an evaluation stage, trying to come up with the best business plan for attacking this market segment. We have to study as many options as possible, and one of them would be to create our own tool. I do agree with you, however, and think the best thing would be to use existing tools; I hope it won't come down to building from zero... Thanks a lot for your comments, the book and the videos! Best regards.

                                        • A aule browser


                                          F Offline
                                          Fernando A Gomez F
                                          wrote on last edited by
                                          #20

Wow, I'd never heard of any of those tools! I'll take a look at them, thanks a lot!
