Configurable webpage scraper with learning capabilities
-
I like "The Code Project" site. A lot of articles have information and code that i don't need directly, but i already have a purpose for them in the future. So i like to keep those articles on my localdrive. The proces of storing all the wanted data of the article in the format and location i like is rather timeconsuming. -First i create a Worddocument -then i start scraping the header of the article from the screen and paste it in this Worddocument -then i scrape the content of the article from the top inluding downloadlinks to the bottom just after the author's profile (i dont want links of other interresting articles in this document) and paste it in this Worddocument. -then i save the worddocument in a subdirectory at a predefined location with the name of the title of the article (the subdirectory is also named after the title of the article) -then i start downloading the files on the top if any are included and place them in the subdirectory mentioned earlier. This problem is the same with other websites, mostly the way they present the data is different. Wouldn't it be great to have a tool where you can automate this task after you first explain the tool how to gather the data from a particular webpage, in which format and where to store it. It's pretty much how they program a robot in the car industry that has to paint the chassis of a car model (once programmed the robot does it exactly the same every time for that car model). The tool should eventually recognize if a presented page has been scraped before or that something has changed! The tool should also try to scrape the name of the files from the screen and use them instead of the suggested files in the download dialog. I would very much like to participate in a project for such a tool. Zitniet