Screen scraping C#
-
I've been working on a screen scraper application using the WebBrowser class, for an organisation I am involved with. The basic problem is that I get a screen of data presented in table form and use the HtmlElement InnerText property to get the raw data, which I then need to parse to extract the bits I want (the actual details are not important). However, I have found that the same screenful of information is passed to me in a slightly different format depending on whether my client PC is running XP, Vista or Windows 7. The content is exactly the same, but fields are separated by spaces, \r\n, or even \r\n\r\n sequences. Has anyone else come across a similar issue, and if so, how did you resolve it?
Unrequited desire is character building. OriginalGriff
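PS: to make the setup concrete, the retrieval itself is roughly the following; the element id "results" is invented for this post, the real page has its own.

using System.Windows.Forms;

static class ScrapeSetup
{
    // Called once the WebBrowser raises DocumentCompleted: pull the
    // flattened text of the results table. It is this string whose field
    // separators turn out to differ between XP, Vista and Windows 7.
    public static string GetRawTableText(WebBrowser browser)
    {
        HtmlElement table = browser.Document.GetElementById("results");
        return table == null ? "" : (table.InnerText ?? "");
    }
}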
-
My two cents on web scraping:
1. I'd rather use WebClient or HttpWebRequest/HttpWebResponse, unless there is some JavaScript/CSS going on that modifies the incoming page before it gets displayed.
2. As web pages tend to be redesigned all the time, your parser needs to be as tolerant as it possibly can be. I therefore typically try to locate the area of interest based on HTML tags, substring that part, then remove all the irrelevant stuff, such as further HTML tags (they suddenly wanted the names in bold, the numbers in a larger font, etc.), and reduce whitespace to its essential level (getting rid of \r, \n, \t, multiple spaces, etc.). You simply can't rely on the exact page content; it will break in a matter of days or weeks, and you will be blamed for your app no longer working properly. :)
PS: when your OSes vary, so will your IE versions, and they may behave differently from one version to the next (that is the whole idea of having a new version, apparently).
Luc Pattyn [My Articles] Nil Volentibus Arduum iSad
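PS2: something along these lines is what I mean by tolerant; the fetching is real, but the two markers that bracket the area of interest are invented, your page will have its own.

using System;
using System.Net;
using System.Text.RegularExpressions;

static class TolerantScraper
{
    public static string GetInterestingText(string url)
    {
        // Fetch the raw page without involving a browser control.
        string html;
        using (var client = new WebClient())
        {
            html = client.DownloadString(url);
        }

        // 1. Locate the area of interest by the tags that bracket it.
        int start = html.IndexOf("<table id=\"data\"", StringComparison.OrdinalIgnoreCase);
        if (start < 0)
            throw new InvalidOperationException("Start marker not found; page redesigned?");
        int end = html.IndexOf("</table>", start, StringComparison.OrdinalIgnoreCase);
        if (end < 0)
            throw new InvalidOperationException("End marker not found; page redesigned?");
        string region = html.Substring(start, end - start);

        // 2. Remove the remaining tags (bold, larger fonts and the like come and go).
        string text = Regex.Replace(region, "<[^>]+>", " ");

        // 3. Reduce whitespace (\r, \n, \t, multiple spaces) to its essential level.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}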
-
Thanks for the suggestions, Luc.
1. I'll take a look at these options; I did not look too closely at existing samples before starting my project.
2. The problem with the main screen is that it is a large table, and it looks considerably easier to parse as text rather than going through all the HTML table items.
PS: Of course, that is the obvious answer, thanks again.
Unrequited desire is character building. OriginalGriff
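PS2: by "parse as text" I mean something like the following, once the whitespace has been normalised to single spaces; the fixed field count is invented, the real table defines its own.

using System;

static class TextTableParser
{
    // Chop the normalised, space-separated text into rows of a fixed
    // number of fields. Fragile if a field can itself contain a space,
    // which is part of what makes parsing the text tricky.
    public static string[][] ParseRows(string normalised, int fieldsPerRow)
    {
        string[] fields = normalised.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int rowCount = fields.Length / fieldsPerRow;
        var rows = new string[rowCount][];
        for (int r = 0; r < rowCount; r++)
        {
            rows[r] = new string[fieldsPerRow];
            Array.Copy(fields, r * fieldsPerRow, rows[r], 0, fieldsPerRow);
        }
        return rows;
    }
}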
-
Thanks Ravi, I'll certainly have a look at it.
Unrequited desire is character building. OriginalGriff
-
I have been looking into your suggestions under point 1, but I have a suspicion they will not fit what I'm doing (my knowledge of Web apps is very weak). The website is a non-public site, so my application has to log in to get the information I need. As far as I'm aware, session data such as authentication information is held in the browser, so the Request/Response model will not work for me. Or do you perhaps know an answer to this?
Unrequited desire is character building. OriginalGriff
-
Richard MacCutchan wrote:
Or do you perhaps know an answer to this?
I always have some kind of an answer. :-D Here I'd say WebBrowser is a high-level Control that you can mimic as much as you want using the lower-level classes I mentioned. I haven't dealt with session data yet; however, I expect it is quite doable. But then it probably doesn't make much sense to reimplement it when WebBrowser offers it for free, unless you have good reasons not to use it. Maybe Test Http endpoints with WebDev.WebServer, NUnit and Salient.Web.HttpLib[^] could help you a bit; mind you, it is a search result, I didn't read it. :)
Luc Pattyn [My Articles] Nil Volentibus Arduum iSad
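PS: should you want to try the lower-level route anyway, here is a rough sketch of carrying the session across requests with a shared CookieContainer; the URLs and form-field names are invented, and I haven't tried this against a real login page.

using System;
using System.IO;
using System.Net;
using System.Text;

class SessionScraper
{
    // One CookieContainer shared across all requests stands in for the
    // session state the browser would otherwise keep for you.
    static readonly CookieContainer Cookies = new CookieContainer();

    static string Post(string url, string formData)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = Cookies;   // session cookies are captured here...
        byte[] body = Encoding.UTF8.GetBytes(formData);
        using (Stream s = request.GetRequestStream())
            s.Write(body, 0, body.Length);
        return Read(request);
    }

    static string Get(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies;   // ...and sent back out here
        return Read(request);
    }

    static string Read(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        // Log in once; the server's session cookie lands in Cookies,
        // so the second request is made as an authenticated user.
        Post("https://example.org/login", "user=me&password=secret");
        string page = Get("https://example.org/members/data");
        Console.WriteLine(page.Length);
    }
}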
-
Luc Pattyn wrote:
I always have some kind of an answer.
Exactly why I addressed my question to you. :thumbsup:
Unrequited desire is character building. OriginalGriff
-
I took your advice and went for the HTML, and it is considerably easier to parse than the raw text. The data I am scraping changed again recently, and the changes I needed to make to my code were much simpler than they would have been if I had stuck with text. Thanks for the tips; I now have a much simpler program to maintain and modify.
Unrequited desire is character building. OriginalGriff
I'm sitting here giving you a standing ovation - Len Goodman
-
I look forward to a Tip/Trick on how you parsed the HTML ... with the understanding that the format of your non-public HTML source may be so unique that what you had to do to parse it just doesn't generalize out to a wider range of scraping/parsing scenarios :) best, Bill
"For no man lives in the external truth among salts and acids, but in the warm, phantasmagoric chamber of his brain, with the painted windows and the storied wall." Robert Louis Stevenson
-
Sorry, but there was nothing special about what I did; I just used the DOM tree to get to the elements I needed and pulled the information from them. It's not the HTML that is the issue but how the content is presented within each element, and I'm sure that the problems I faced (now that I understand it a bit better) are the same as for any screen scraper. There is nothing special or secret about my code and I'd happily share it, but I think there are already a number of articles that describe the process perfectly well; go to Luc's home page for a good start, also JSOP.
Unrequited desire is character building. OriginalGriff
I'm sitting here giving you a standing ovation - Len Goodman
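PS: for what it's worth, the DOM walk boils down to something like this; taking the first table on the page is an assumption for this sketch, the real code has to locate the right element its own way.

using System.Collections.Generic;
using System.Windows.Forms;

static class TableReader
{
    // Walk the rows and cells of the first table in the document and
    // return the cell texts. The WebBrowser control has already built
    // the DOM, so the whitespace quirks of the flattened InnerText
    // never come into play.
    public static List<string[]> ReadFirstTable(HtmlDocument document)
    {
        var rows = new List<string[]>();
        HtmlElementCollection tables = document.GetElementsByTagName("table");
        if (tables.Count == 0)
            return rows;

        foreach (HtmlElement row in tables[0].GetElementsByTagName("tr"))
        {
            var cells = new List<string>();
            foreach (HtmlElement cell in row.GetElementsByTagName("td"))
                cells.Add((cell.InnerText ?? "").Trim());
            rows.Add(cells.ToArray());
        }
        return rows;
    }
}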