Code Project › General Programming › Windows Forms
Screen scraping C#

Tags: help, csharp, com, tools, question
11 Posts, 4 Posters
Lost User (#1)
I've been working on a screen scraper application using the WebBrowser class, for an organisation I am involved with. The basic problem is that I get a screen of data presented in table form and use the HtmlElement.InnerText property to get the raw data, which I then need to parse to extract the bits I want (the actual details are not important). However, I have found that the same screenful of information is passed to me in a slightly different format depending on whether my client PC is running XP, Vista or Windows 7. The content is exactly the same, but fields are separated by spaces, \r\n, or even \r\n\r\n sequences. Has anyone else come across a similar issue, and if so, how did you resolve it?

    Unrequited desire is character building. OriginalGriff
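One way to sidestep the varying separators described above, regardless of which OS produced them, is to collapse every run of whitespace to a single character before parsing. A minimal sketch (not the poster's actual code):

```csharp
using System;
using System.Text.RegularExpressions;

static class Scrape
{
    // Collapse any run of whitespace (spaces, \r, \n, \t) to a single space,
    // so the same table text parses identically on XP, Vista and Windows 7.
    public static string NormalizeWhitespace(string raw)
    {
        return Regex.Replace(raw, @"\s+", " ").Trim();
    }
}
```

With this in front of the parser, `"a \r\n\r\n b"` and `"a b"` become the same input.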

Luc Pattyn (#2)

My two cents on web scraping:

1. I'd rather use WebClient or HttpWebRequest/HttpWebResponse, unless there is some JavaScript/CSS going on that modifies the incoming page before it gets displayed.

2. As web pages tend to be redesigned all the time, your parser needs to be as tolerant as it possibly can be. I typically try to locate the area of interest based on HTML tags, take a substring of that part, then remove all irrelevant stuff, such as further HTML tags (they suddenly wanted the names in bold, the numbers in a larger font, etc.), and reduce whitespace to its essential level (getting rid of \r, \n, \t, multiple spaces, &nbsp;, etc.). You simply can't rely on the exact page content; it will break in a matter of days or weeks, and you will be blamed for your app no longer working properly. :)

PS: when your OSes vary, so will your IE versions, and they may behave differently from one version to the next (that is the whole idea of having a new version, apparently).

      Luc Pattyn [My Articles] Nil Volentibus Arduum iSad
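A rough sketch of the approach Luc describes, combining a WebClient download with the tag-stripping and whitespace reduction he recommends. The entity list here is deliberately minimal and the URL would be whatever page you are scraping:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class Fetcher
{
    // Strip tags, decode a couple of common entities, collapse whitespace.
    public static string Flatten(string html)
    {
        string text = Regex.Replace(html, "<[^>]+>", " ");     // drop HTML tags
        text = text.Replace("&nbsp;", " ").Replace("&amp;", "&");
        return Regex.Replace(text, @"\s+", " ").Trim();        // reduce whitespace
    }

    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            return Flatten(client.DownloadString(url));
        }
    }
}
```

The point of `Flatten` is exactly the tolerance argument above: a redesign that wraps names in `<b>` or changes the separators should not reach the parsing logic at all.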

Lost User (#3)

Thanks for the suggestions Luc.

1. I'll take a look at these options; I did not look too closely at existing samples before starting my project.

2. The problem with the main screen is that it is a large table, and it looked considerably easier to parse as text rather than walking through all the HTML table items.

PS: Of course, that is the obvious answer, thanks again.

        Unrequited desire is character building. OriginalGriff

Ravi Bhavnani (#4)

          This[^] article may come in handy. /ravi

          My new year resolution: 2048 x 1536 Home | Articles | My .NET bits | Freeware ravib(at)ravib(dot)com

Lost User (#5)

            Thanks Ravi, I'll certainly have a look at it.

            Unrequited desire is character building. OriginalGriff

Lost User (#6)

I have been looking into your suggestions at point 1, but I have a suspicion they will not fit what I'm doing (my knowledge of Web apps is very weak). The website is a non-public site, so my application has to log in to get the information I need. As far as I'm aware, session data such as authentication information is held in the browser, so the Request/Response model will not work for me. Or do you perhaps know an answer to this?

              Unrequited desire is character building. OriginalGriff
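For what it's worth, the lower-level route can carry a session too: a shared CookieContainer attached to each HttpWebRequest plays the role the browser normally plays, holding the session cookie set at login. A hedged sketch (URLs and form fields would be whatever the site expects):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class Session
{
    // One container shared by all requests: the login response stores the
    // session cookie here, and later requests send it back automatically.
    static readonly CookieContainer Cookies = new CookieContainer();

    public static string Post(string url, string formData)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = Cookies;   // session cookies ride along

        byte[] body = Encoding.UTF8.GetBytes(formData);
        using (Stream s = request.GetRequestStream())
            s.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }
}
```

Log in first with a `Post` to the site's login form, then fetch the table page through the same container so the server sees the same session.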

Luc Pattyn (#7)

                Richard MacCutchan wrote:

                Or do you perhaps know an answer to this?

I always have some kind of an answer. :-D Here I'd say WebBrowser is a high-level Control that you can mimic as much as you want based on the lower-level classes I mentioned. I haven't dealt with session data yet, however I expect it is quite doable. But then it probably doesn't make much sense if WebBrowser offers it for free and you don't have good reasons not to use it. Maybe "Test Http endpoints with WebDev.WebServer, NUnit and Salient.Web.HttpLib"[^] could help you a bit; mind you, it is a search result, I didn't read it. :)

                Luc Pattyn [My Articles] Nil Volentibus Arduum iSad

Lost User (#8)

                  Luc Pattyn wrote:

                  I always have some kind of an answer.

                  Exactly why I addressed my question to you. :thumbsup:

                  Unrequited desire is character building. OriginalGriff

Lost User (#9)

                    I took your advice and went for the HTML, and it is considerably easier to parse than the raw text. The data I am using changed again recently and the changes I needed to make to my code were much simpler than if I had stuck with text. Thanks for the tips, I now have a much simpler program to maintain and modify.

                    Unrequited desire is character building. OriginalGriff I'm sitting here giving you a standing ovation - Len Goodman
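The DOM route described above can look something like the following with the WebBrowser control; the table id "results" is a placeholder, and the per-cell cleanup is the part that stays stable when the markup or the OS/IE-dependent separators change:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Windows.Forms;

static class TableScraper
{
    // Per-cell cleanup: trim and collapse internal whitespace so the
    // varying separators never reach the parsing logic.
    public static string CleanCell(string innerText)
    {
        return Regex.Replace(innerText ?? "", @"\s+", " ").Trim();
    }

    // Walk a table element row by row; call from the DocumentCompleted
    // handler of a WebBrowser ("results" is a hypothetical table id).
    public static List<List<string>> ReadTable(HtmlDocument doc)
    {
        var rows = new List<List<string>>();
        HtmlElement table = doc.GetElementById("results");
        if (table == null) return rows;

        foreach (HtmlElement tr in table.GetElementsByTagName("tr"))
        {
            var cells = new List<string>();
            foreach (HtmlElement td in tr.GetElementsByTagName("td"))
                cells.Add(CleanCell(td.InnerText));
            rows.Add(cells);
        }
        return rows;
    }
}
```

Because each cell is read individually, a markup redesign only moves rows and cells around; the text extraction itself does not need to change.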

BillWoodruff (#10)

Look forward to a Tip/Trick on how you parsed the HTML ... with the understanding that the format of your non-public HTML source may be so unique that what you had to do to parse it just doesn't generalize out to a wider range of scraping/parsing scenarios. :) best, Bill

                      "For no man lives in the external truth among salts and acids, but in the warm, phantasmagoric chamber of his brain, with the painted windows and the storied wall." Robert Louis Stevenson

Lost User (#11)

Sorry, but there was nothing special about what I did; I just used the DOM tree to get to the elements I needed and pulled the information from them. It's not the HTML that is the issue but how the content is presented within each element, and I'm sure that the problems I faced (now that I understand it a bit better) are the same as for any screen scraper. There is nothing special or secret about my code and I'd happily share it, but I think there are already a number of articles that describe the process perfectly well; go to Luc's home page for a good start, also JSOP.

                        Unrequited desire is character building. OriginalGriff I'm sitting here giving you a standing ovation - Len Goodman
