Web Scraping XML file

    mjackson11
    wrote on last edited by
    #1

    I am trying to scrape a web page (http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml). If I open the link in a browser window, there is information about when the report was created, totals, etc. If I scrape the URL from C# using a WebClient or WebRequest, only the XML file is returned. It appears that much of the information on the page is stored in a dynamically created style sheet on their web site. Is there a way to use a WebClient or WebRequest to get exactly what I see on the web page? Thanks, Mark Jackson

      PIEBALDconsult
      wrote on last edited by
      #2

      Dunno. Maybe try the WebBrowser control and check its document property.

        Lost User
        wrote on last edited by
        #3

        Quote:

        Is there a way to use a WebClient or WebRequest to get exactly what I see on the web page?

        No, neither of these will perform any output rendering. The XML file you get is exactly what they are sending to your browser. Your browser then applies the XSL stylesheet referenced by the XML, which uses JavaScript to generate the pretty table. As the previous poster mentioned, you could try using the WebBrowser control to load and render the page, and see if you can get the rendered source from either WebBrowser.DocumentText or WebBrowser.DocumentStream.

        “I have no special talents. I am only passionately curious.” - Albert Einstein

        • P PIEBALDconsult

          Dunno. Maybe try the WebBrowser control and check its document property.

          Dave Kreskowiak
          wrote on last edited by
          #4

          Some idiot voted you a 1 for a perfectly reasonable answer. Countered.

          A guide to posting questions on CodeProject[^]
          Dave Kreskowiak

            Keith Barrow
            wrote on last edited by
            #5

            It isn't an HTML page (if that's what you mean by a web page); it's XML. The XML contains a link to a stylesheet:

            <?xml-stylesheet type="text/xsl" href="http://reports.ieso.ca/docrefs/stylesheet/GenOutputCapability_HTML_t1-1.xsl" ?>

            Your browser uses this to transform the XML into HTML, which it then displays (assuming your browser supports this). If you are using the WebBrowser control (as PIEBALDconsult suggests), you might need to apply the transform yourself. This article has a basic outline of how: http://ivanov.wordpress.com/2006/11/17/xml-to-html/ Obviously, you'll need to get the stylesheet first, which means parsing the XML to find its location and downloading it after the XML has been received.
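A minimal sketch of doing that transform yourself, assuming the stylesheet is plain XSLT 1.0 that XslCompiledTransform can load with its default settings. The class and method names (StylesheetTransform, FindStylesheetHref, Transform) are made up for illustration, and both documents are passed in as strings so the download step stays separate. If the real stylesheet relies on embedded script or the document() function, Load would probably need an XsltSettings argument as well.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// Hypothetical helper class; the names are illustrative, not from the thread.
public static class StylesheetTransform
{
    // Returns the href of the first <?xml-stylesheet ...?> processing
    // instruction, or null if the document does not declare one.
    public static string FindStylesheetHref(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        XProcessingInstruction pi = doc.Nodes()
            .OfType<XProcessingInstruction>()
            .FirstOrDefault(p => p.Target == "xml-stylesheet");
        if (pi == null) return null;

        // The PI data is pseudo-attribute text: type="text/xsl" href="..."
        Match m = Regex.Match(pi.Data, "href\\s*=\\s*[\"']([^\"']+)[\"']");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Applies an XSLT stylesheet to an XML document and returns the result.
    public static string Transform(string xml, string xsl)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(xsl)))
        {
            xslt.Load(xslReader);
        }

        using (XmlReader xmlReader = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(xmlReader, null, output);
            return output.ToString();
        }
    }
}
```

To use it, download the XML (for example with WebClient.DownloadString), pass it to FindStylesheetHref, download the stylesheet from the returned URL, and hand both strings to Transform. The result is the HTML that the browser would otherwise build for you.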

            Sort of a cross between Lawrence of Arabia and Dilbert.[^]
            -Or-
            A Dead ringer for Kate Winslett[^]

            • D Dave Kreskowiak

              Some idiot voted you a 1 for a perfectly reasonable answer. Countered.

              A guide to posting questions on CodeProject[^]
              Dave Kreskowiak

              PIEBALDconsult
              wrote on last edited by
              #6

              Oh, thanks. :thumbsup:

                BobJanova
                wrote on last edited by
                #7

                The calculation of the totals is done by the XSLT stylesheet, along with the layout. You have two options: parse the XML and do the data manipulation that you need yourself (all the primary data is in the XML, i.e. all the information that is displayed can be generated from it), or use an XSLT-capable library to turn the XML into HTML and then parse the information out of that. I'd go the first way: read the XML into a DataTable or a List<Generator> (some parsing code will probably be needed, though if you set up your objects correctly you should be able to load it with LINQ to XML), and then do grouping, totalling etc. as you require.
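A rough sketch of the first option with LINQ to XML. The element names used here (Generator, GeneratorName, FuelType, EnergyMW) are assumptions about the report layout, not taken from this thread, so check them against the actual file. Elements are matched on local name so the code does not care whether the document declares a default namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical model; property names are illustrative.
public sealed class Generator
{
    public string Name { get; set; }
    public string FuelType { get; set; }
    public List<decimal> HourlyOutputMW { get; set; }
}

public static class GeneratorReport
{
    // Finds descendants by local name, ignoring any XML namespace.
    private static IEnumerable<XElement> Named(XContainer parent, string localName)
    {
        return parent.Descendants().Where(e => e.Name.LocalName == localName);
    }

    // Reads every (assumed) <Generator> element into a Generator object.
    public static List<Generator> Load(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return Named(doc, "Generator")
            .Select(g => new Generator
            {
                Name = Named(g, "GeneratorName").Select(e => e.Value).FirstOrDefault(),
                FuelType = Named(g, "FuelType").Select(e => e.Value).FirstOrDefault(),
                // Blank values are skipped rather than treated as zero.
                HourlyOutputMW = Named(g, "EnergyMW")
                    .Where(e => !string.IsNullOrWhiteSpace(e.Value))
                    .Select(e => decimal.Parse(e.Value, CultureInfo.InvariantCulture))
                    .ToList()
            })
            .ToList();
    }

    // Sums all hourly outputs per fuel type.
    public static Dictionary<string, decimal> TotalByFuelType(IEnumerable<Generator> generators)
    {
        return generators
            .GroupBy(g => g.FuelType)
            .ToDictionary(grp => grp.Key, grp => grp.Sum(g => g.HourlyOutputMW.Sum()));
    }
}
```

Once the data is in a List<Generator>, any other total or grouping shown on the rendered page should be one more GroupBy or Sum away. Note that TotalByFuelType assumes every generator has a fuel type; a missing one would need handling before the ToDictionary call.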

                  mjackson11
                  wrote on last edited by
                  #8

                  I took the easy route, which was to create a small Windows Forms application that uses a WebBrowser object to render the page and then writes the text stream from the WebBrowser to an email message. Not elegant, but it works. Thank you for all the replies. Mark Jackson
