Web Scraping XML file
-
I am trying to scrape a web page. The URL is http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml. If I open the link in a browser window, there is information about when the report was created, totals, etc. If I scrape the URL from C# using a WebClient or WebRequest, only the XML file is returned. It appears that much of the information on the page is stored in a dynamically created style sheet on their web site. Is there a way to use a WebClient or WebRequest to get exactly what I see on the web page? Thanks, Mark Jackson
-
Dunno. Maybe try the WebBrowser control and check its document property.
-
Quote:
Is there a way to use a WebClient or WebRequest to get exactly what I see on the web page?
No, neither of these will perform any output rendering. The XML file you get is exactly what they are sending to your browser. Your browser then applies the XSL (XML style sheet) that the XML file references, which uses JavaScript to generate the pretty table. As the previous poster mentioned, you could try using the WebBrowser control to grab the page and render it, and see if you can get the rendered source from either WebBrowser.DocumentText or WebBrowser.DocumentStream.
“I have no special talents. I am only passionately curious.” - Albert Einstein
-
Quote:
Dunno. Maybe try the WebBrowser control and check its document property.
Some idiot voted you a 1 for a perfectly reasonable answer. Countered.
A guide to posting questions on CodeProject[^]
Dave Kreskowiak
-
It isn't an HTML page (if that's what you mean by a web page); it's XML. The XML contains a link to a stylesheet:
<?xml-stylesheet type="text/xsl" href="http://reports.ieso.ca/docrefs/stylesheet/GenOutputCapability_HTML_t1-1.xsl" ?>
This is used by your browser to transform the XML into HTML, which it then displays (assuming your browser supports XSLT). If you are using the WebBrowser control (as Piebald Consultant suggests), you might need to do the transform yourself. This article has a basic outline of how: http://ivanov.wordpress.com/2006/11/17/xml-to-html/ Obviously, you'll need to get the transform file first, which means parsing the XML to find its location and downloading it after the XML has been received.
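The steps described here (read the XML, find the stylesheet location, download it, apply it) could be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (XsltScrape, FindStylesheetHref, Transform, DownloadAndTransform) are made up for this post, and it assumes the stylesheet is plain XSLT 1.0 with no embedded script, no document() calls and no relative xsl:import/xsl:include, any of which would need extra settings on XslCompiledTransform.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// Hypothetical helper class; the names are illustrative, not from any library.
public static class XsltScrape
{
    // Returns the href of the first <?xml-stylesheet ...?> instruction, or null if none.
    public static string FindStylesheetHref(XDocument doc)
    {
        var pi = doc.Nodes().OfType<XProcessingInstruction>()
                    .FirstOrDefault(p => p.Target == "xml-stylesheet");
        if (pi == null) return null;
        var m = Regex.Match(pi.Data, "href\\s*=\\s*[\"']([^\"']+)[\"']");
        return m.Success ? m.Groups[1].Value : null;
    }

    // Applies the XSLT stylesheet to the XML document and returns the output as text.
    public static string Transform(XDocument xml, XDocument xsl)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader xslReader = xsl.CreateReader())
            xslt.Load(xslReader);
        using (XmlReader xmlReader = xml.CreateReader())
        using (var output = new StringWriter())
        {
            xslt.Transform(xmlReader, null, output);
            return output.ToString();
        }
    }

    // Downloads as bytes so the XML parser handles the encoding and any BOM.
    static XDocument Fetch(WebClient client, Uri uri)
    {
        using (var stream = new MemoryStream(client.DownloadData(uri)))
            return XDocument.Load(stream);
    }

    // Downloads the XML, then the stylesheet it points to, and applies the transform.
    public static string DownloadAndTransform(string url)
    {
        var baseUri = new Uri(url);
        using (var client = new WebClient())
        {
            XDocument xml = Fetch(client, baseUri);
            string href = FindStylesheetHref(xml);
            if (href == null) return xml.ToString();   // no stylesheet: return the XML as-is
            XDocument xsl = Fetch(client, new Uri(baseUri, href));
            return Transform(xml, xsl);
        }
    }
}
```

Calling XsltScrape.DownloadAndTransform with the report URL should then give you roughly the HTML the browser builds, provided the assumptions above hold for that stylesheet.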
Sort of a cross between Lawrence of Arabia and Dilbert.[^]
-Or-
A Dead ringer for Kate Winslett[^]
-
Quote:
Some idiot voted you a 1 for a perfectly reasonable answer. Countered.
Oh, thanks. :thumbsup:
-
The totals are calculated by the XSL stylesheet, along with the layout. You have two options: parse the XML and do the data manipulation you need yourself (all the primary data is in the XML, i.e. everything that is displayed can be generated from it), or use an XSLT-capable library to turn the XML into HTML and then parse the information out of that. I'd go the first way: read the XML into a DataTable or a List<Generator> (some parsing code will probably be needed, though if you set up your objects correctly you should be able to load it with LINQ to XML), and then do grouping, totalling etc. as you require.
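A minimal sketch of that first option, loading into a List<Generator> with LINQ to XML and totalling by group. I haven't checked the actual schema of the report, so the element names used here (Generator, Name, FuelType, Output) are placeholders you would need to replace with whatever the real file uses; the Generator and GeneratorReport classes are likewise made up for illustration. Matching on LocalName is just a way to sidestep whatever default namespace the file declares.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical record type; the properties mirror the placeholder element names.
public class Generator
{
    public string Name { get; set; }
    public string FuelType { get; set; }
    public double Output { get; set; }
}

public static class GeneratorReport
{
    // Returns the trimmed text of the first child with the given local name, or "".
    static string Child(XElement parent, string localName)
    {
        var el = parent.Elements().FirstOrDefault(e => e.Name.LocalName == localName);
        return el == null ? "" : el.Value.Trim();
    }

    // Parses a number using the invariant culture; unparsable text counts as 0.
    static double ToNumber(string text)
    {
        double value;
        return double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out value)
            ? value
            : 0.0;
    }

    // ASSUMPTION: "Generator", "Name", "FuelType" and "Output" are placeholder names.
    public static List<Generator> Parse(XDocument doc)
    {
        return doc.Descendants()
                  .Where(e => e.Name.LocalName == "Generator")
                  .Select(g => new Generator
                  {
                      Name = Child(g, "Name"),
                      FuelType = Child(g, "FuelType"),
                      Output = ToNumber(Child(g, "Output"))
                  })
                  .ToList();
    }

    // Sums the output per fuel type, i.e. the kind of total the stylesheet displays.
    public static Dictionary<string, double> TotalByFuel(IEnumerable<Generator> generators)
    {
        return generators.GroupBy(g => g.FuelType)
                         .ToDictionary(grp => grp.Key, grp => grp.Sum(g => g.Output));
    }
}
```

With the XML loaded via XDocument.Load, GeneratorReport.TotalByFuel(GeneratorReport.Parse(doc)) gives you one total per group, and any other grouping is a one-line change to the GroupBy.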
-
I took the easy route which was to create a small forms application that used a WebBrowser object to render the page, then output the text stream from the WebBrowser to an email message. Not elegant but it works. Thank you for all the replies. Mark Jackson