HTML Page download in C#
-
Hi, all! Here is the problem: I'm writing some code that gets an HTML page and then grabs its content. The content is organized across multiple pages, and navigation between them is done by clicking on a page number below the records (for example, 150 records displayed as 10 pages * 15 records/page, so the web page contains 10 hyperlinks to the other pages of records). Obviously, to get all the information I need, I have to loop through all the page links, download their HTML, and then parse the information. The problem is that I can only download 2 pages from the list. For some unknown reason, my code freezes after it downloads 2 pages. The order of the pages does not matter; for example, if I start from page #5, I can only get pages 5 and 6. According to common sense and the VS debugger, the problem lies in the method that downloads the HTML code:
public delegate byte[] getHTTPdelegate(Uri address); // delegate defined as a class member, used to perform the async page download

public void downloadPage(string URL)
{
    // create a new WebClient
    client = new WebClient();
    // assign the download method to the delegate
    getHTTPdelegate dl = client.DownloadData;
    // start the async download
    IAsyncResult ar = dl.BeginInvoke(new Uri(URL), null, null);
    while (!ar.IsCompleted)
    {
        Thread.Sleep(10);
    }
    // rawPage contains the HTML as byte[], the result of the async download
    rawPage = dl.EndInvoke(ar); // this is also the line where the exception occurs
}
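For context, the surrounding loop looks roughly like this (the pageLinks collection and the parsing step are placeholders, not my actual code; only downloadPage is real):

foreach (string pageURL in pageLinks) // pageLinks: the 10 hyperlinks scraped from the first page (name assumed)
{
    downloadPage(pageURL); // succeeds for the first two calls, then freezes
    parseRecords(rawPage); // hypothetical parsing step
}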
After downloading page #2, the application stops, and a minute or two later it throws an unhandled exception stating that the operation has timed out. Please note that the question is not "why isn't it working?" but "why does it work only 2 times?" when it should be downloading all the pages. Any ideas will be highly appreciated.
-
The sleep loop is a bad idea, and the downloading is not really asynchronous (because you're just waiting for it). Did you know WebClient has a method called DownloadDataAsync? I don't know why it works exactly twice, though.
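Something like this, for instance (a minimal sketch reusing your client and rawPage members; the method name and error handling are illustrative):

using System;
using System.Net;

public void downloadPageAsync(string URL) // hypothetical replacement for downloadPage
{
    client = new WebClient();
    client.DownloadDataCompleted += (sender, e) =>
    {
        if (e.Error == null)
            rawPage = e.Result; // the HTML as byte[], same as DownloadData returns
    };
    // queues the download and returns immediately; no Sleep loop needed
    client.DownloadDataAsync(new Uri(URL));
}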
Thanks for the reply, Harold. I know about DownloadDataAsync, but I hadn't tried it. I'll try it now and post the result. UPDATE: I've implemented the download via DownloadDataAsync, but the 2-page problem remains :-( The same exception was thrown, too.