Problem downloading HTML from the web with the HttpWebRequest object
-
Hi. I'm trying to download HTML text from 'amazon.com' using this method:

HttpWebRequest hRequest = (HttpWebRequest)WebRequest.Create(url);
hRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-ms-application, */*";
hRequest.ContentType = "application/x-www-form-urlencoded";
hRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; InfoPath.2)";
hRequest.Headers.Add("Accept-Encoding", "gzip, deflate");
hRequest.Headers.Add("UA-CPU", "x86");
hRequest.Method = "GET";
HttpWebResponse hResponse = (HttpWebResponse)hRequest.GetResponse();
StreamReader s = new StreamReader(hResponse.GetResponseStream(), Encoding.GetEncoding(hResponse.CharacterSet));
string page = s.ReadToEnd();

I know that Amazon uses the "iso-8859-1" character set, which is also what the HttpWebResponse.CharacterSet property returns. But for some reason, when I examine the string it contains scrambled characters, so searching the text with the usual string methods doesn't work. However, if I use the WebClient object's DownloadString method to retrieve the page, it shows up fine, but it takes 30 seconds to get the string! I don't know whether that's due to heavy processing or something else, but it's not flexible enough and doesn't meet my needs. Does anyone have an idea why I'm getting an invalid string?
-
Did you try running 'HtmlDecode' on the text you read?
-
From my understanding, the HtmlDecode method just replaces encoded entities such as "&lt;" with the corresponding HTML characters. That's not the issue in my case. But thanks anyway.
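To make the point about HtmlDecode concrete, here is a minimal sketch of what it does. It assumes System.Net.WebUtility is available (the thread does not say which HtmlDecode was meant), and the class name HtmlDecodeDemo is purely illustrative:

```csharp
using System;
using System.Net;

static class HtmlDecodeDemo
{
    // Turns HTML entities back into the characters they stand for.
    // It does not touch the character encoding of the text itself.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

For example, HtmlDecodeDemo.Decode("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;") gives back the markup with real angle brackets and an ampersand, which is why it would not repair a string that is scrambled for some other reason.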
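You've probably figured this out already, but:

// 'response' is assumed to be the HttpWebResponse returned by GetResponse()
// buffer and builder for the downloaded text (declarations added; the size is arbitrary)
byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();

// read data via the response stream
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
    count = resStream.Read(buf, 0, buf.Length);
    if (count != 0)
    {
        // translate from bytes to ASCII text (bytes above 0x7F come out as '?')
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        // continue building the string
        sb.Append(tempString);
    }
} while (count > 0);

Cheers, RG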
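One possible cause of the scrambled text in the original question, offered as an assumption rather than a confirmed diagnosis: the request adds "Accept-Encoding: gzip, deflate" by hand, so the server may send a compressed body, and HttpWebRequest only decompresses it when AutomaticDecompression is set. WebClient does not send that header by default, which would fit with DownloadString returning readable text. A minimal sketch under that assumption follows; the names PageDownloader, ReadAll and Download are illustrative, and the fallback to "iso-8859-1" is a choice made here, not something from the thread:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageDownloader
{
    // Decodes the whole stream using the named charset.
    // Falls back to ISO-8859-1 when the name is missing or unknown (an assumption).
    public static string ReadAll(Stream body, string charset)
    {
        Encoding encoding;
        try
        {
            encoding = Encoding.GetEncoding(
                string.IsNullOrEmpty(charset) ? "iso-8859-1" : charset);
        }
        catch (ArgumentException)
        {
            encoding = Encoding.GetEncoding("iso-8859-1");
        }

        using (StreamReader reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static string Download(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";

        // Instead of adding Accept-Encoding by hand, let the framework
        // send the header and decompress the response transparently.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body, response.CharacterSet);
        }
    }
}
```

Usage would be something like string page = PageDownloader.Download(url); after which the ordinary string methods should work on page. The ContentType header from the original request is left out because a GET request has no body for it to describe.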