Parsing the HTML of a website using C#

shivamkalra

Hello everyone, I'm working on a small hobby project. I'm making a web crawler for a particular website to extract some useful information from it. I've written the information extraction algorithms, but I'm completely new to HTTP responses. I'm using the .NET HttpWebResponse class to get the source of a web page through a StreamReader. Now I'm wondering whether I should process the stream line by line, or read the whole StreamReader into a string and process that. Let's say I want to extract the .mp3 links on a web page using a regex. Is it possible for a single link to be split across two different lines of the StreamReader? Look at the code below:

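        // Requires: using System; using System.IO; using System.Net;
        string url = "http://songs.pk";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Dispose the response and the reader when done so the connection is released.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                // Add some processing code here
                Console.WriteLine(line);
            }
        }

        // Wait for a key press once, after the whole page has been printed.
        Console.Read();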

Can I add some regex matching code inside the loop? Will that be enough to extract all the .mp3 links from this website, or should I read the StreamReader into a string and run the regex on that string? I'm sorry if I'm misunderstanding StreamReader here, but I need some suggestions on how to parse the source of a website. I've searched for articles and on Google, but I haven't found anything that helps. Any articles, links or suggestions would be appreciated.

Thanks,
Shivam Kalra

JV9999

Your code is fine. It reads the HTML from the page line by line, so you can process the whole page one line at a time. This is a pretty common scenario. You can use a regex to check whether the value of the line variable contains what you are searching for :).
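
A rough sketch of what per-line matching could look like is below. The class name LineByLineMp3Scanner and the regex pattern are made up for illustration and are not from the original code. It assumes every link sits entirely on one line, so a link whose href attribute is broken across two lines would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineByLineMp3Scanner
{
    // Hypothetical pattern: an href attribute whose quoted value ends in ".mp3".
    private static readonly Regex Mp3Href = new Regex(
        "href\\s*=\\s*[\"']([^\"']+\\.mp3)[\"']",
        RegexOptions.IgnoreCase);

    // Reads the text one line at a time and collects the matches found on each line.
    // Assumption: a link never spans a line break; such links are not found here.
    public static List<string> Scan(TextReader reader)
    {
        var links = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (Match m in Mp3Href.Matches(line))
            {
                links.Add(m.Groups[1].Value);
            }
        }
        return links;
    }
}
```

Since StreamReader derives from TextReader, the sr variable from the question could be passed straight to Scan.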

BobJanova

I would read the whole stream (to get the full page content) and run your regexes on that. Streaming is very useful for large downloads, but a web page is easily small enough to work on in memory. StreamReader.ReadToEnd is the easiest way to do that.
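
Something along these lines might work as a starting point. The class name WholePageMp3Extractor, its method names and the regex pattern are invented for this sketch, and the pattern only handles quoted href values, so treat it as an illustration rather than a complete link parser. Because the regex runs over the whole page, a link that is broken across lines is still matched.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class WholePageMp3Extractor
{
    // Hypothetical pattern: an href attribute whose quoted value ends in ".mp3".
    // \s also matches line breaks, so whitespace around '=' may span lines.
    private static readonly Regex Mp3Href = new Regex(
        "href\\s*=\\s*[\"']([^\"']+\\.mp3)[\"']",
        RegexOptions.IgnoreCase);

    // Reads the entire response stream into one string with StreamReader.ReadToEnd.
    public static string ReadPage(Stream responseStream)
    {
        using (var reader = new StreamReader(responseStream))
        {
            return reader.ReadToEnd();
        }
    }

    // Runs the regex once over the full page text and returns the captured URLs.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in Mp3Href.Matches(html))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

With the code from the question, this would be called as WholePageMp3Extractor.ExtractLinks(WholePageMp3Extractor.ReadPage(response.GetResponseStream())).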

GenJerDan

You could also use the HTMLDocument, IHTMLElementCollection, and IHTMLElement to read each tag found in the response and grab the innerHTML if it's a tag you're interested in.
