Parsing HTML of some website using C Sharp.

    shivamkalra wrote (#1):

    Hello everyone, I'm working on a small hobby project: a web crawler for one particular website, to extract some useful information from it. I've written the information-extraction algorithms, but I'm completely new to HTTP responses. I'm using the .NET HttpWebResponse class to get the source of a web page through a StreamReader. Now I'm wondering whether I should process the stream line by line, or read the whole StreamReader into a string and process that. Let's say I want to extract the .mp3 links on a web page using a regex: is it possible for a single link to be split across two different lines of the StreamReader? Look at the code below.

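        // Requires: using System; using System.IO; using System.Net;
        string url = "http://songs.pk";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

        // Dispose the response and the reader when done so the connection is released.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                // Add some processing code here
                Console.WriteLine(line);
            }
        }

        // Wait for a key press once, after the whole page has been printed.
        Console.Read();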

    Can I add some regex matching code here? Will this code be able to extract all the .mp3 links from this website, or should I read the StreamReader into a string and run the regex on that string? I'm sorry if I'm misunderstanding how StreamReader works, but I need some suggestions for parsing the source of a website. I've searched articles and Google but haven't found anything that helps. Any articles, links or suggestions would be appreciated. Thanks, Shivam Kalra

      JV9999 wrote (#2):

      Your code is fine. You read the HTML line by line from the page, and you can process the whole page that way. This is a pretty common scenario. You can use a regex to check whether the value of the line variable contains what you are searching for :).
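
A minimal sketch of the per-line approach described in this reply. The class name, method name and regex pattern are illustrative assumptions, not something from the thread; the pattern only looks for quoted href values ending in .mp3. Because each line is matched on its own, a link whose markup is broken across a line break would not be found by this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class LineScanner
{
    // Hypothetical pattern: an href attribute whose quoted value ends in .mp3.
    static readonly Regex Mp3Href = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+\\.mp3)[\"']",
        RegexOptions.IgnoreCase);

    // Reads the input one line at a time and collects every match found on each line.
    public static List<string> FindMp3Links(TextReader reader)
    {
        var links = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (Match m in Mp3Href.Matches(line))
            {
                links.Add(m.Groups["url"].Value);
            }
        }
        return links;
    }
}
```

Since StreamReader derives from TextReader, the sr variable from the original post could be passed in directly: LineScanner.FindMp3Links(sr).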

        BobJanova wrote (#3):

        I would read the whole stream (to get the full page content) and use your regexes on that. The streaming is very useful for large downloads, but a web page is easily small enough to work on in memory. StreamReader.ReadToEnd is the easiest way to do that.
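
A minimal sketch of the read-everything approach suggested in this reply, using StreamReader.ReadToEnd (here through the TextReader base class). The class name, method name and regex pattern are illustrative assumptions; the pattern only covers quoted href values ending in .mp3. Matching against the whole page means that whitespace or a line break between href, the equals sign and the quoted value does not prevent a match.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class PageScanner
{
    // Hypothetical pattern: an href attribute whose quoted value ends in .mp3.
    // \s also matches line breaks, so the attribute may wrap onto a new line.
    static readonly Regex Mp3Href = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+\\.mp3)[\"']",
        RegexOptions.IgnoreCase);

    // Reads the entire input into one string, then runs the regex once over it.
    public static List<string> FindMp3Links(TextReader reader)
    {
        string html = reader.ReadToEnd();
        var links = new List<string>();
        foreach (Match m in Mp3Href.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

With the code from the original post, this could be called as PageScanner.FindMp3Links(sr) in place of the ReadLine loop.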

          GenJerDan wrote (#4):

          You could also use the HTMLDocument, IHTMLElementCollection, and IHTMLElement to read each tag found in the response and grab the innerHTML if it's a tag you're interested in.

          Build a man a fire, and he'll be warm for a day. Set a man on fire, and he'll be warm for the rest of his life.
