Code Project

crawling a website

C#
uglyeyes wrote (#1):

    Hi, I need to crawl a website with the URL format below and store the content locally in a CSV file. I only need to crawl URLs of this format, looping the query string from 0 to 1000000 (or the maximum value present): http://www.mysite.com.au/products/products.asp?p=101. The only way I can think of is to loop from 0 to 1000000, but some IDs don't exist and redirect to the main page (http://www.mysite.com.au/products), and I need to exclude those from the crawl. Below is the code I have so far. Please assist me with how I can achieve this.

    public static void CrawlSite()
    {
        Console.WriteLine("Beginning crawl.");
        CrawlPage("http://www.mysite.com.au/products/products.asp?p=");
        Console.WriteLine("Finished crawl.");
    }

    private static void CrawlPage(string url)
    {
        for (int i = 0; i <= 100000; i++)
        {
            // Append the loop index to the base URL before requesting it.
            string pageUrl = url + i.ToString();
            Console.WriteLine("Crawling " + pageUrl);

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageUrl);
            request.UserAgent = "blah!";

            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string htmlText = reader.ReadToEnd();

                // TODO: check whether this ID redirected to the home page;
                // if it did, skip writing it to the file.
                using (StreamWriter sw = File.AppendText(@"c:\logs\data.txt"))
                {
                    sw.WriteLine(htmlText);
                }
            }
        }
    }
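    One way to spot the non-existent IDs is to let the request follow the redirect, then compare the URL you asked for with the URL you ended up on (`HttpWebResponse.ResponseUri`). A minimal sketch, assuming the site redirects missing IDs to http://www.mysite.com.au/products as described above; the helper name `IsHomeRedirect` is made up for illustration, not part of any library:

    ```csharp
    using System;

    class RedirectCheckSketch
    {
        // True when the final URL's path differs from the one requested,
        // i.e. the site bounced a missing product ID back to the main page.
        static bool IsHomeRedirect(Uri requested, Uri final)
        {
            return !string.Equals(requested.AbsolutePath, final.AbsolutePath,
                                  StringComparison.OrdinalIgnoreCase);
        }

        static void Main()
        {
            Uri requested = new Uri("http://www.mysite.com.au/products/products.asp?p=101");
            Uri bounced   = new Uri("http://www.mysite.com.au/products");
            Uri answered  = new Uri("http://www.mysite.com.au/products/products.asp?p=101");

            Console.WriteLine(IsHomeRedirect(requested, bounced));   // True  -> skip this ID
            Console.WriteLine(IsHomeRedirect(requested, answered));  // False -> save the page
        }
    }
    ```

    Inside `CrawlPage`, cast the response to `HttpWebResponse` and pass its `ResponseUri` as the second argument; write the HTML to the file only when the helper returns false.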


      RyanEK wrote (#2):

      Look into Regex; that would be the way to go.
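      If you go the Regex route, one approach is to match a marker in the returned HTML that only appears on the main products page. A sketch, assuming the home page has a distinctive `<title>`; the exact pattern here is a placeholder you would take from the real page:

      ```csharp
      using System;
      using System.Text.RegularExpressions;

      class RegexCheckSketch
      {
          // True when the fetched HTML looks like the main products page rather
          // than an individual product. The title text is a hypothetical marker.
          static bool LooksLikeHomePage(string html)
          {
              return Regex.IsMatch(html, @"<title>\s*Products\s*</title>",
                                   RegexOptions.IgnoreCase);
          }

          static void Main()
          {
              Console.WriteLine(LooksLikeHomePage("<html><title>Products</title></html>"));   // True
              Console.WriteLine(LooksLikeHomePage("<html><title>Widget 101</title></html>")); // False
          }
      }
      ```

      Note that matching rendered HTML is fragile if the site changes its markup; checking the response's final URL is usually more robust.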
