crawling a website
-
Hi, I need to crawl a website and store the content locally in a CSV file. I only need to crawl URLs of the format below, looping the query string from 0 to 1000000 (or whatever the maximum value present is):

http://www.mysite.com.au/products/products.asp?p=101

The only way I can think of is to loop from 0 to 1000000, but some ids don't exist and redirect to the main page (http://www.mysite.com.au/products), and I need to exclude those from the crawl list. Below is the code I have so far. Please assist me with how I can achieve this.
public static void CrawlSite()
{
    Console.WriteLine("Beginning crawl.");
    CrawlPage("http://www.mysite.com.au/products/products.asp?p=");
    Console.WriteLine("Finished crawl.");
}

private static void CrawlPage(string url)
{
    for (int i = 0; i <= 100000; i++)
    {
        Console.WriteLine("Crawling " + url + i);
        // Request the page for this id (the id needs to be appended to the base url).
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url + i);
        request.UserAgent = "blah!";
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string htmlText = reader.ReadToEnd();
            // TODO: if the response redirected to the home page, skip it instead of writing it to the file.
            using (StreamWriter sw = File.AppendText(@"c:\logs\data.txt"))
            {
                sw.WriteLine(htmlText);
            }
        }
    }
}
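One possible way to do the exclusion (a rough sketch, not a tested implementation): since the missing ids end up at http://www.mysite.com.au/products, you can let HttpWebRequest follow the redirect and then inspect HttpWebResponse.ResponseUri to see where the request actually landed. The FetchProductPage helper below is a hypothetical name, it uses the same System, System.IO and System.Net namespaces as the existing code, and the path comparison assumes the home page really is served from /products; adjust it to whatever the site actually returns.

private static string FetchProductPage(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.UserAgent = "blah!";
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        // ResponseUri is the final URL after any redirects have been followed.
        // If we ended up on the products home page, this id does not exist.
        string finalPath = response.ResponseUri.AbsolutePath.TrimEnd('/');
        if (finalPath.EndsWith("/products", StringComparison.OrdinalIgnoreCase))
        {
            return null; // signal "skip this id"
        }
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}

Inside the loop you would then only append when something comes back:

string html = FetchProductPage(url + i);
if (html != null)
{
    using (StreamWriter sw = File.AppendText(@"c:\logs\data.txt"))
    {
        sw.WriteLine(html);
    }
}

An alternative is to set request.AllowAutoRedirect = false and treat any 3xx status code as a missing id, which avoids depending on the exact home-page path.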