Code Project

crawling a website

C#
uglyeyes wrote (#1):

    Hi, I need to crawl a website with the URL format below and store the content locally in a CSV file. I only need to crawl URLs of this format, looping the query string from 0 to 1000000 (or the maximum value present): http://www.mysite.com.au/products/products.asp?p=101. The only way I can think of is to loop from 0 to 1000000, but some IDs don't exist and redirect to the main page (http://www.mysite.com.au/products), and I need to exclude those from the crawl. Below is the code I have so far. Please assist me with how I can achieve this.

    public static void CrawlSite()
    {
        Console.WriteLine("Beginning crawl.");
        CrawlPage("http://www.mysite.com.au/products/products.asp?p=");
        Console.WriteLine("Finished crawl.");
    }

    private static void CrawlPage(string url)
    {
        for (int i = 0; i <= 100000; i++)
        {
            // Append the loop index to the base URL before requesting it.
            string pageUrl = url + i.ToString();
            Console.WriteLine("Crawling " + pageUrl);

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageUrl);
            request.UserAgent = "blah!";

            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                string htmlText = reader.ReadToEnd();

                // TODO: check whether this ID redirected to the home page;
                // if it did, skip writing it to the file.
                using (StreamWriter sw = File.AppendText(@"c:\logs\data.txt"))
                {
                    sw.WriteLine(htmlText);
                }
            }
        }
    }
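    One way to spot the non-existent IDs is to let the request follow the redirect, then compare the URL you asked for with the URL you ended up on (`HttpWebResponse.ResponseUri`). A minimal sketch, assuming the site redirects missing IDs to http://www.mysite.com.au/products as described above; the helper name `IsHomeRedirect` is made up for illustration, not part of any library:

    ```csharp
    using System;

    class RedirectCheckSketch
    {
        // True when the final URL's path differs from the one requested,
        // i.e. the site bounced a missing product ID back to the main page.
        static bool IsHomeRedirect(Uri requested, Uri final)
        {
            return !string.Equals(requested.AbsolutePath, final.AbsolutePath,
                                  StringComparison.OrdinalIgnoreCase);
        }

        static void Main()
        {
            Uri requested = new Uri("http://www.mysite.com.au/products/products.asp?p=101");
            Uri bounced   = new Uri("http://www.mysite.com.au/products");
            Uri answered  = new Uri("http://www.mysite.com.au/products/products.asp?p=101");

            Console.WriteLine(IsHomeRedirect(requested, bounced));   // True  -> skip this ID
            Console.WriteLine(IsHomeRedirect(requested, answered));  // False -> save the page
        }
    }
    ```

    Inside `CrawlPage`, cast the response to `HttpWebResponse` and pass its `ResponseUri` as the second argument; write the HTML to the file only when the helper returns false.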


      RyanEK wrote (#2):

      Look into Regex; that would be the way to go.
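      If you go the Regex route, one approach is to match a marker in the returned HTML that only appears on the main products page. A sketch, assuming the home page has a distinctive `<title>`; the exact pattern here is a placeholder you would take from the real page:

      ```csharp
      using System;
      using System.Text.RegularExpressions;

      class RegexCheckSketch
      {
          // True when the fetched HTML looks like the main products page rather
          // than an individual product. The title text is a hypothetical marker.
          static bool LooksLikeHomePage(string html)
          {
              return Regex.IsMatch(html, @"<title>\s*Products\s*</title>",
                                   RegexOptions.IgnoreCase);
          }

          static void Main()
          {
              Console.WriteLine(LooksLikeHomePage("<html><title>Products</title></html>"));   // True
              Console.WriteLine(LooksLikeHomePage("<html><title>Widget 101</title></html>")); // False
          }
      }
      ```

      Note that matching rendered HTML is fragile if the site changes its markup; checking the response's final URL is usually more robust.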
