Parsing HTML

PIEBALDconsult

OK, I'd be the first to post a link to Parsing Html The Cthulhu Way if anyone suggests Regular Expressions, but I have a problem using an XmlDocument (and therefore XPath) with an HTML file I'm downloading. The page is a list of files to download -- I need to extract the hrefs from the <a> elements, and obviously I'd prefer to use XPath to do that.

0) The file doesn't contain an opening <HTML> tag (it does have a closing </HTML> tag) -- I can tack one on, that's not a big deal.
1) It contains at least one entity (and possibly other entities) and the XmlDocument doesn't like that.

So I need options, people!
I can summon Cthulhu.
I can use Regular Expressions to replace any offending entities and then feed the result to an XmlDocument.
What other options might there be?

Richard Deeming

HTML != XML. Use the HTML Agility Pack instead.

"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

PIEBALDconsult

Ah, sooooo... let the summoning begin! Oh, mighty Cthulhu! Wise and terrible! I ask your assistance as my days have been blighted with some gnarly HTML! Please, oh lord, come smite the bare buttocks of the wretch who hath wrought this travesty. I will repay you with a pint of bitter. Not a measly USian pint, mind you, but a proper British pint.

Richard Deeming

No need to make that call to R'lyeh yet; the HAP makes parsing an HTML document simple:

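using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(@"path\to\your\file.htm");

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string url = link.Attributes["href"].Value;
        Fhtagn(url);
    }
}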
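
Since the page in question is downloaded rather than read from disk, and contains entities, here is a minimal sketch of the same idea working from a string. It assumes the HTML Agility Pack is referenced; LinkExtractor, ExtractLinks and the baseUri parameter are hypothetical names for illustration, not part of the library or of the original question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns an absolute URL for every <a href="..."> in the markup.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        // LoadHtml parses a string; a missing <html> tag or stray entities do not make it throw.
        doc.LoadHtml(html);

        var urls = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return urls; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode link in links)
        {
            // Decode entities such as &amp; in case the attribute value is returned as written.
            string href = HtmlEntity.DeEntitize(link.GetAttributeValue("href", string.Empty));

            // Resolve relative hrefs against the page address (an assumption about the use case).
            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute))
            {
                urls.Add(absolute.AbsoluteUri);
            }
        }
        return urls;
    }
}
```

Something like LinkExtractor.ExtractLinks(pageText, new Uri("http://example.com/files/")) would then hand back the list to feed to Fhtagn, with no need to patch the markup or replace entities first.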
