Parsing HTML
-
OK, I'd be first to post a link to Parsing Html The Cthulhu Way [^] if anyone suggests Regular Expressions, but I have a problem using an XmlDocument (and therefore XPath) with an HTML file I'm downloading. The page is a list of files to download -- I need to extract the hrefs from the <a>s; obviously I'd prefer to use XPath to do that.
0) The file doesn't contain an opening <HTML> tag (it does have a closing </HTML> tag :doh: ) -- I can tack one on, that's not a big deal.
1) It contains at least one entity (and possibly other entities) and the XmlDocument doesn't like that. :mad:
So I need options, people!
I can summon Cthulhu. X|
I can use Regular Expressions to replace any offending entities and then feed the result to an XmlDocument.
What other options might there be?
-
HTML != XML. Use the HTML Agility Pack[^] instead.
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
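For anyone wondering why the XmlDocument chokes: XML predefines only five entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;apos;), so any HTML-only entity is undeclared unless a DTD defines it. A minimal sketch of that failure follows. The class name EntityDemo, the helper TryLoad, the sample markup, and the choice of &amp;nbsp; as the offending entity are all assumptions for illustration; the original post does not say which entity the page contains.

```csharp
using System;
using System.Xml;

static class EntityDemo
{
    // Hypothetical helper: returns true if XmlDocument accepts the markup,
    // false if parsing throws an XmlException.
    public static bool TryLoad(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // &amp; is one of the five entities XML predefines, so this parses.
        Console.WriteLine(TryLoad("<html><a href=\"a.zip\">A &amp; B</a></html>"));

        // &nbsp; (assumed here) is an HTML entity; without a DTD it is
        // undeclared in XML, so LoadXml throws and TryLoad returns false.
        Console.WriteLine(TryLoad("<html><a href=\"b.zip\">B&nbsp;C</a></html>"));
    }
}
```

Patching entities one by one only gets you past this particular error; unclosed tags and unquoted attributes would trip the XML parser next, which is the argument for a parser built for HTML.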
-
Ah, sooooo... let the summoning begin! Oh, mighty Cthulhu! Wise and terrible! I ask your assistance as my days have been blighted with some gnarly HTML! Please, oh lord, come smite the bare buttocks of the wretch who hast wrought this travesty. I will repay you with a pint of bitter. Not a measly USian pint mind you, but a proper British pint.
-
No need to make that call to R'lyeh yet; the HAP makes parsing an HTML document simple:
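// Needs "using HtmlAgilityPack;". SelectNodes returns null when nothing matches.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"path\to\your\file.htm");
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string url = link.Attributes["href"].Value;
        Fhtagn(url);
    }
}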
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer