
Parsing HTML

    PIEBALDconsult wrote (#1):

    OK, I'd be the first to post a link to "Parsing Html The Cthulhu Way" if anyone suggested Regular Expressions, but I have a problem using an XmlDocument (and therefore XPath) with an HTML file I'm downloading. The page is a list of files to download. I need to extract the hrefs from the <a> elements, and obviously I'd prefer to use XPath to do that.

    0) The file doesn't contain an opening <HTML> tag (it does have a closing </HTML> tag). I can tack one on; that's not a big deal.
    1) It contains at least one &nbsp; entity (and possibly other entities), and the XmlDocument doesn't like that.

    So I need options, people! I can summon Cthulhu. I can use Regular Expressions to replace any offending entities and then feed the result to an XmlDocument. What other options might there be?
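To make the entity problem concrete, here is a minimal sketch that uses only System.Xml. The sample markup and the names EntityDemo, LoadsAsXml, and ReplaceNbsp are invented for illustration; the real page will differ, and swapping one entity by hand does nothing for other named entities or unclosed tags.

```csharp
using System;
using System.Xml;

public static class EntityDemo
{
    // Returns true if XmlDocument accepts the markup, false if it throws.
    public static bool LoadsAsXml(string markup)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Hypothetical workaround: replace the HTML-only named entity with the
    // equivalent numeric character reference, which XML accepts without a DTD.
    public static string ReplaceNbsp(string markup)
    {
        return markup.Replace("&nbsp;", "&#160;");
    }

    public static void Main()
    {
        // Invented sample input, not the poster's actual file.
        string html = "<html><body><a href=\"file1.zip\">File&nbsp;1</a></body></html>";

        Console.WriteLine(LoadsAsXml(html));              // False: 'nbsp' is undeclared in XML
        Console.WriteLine(LoadsAsXml(ReplaceNbsp(html))); // True

        var doc = new XmlDocument();
        doc.LoadXml(ReplaceNbsp(html));
        foreach (XmlNode a in doc.SelectNodes("//a[@href]"))
        {
            Console.WriteLine(a.Attributes["href"].Value); // file1.zip
        }
    }
}
```

This only works while the rest of the file happens to be well-formed XML, which is rarely a safe bet for HTML found in the wild.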


      Richard Deeming wrote (#2):

      HTML != XML. Use the HTML Agility Pack instead.

      "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

        PIEBALDconsult wrote (#3):

        Ah, sooooo... let the summoning begin! Oh, mighty Cthulhu! Wise and terrible! I ask your assistance, as my days have been blighted with some gnarly HTML! Please, oh lord, come smite the bare buttocks of the wretch who hast wrought this travesty. I will repay you with a pint of bitter. Not a measly USian pint, mind you, but a proper British pint.


          Richard Deeming wrote (#4):

          No need to make that call to R'lyeh yet; the HAP makes parsing an HTML document simple:

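          using HtmlAgilityPack;

          HtmlDocument doc = new HtmlDocument();
          doc.Load(@"path\to\your\file.htm");

          // SelectNodes returns null when nothing matches the XPath.
          HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
          if (links != null)
          {
              foreach (HtmlNode link in links)
              {
                  string url = link.Attributes["href"].Value;
                  Fhtagn(url);  // placeholder: do whatever you need with the URL
              }
          }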
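For the specific page described at the top of the thread (no opening <HTML> tag, at least one &nbsp;), a slightly fuller sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced. The sample markup, the base URL, and the names LinkExtractor and ExtractLinks are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the absolute URL of every <a href="..."> found in the markup.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // tolerant parser: the input need not be well-formed

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode a in anchors)
        {
            // The attribute text may still be entity-encoded (e.g. &amp;), so decode it.
            string href = HtmlEntity.DeEntitize(a.GetAttributeValue("href", string.Empty));

            // Resolve relative hrefs against the (assumed) page address.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }

    public static void Main()
    {
        // Invented sample: missing opening <html>, stray closing tag, &nbsp; entity.
        string html = "<body><a href=\"files/a.zip\">File&nbsp;A</a>"
                    + "<a name=\"top\">no href</a></body></html>";

        foreach (string url in ExtractLinks(html, new Uri("http://example.com/downloads/")))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the parser is built for HTML rather than XML, neither the missing tag nor the entity needs patching before the XPath query runs.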
