Oh, that ol' Cthulhu sure is sneaky...

Tags: help, csharp, html, winforms, com
11 Posts 7 Posters 0 Views 1 Watching
    PIEBALDconsult wrote:

    But he won't catch me so easily. :cool: I've passed along links to Parsing Html The Cthulhu Way many times so I always have the issue in mind. I usually read HTML with an XmlDocument (when I can) or the WinForms WebBrowser control, and I've seen others recommending the HTML Agility Pack. This week I received a bunch of large HTML files to scrape. They're not well-formed XML -- no surprise there. So I decided that this would be a good opportunity to try the HTML Agility Pack. It was able to read a sample, but it complained about “Start tag <td> was not found” -- which was surprising. The problem? Several elements like this:

    <th style="width: 5%"><!-- rule --></td>

    :omg: The WinForms WebBrowser control is also able to read it, but the two tools treat it slightly differently and my initial feeling is that the WebBrowser handles it a little better. So, the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun. :badger:
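
For anyone who wants to try that at home, here is a minimal sketch of why the `<th>...</td>` sample trips up a regex. It uses Python's standard library rather than any of the .NET tools mentioned in this thread, so it says nothing about how the HTML Agility Pack, the WebBrowser control, or AngleSharp behave internally. The names `SAMPLE`, `TH_CELL`, `TagAudit`, `regex_cells` and `audit` are made up for illustration, and the second header cell in `SAMPLE` is an invented neighbour added to make the effect visible.

```python
import re
from html.parser import HTMLParser

# The malformed cell from the post, plus one well-formed (invented) neighbour.
SAMPLE = '<tr><th style="width: 5%"><!-- rule --></td><th>Name</th></tr>'

# Naive regex: assumes every <th> is closed by a matching </th>.
TH_CELL = re.compile(r"<th\b[^>]*>(.*?)</th>", re.DOTALL)


class TagAudit(HTMLParser):
    """Records table cells that are closed by a different cell tag."""

    CELLS = {"th", "td"}

    def __init__(self):
        super().__init__()
        self.open_cells = []   # stack of currently open th/td tags
        self.mismatches = []   # (opened_as, closed_as) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.CELLS:
            self.open_cells.append(tag)

    def handle_endtag(self, tag):
        if tag in self.CELLS and self.open_cells:
            opened = self.open_cells.pop()
            if opened != tag:
                self.mismatches.append((opened, tag))


def regex_cells(html):
    """Return what the naive regex believes the <th> cell contents are."""
    return TH_CELL.findall(html)


def audit(html):
    """Return the list of mismatched cell tags found by the tokenizer."""
    parser = TagAudit()
    parser.feed(html)
    parser.close()
    return parser.mismatches


if __name__ == "__main__":
    print(regex_cells(SAMPLE))  # one "cell" that swallows the next header
    print(audit(SAMPLE))        # the th/td mismatch, reported explicitly
```

The regex silently merges two cells into one because it keeps scanning until it finds a `</th>`, while a tokenizer at least sees each tag separately, so the mismatch can be detected and handled deliberately.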

    Brisingr Aerowing wrote (#2):

    My favorite is AngleSharp

    What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???

      Chris Maunder wrote (#3, in reply to Brisingr Aerowing):

      We're moving off the AgilityPack onto AngleSharp.

      cheers Chris Maunder

        Chris Maunder wrote (#4, in reply to PIEBALDconsult):

        One of my earliest gigs was writing an XML, and then HTML, parser. I learned why browsers treat HTML so differently, but never learned why browser writers were so pig-headed in their insistence on sticking to clearly ludicrous decisions when ambiguity in the "spec" surfaced. As it did often back then. So every time I see an HTML parser I give a solemn nod to the author. And then wish them the speediest exit possible from that gig.

        cheers Chris Maunder

          V 0 wrote (#5, in reply to PIEBALDconsult):

          Somehow, I immediately thought of this when I saw the title of your post. Enjoy :)

          V.

          (MQOTD rules and previous solutions)

            OriginalGriff wrote (#6, in reply to PIEBALDconsult):

            PIEBALDconsult wrote:

            the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun

            You're a cruel, cruel man. I like it. :thumbsup:

            Bad command or file name. Bad, bad command! Sit! Stay! Staaaay...

            "I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
            "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt

              Brisingr Aerowing wrote (#7, in reply to Chris Maunder):

              AngleSharp is easily one of the best parsers out there. And it seems Firefox doesn't think parsers is a word and wants it to be passer or parers.

                Denis A Stoyanov wrote (#8, in reply to PIEBALDconsult):

                PIEBALDconsult wrote:

                So, the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun.

                Then, a few days later, after he is broken, just send him this piece of art.

                  Middle Manager wrote (#9, in reply to V 0):

                  Thanks for the listen, man! :thumbsup: I now want to kick ass this morning. :-D

                    PIEBALDconsult wrote (#10, in reply to Chris Maunder):

                    I'm beginning to think that the HtmlAgilityPack uses RegularExpressions. :sigh: I'll have to try AngleSharp. Oh, look, an article... :-D

                      Brisingr Aerowing wrote (#11, in reply to PIEBALDconsult):

                      A quick look at the HAP source code suggests that they parse it character by character. I guess that's why it was so slow (it spent over three minutes 'parsing') when I tested it on a 1298-line HTML file (I can't remember where I found that file). AngleSharp parsed the same file much faster (in a few seconds).
