Oh, that ol' Cthulhu sure is sneaky...

PIEBALDconsult

But he won't catch me so easily. :cool: I've passed along links to Parsing Html The Cthulhu Way many times so I always have the issue in mind. I usually read HTML with an XmlDocument (when I can) or the WinForms WebBrowser control, and I've seen others recommending the HTML Agility Pack. This week I received a bunch of large HTML files to scrape. They're not well-formed XML -- no surprise there. So I decided that this would be a good opportunity to try the HTML Agility Pack. It was able to read a sample, but it complained about “Start tag <td> was not found” -- which was surprising. The problem? Several elements like this:

:omg: The WinForms WebBrowser control is also able to read it, but the two tools treat it slightly differently and my initial feeling is that the WebBrowser handles it a little better. So, the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun. :badger:
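For anyone who wants to try that experiment at home, here is a minimal sketch in Python. The snippet in `SAMPLE` is a made-up stand-in, not the markup from the files above, and the helper names (`cells_with_regex`, `cells_with_parser`, `CellCollector`) are hypothetical. It only shows one way a naive pattern can go wrong: a `>` inside a quoted attribute value ends the "tag" early for the regex, while the standard library's `html.parser` tokenizer reads the attribute as a whole.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: valid HTML, but with a '>' inside an attribute value.
SAMPLE = '<table><tr><td title="a > b">one</td><td>two</td></tr></table>'


def cells_with_regex(html):
    """Naive approach: treat everything up to the first '>' as the start tag."""
    return re.findall(r"<td[^>]*>(.*?)</td>", html, flags=re.S)


class CellCollector(HTMLParser):
    """Collects the text content of each <td> using a real tokenizer."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None  # None while we are outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def cells_with_parser(html):
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells


if __name__ == "__main__":
    print("regex :", cells_with_regex(SAMPLE))
    print("parser:", cells_with_parser(SAMPLE))
```

With this sample the regex version leaks part of the attribute into the first cell, while the parser version returns the two cell texts. A cleverer pattern could patch this one case, but the next odd file usually finds another.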

Brisingr Aerowing

My favorite is AngleSharp.

What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???

Chris Maunder

We're moving off the AgilityPack onto AngleSharp.

cheers Chris Maunder

Chris Maunder

One of my earliest gigs was writing an XML, and then HTML, parser. I learned why browsers treat HTML so differently, but never learned why browser writers were so pig-headed in their insistence on sticking to clearly ludicrous decisions when ambiguity in the "spec" surfaced. As it did often back then. So every time I see an HTML parser I give a solemn nod to the author. And then wish them the speediest exit possible from that gig.

V.

Somehow, I immediately thought of this when I saw the title of your post. Enjoy[^] :)

V.

(MQOTD rules and previous solutions)

OriginalGriff

PIEBALDconsult wrote:

the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun

You're a cruel, cruel man. I like it. :thumbsup:

Bad command or file name. Bad, bad command! Sit! Stay! Staaaay...

Brisingr Aerowing

AngleSharp is easily one of the best parsers out there. And it seems Firefox doesn't think "parsers" is a word and wants it to be "passer" or "parers".

Denis A Stoyanov

So, the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun.

Then, a few days later, after he is broken, just send him this piece of art.

Middle Manager

Thanks for the listen, man! :thumbsup: I now want to kick ass this morning. :-D

PIEBALDconsult

I'm beginning to think that the HtmlAgilityPack uses RegularExpressions. :sigh: I'll have to try AngleSharp. Oh, look, an article... :-D

Brisingr Aerowing

From a quick look at the HAP source code, it seems they parse the input character by character. I guess that's why it was so slow (it spent over three minutes 'parsing') when I tested it on a 1298-line HTML file (I can't remember where I found that file). AngleSharp parsed the same file much faster (in a few seconds).
