Oh, that ol' Cthulhu sure is sneaky...
-
But he won't catch me so easily. :cool: I've passed along links to Parsing Html The Cthulhu Way[^] many times so I always have the issue in mind. I usually read HTML with an XmlDocument (when I can) or the WinForms WebBrowser control, and I've seen others recommending the HTML Agility Pack. This week I received a bunch of large HTML files to scrape. They're not well-formed XML -- no surprise there. So I decided that this would be a good opportunity to try the HTML Agility Pack. It was able to read a sample, but it complained about
“Start tag <td> was not found”
-- which was surprising. The problem? Several elements that open as a th but close as a td, like this: <th style="width: 5%"><!-- rule --></td>
:omg: The WinForms WebBrowser control is also able to read it, but the two tools treat it slightly differently and my initial feeling is that the WebBrowser handles it a little better. So, the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun. :badger:
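For anyone who wants to see that kind of mismatch flagged without firing up the tools mentioned above, here is a rough sketch. It is an assumption-heavy illustration, not how the HTML Agility Pack or the WebBrowser control work: it uses Python's standard `html.parser` rather than the .NET libraries, and the `MismatchFinder` class, the `find_mismatches` helper, and the table wrapped around the sample are all made up for the example. It keeps a stack of open tags and reports any end tag that has no matching start tag.

```python
from html.parser import HTMLParser


class MismatchFinder(HTMLParser):
    """Hypothetical helper: track open tags and record stray end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.problems = []    # (stray end tag, innermost open tag) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching start tag; anything
            # popped along the way is treated as implicitly closed.
            while self.open_tags.pop() != tag:
                pass
        else:
            # No start tag for this end tag: record it and carry on.
            innermost = self.open_tags[-1] if self.open_tags else None
            self.problems.append((tag, innermost))


def find_mismatches(html):
    finder = MismatchFinder()
    finder.feed(html)
    finder.close()
    return finder.problems


# The surrounding table and row are assumed; only the th/td pair is from the post.
sample = '<table><tr><th style="width: 5%"><!-- rule --></td></tr></table>'
print(find_mismatches(sample))  # [('td', 'th')]
```

The point of the stack is that a stray `</td>` is noted and skipped instead of derailing everything after it, which is roughly the forgiving behavior one wants from a scraper.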
-
My favorite is AngleSharp[^]
What do you get when you cross a joke with a rhetorical question? The metaphorical solid rear-end expulsions have impacted the metaphorical motorized bladed rotating air movement mechanism. Do questions with multiple question marks annoy you???
-
We're moving off the AgilityPack onto AngleSharp.
cheers Chris Maunder
-
One of my earliest gigs was writing an XML, and then an HTML, parser. I learned why browsers treat HTML so differently, but never learned why browser writers were so pig-headed in their insistence on sticking to clearly ludicrous decisions when ambiguity in the "spec" surfaced, as it often did back then. So every time I see an HTML parser I give a solemn nod to the author. And then wish them the speediest exit possible from that gig.
cheers Chris Maunder
-
PIEBALDconsult wrote:
the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun
You're a cruel, cruel man. I like it. :thumbsup:
Bad command or file name. Bad, bad command! Sit! Stay! Staaaay...
-
AngleSharp is easily one of the best parsers out there. And it seems Firefox doesn't think "parsers" is a word and wants it to be "passer" or "parers".
-
So, the next time you encounter a developer who insists on consuming HTML with RegEx, pass them a sample like that, sit back, and watch the fun.
Then, a few days later, after he is broken, just send him this piece of art.
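In case anyone wants a sample to hand over, here is a small sketch of what goes wrong. Everything in it is assumed for illustration: the `TH_CELL` pattern is just the sort of regex one might write to pull header cells, the `header_cells` helper is hypothetical, and the second "Name" cell is invented so the row has something to lose. Only the th-closed-by-td element comes from the original post.

```python
import re

# A typical "good enough" pattern for grabbing header cells (an assumption,
# not taken from any real scraper).
TH_CELL = re.compile(r"<th[^>]*>(.*?)</th>", re.IGNORECASE | re.DOTALL)


def header_cells(row_html):
    """Return the inner text of every <th>...</th> pair the regex can find."""
    return TH_CELL.findall(row_html)


good = '<tr><th style="width: 5%"><!-- rule --></th><th>Name</th></tr>'
bad = '<tr><th style="width: 5%"><!-- rule --></td><th>Name</th></tr>'

print(header_cells(good))  # ['<!-- rule -->', 'Name']
print(header_cells(bad))   # ['<!-- rule --></td><th>Name']
```

With the mismatched end tag, the lazy match runs on to the next real `</th>`, so two cells silently collapse into one and no error is raised anywhere.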
-
Somehow, I immediately thought of this when I saw the title of your post. Enjoy[^] :)
V.
(MQOTD rules and previous solutions)
Thanks for the listen, man! :thumbsup: Now I want to kick ass this morning. :-D
-
I'm beginning to think that the HtmlAgilityPack uses RegularExpressions. :sigh: I'll have to try AngleSharp. Oh, look, an article... :-D
-
From a quick look at the HAP source code, it seems they parse the input character by character. I guess that's why it was so slow (it spent over three minutes 'parsing') when I tested it on a 1,298-line HTML file (I can't remember where I found that file). AngleSharp parsed the same file much faster (in a few seconds).