HTML into XML
-
I have a web spider. Once it finds a page, I need to convert that page into XML. It can be any page, in any of many different formats. Is there a component I can use for that? Does anyone know a really nice and fast way to do it? Greetings :omg:
-
Um... HTML is a subset of XML. Just by the mere fact that you're getting HTML from the site, you already have XML. Are you trying to use a certain portion of it? Are you worried about the well-formedness of the HTML? Where are you having the problem?
-
Stephan Samuel wrote:
HTML is a subset of XML
No HTML is not a subset of XML. XHTML is XML. HTML is not XML.
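The distinction is easy to check with any strict XML parser. Here is a minimal sketch in Python, using only the standard library; the helper name `is_well_formed_xml` and the two sample strings are made up for illustration and do not come from the thread.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical HTML: <br> is never closed and </p> is omitted.
html_sample = "<html><body><p>Hello<br>world</body></html>"

# The same content written as XHTML: every element is closed.
xhtml_sample = "<html><body><p>Hello<br/>world</p></body></html>"

print(is_well_formed_xml(html_sample))
print(is_well_formed_xml(xhtml_sample))
```

The first string is perfectly acceptable to a browser, but an XML parser rejects it because of the unclosed elements.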
led mike wrote:
No HTML is not a subset of XML. XHTML is XML. HTML is not XML.
True, and I stand corrected, but if you've got non-XHTML HTML, good luck loading it into anything other than a browser. Luckily, many modern sites deliver XHTML. I don't know of anything that converts HTML to XHTML, but I'm sure it's been written. Writing one yourself is an interesting regex exercise that'll be left to the reader. Short of that, there's always running string processing routines on the HTML and whacking the results into an XML DOM. Seems like it'd be an extra step in many situations, though.
-
Stephan Samuel wrote:
I don't know of anything that converts HTML to XHTML, but I'm sure it's been written.
There have been attempts. Last time I checked (18 months or so ago), I was unable to find anything that actually worked on "real" HTML. In other words, it worked depending on the HTML, so... sometimes. :)
-
Um... HTML is a subset of XML. Just by the mere fact that you're getting HTML from the site, you already have XML. Are you trying to use a certain portion of it? Are you worried about the well-formedness of the HTML? Where are you having the problem?
Not true. There are a lot of tags that do not respect the XML format, like "br", "img" and "input", and you are not forced to close a tag in HTML; you can leave it open if you want. That is where I am having my problem... Thanks for the reply :wtf:
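To illustrate the problem described here, this is a rough sketch of a lenient converter written with Python's standard `html.parser` module. It is not one of the components mentioned in this thread. The class name `LenientHtmlBuilder`, the function `html_to_xml`, and the wrapper element `document` are all hypothetical. It closes void elements such as "br", "img" and "input" immediately, and closes anything still open at the end of input. It makes no attempt at browser-grade error recovery, and it does not sanitize tag or attribute names that would be illegal in XML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class LenientHtmlBuilder(HTMLParser):
    """Builds an ElementTree from tag soup (illustrative, not exhaustive)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        # Hypothetical wrapper so the result always has a single root.
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as "disabled" get their own name as value.
        attributes = {k: (k if v is None else v) for k, v in attrs}
        element = ET.SubElement(self.stack[-1], tag, attributes)
        if tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    builder = LenientHtmlBuilder()
    builder.feed(html)
    builder.close()
    # Elements still open at this point are closed implicitly on output.
    return ET.tostring(builder.root, encoding="unicode")


xml_text = html_to_xml('<p>Hello<br>world<img src="a.png"><input disabled>')
print(xml_text)
```

The output can be loaded by a strict XML parser, which is the property the spider needs. Real pages with mis-nested tags, odd attribute names or control characters will need more care than this.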
-
Stephan Samuel wrote:
I don't know of anything that converts HTML to XHTML, but I'm sure it's been written.
There have been attempts. Last time I checked (18 months or so ago), I was unable to find anything that actually worked on "real" HTML. In other words, it worked depending on the HTML, so... sometimes. :)
Yep! I found some components, but I couldn't find one that worked all the time... They worked for simple scenarios, but not for complex ones. :wtf::rolleyes:
-
HtmlTidy will read in badly formed HTML and can output well-formed XHTML for you. You may be able to use it through P/Invoke or, if you prefer, you can use the command-line version; I haven't investigated. HtmlTidy is recommended by the W3C for tidying up code and, as far as I know, it's the only one that people accept works almost all of the time. In fact, I'm surprised you haven't come across it. ;P
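For the command-line route, a wrapper might look roughly like the sketch below. It assumes a `tidy` executable is installed and on the PATH, which the thread does not confirm for any particular setup. The function names `build_tidy_command` and `html_to_xhtml` are made up for illustration. The options used (`-asxhtml`, `-numeric`, `-quiet`, `-utf8`, `--show-warnings`, `--force-output`) are standard Tidy options, but check them against the version you have installed.

```python
import shutil
import subprocess
from typing import List


def build_tidy_command(executable: str = "tidy") -> List[str]:
    """Command line asking Tidy to read stdin and write XHTML to stdout."""
    return [
        executable,
        "-asxhtml",                # convert HTML to well-formed XHTML
        "-numeric",                # numeric entities, so no DTD is needed
        "-quiet",                  # suppress the summary banner
        "-utf8",                   # treat input and output as UTF-8
        "--show-warnings", "no",
        "--force-output", "yes",   # emit output even when errors were found
    ]


def html_to_xhtml(html: str, executable: str = "tidy") -> str:
    """Run Tidy on a string of HTML (assumes Tidy is installed)."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    result = subprocess.run(
        build_tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # Tidy exits with 1 for warnings and 2 for errors
    )
    if not result.stdout:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return result.stdout.decode("utf-8")
```

A spider could call `html_to_xhtml(page_source)` on each page and pass the result to its XML parser, treating an exception as "this page could not be converted".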
-
Hi, you can try this: HTML TO XML. It's free. Eran Aharonovich (eran.aharonovich@gmail.com), Noviway