Code Project

HTML into XML

    Felipe Dalorzo wrote (#1):

    I have a web spider. Once it finds a page, I need to convert that page into XML. It can be any page, in many different formats. Is there a component I can use for that? Does anyone know a really nice and fast way to do it? Greetings

      Stephan Samuel wrote (#2):

      Um... HTML is a subset of XML. Just by the mere fact that you're getting HTML from the site, you already have XML. Are you trying to use a certain portion of it? Are you worried about the well-formedness of the HTML? Where are you having the problem?

        led mike wrote (#3):

        Stephan Samuel wrote:

        HTML is a subset of XML

        No, HTML is not a subset of XML. XHTML is XML. HTML is not XML.

          Stephan Samuel wrote (#4):

          led mike wrote:

          No, HTML is not a subset of XML. XHTML is XML. HTML is not XML.

          True, and I stand corrected, but if you've got non-XHTML HTML, good luck loading it into anything other than a browser. Luckily, many modern sites deliver XHTML. I don't know of anything that converts HTML to XHTML, but I'm sure it's been written. Writing one yourself is an interesting regex exercise that'll be left to the reader. Short of that, there's always running string processing routines on the HTML and whacking the results into an XML DOM. Seems like it'd be an extra step in many situations, though.

            led mike wrote (#5):

            Stephan Samuel wrote:

            I don't know of anything that converts HTML to XHTML, but I'm sure it's been written.

            There have been attempts. Last time I checked (18 months or so ago), I was unable to find anything that actually worked on "real" HTML. In other words, whether it worked depended on the HTML, so... sometimes. :)

              Felipe Dalorzo wrote (#6):

              Not true. There are a lot of tags that do not respect XML format, like "br", "img" and "input", and you are not forced to close a tag in HTML; you can leave it open if you want. That is where I am having my problem... Thanks for the reply
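
The two problems named above (never-closed tags such as "br", "img" and "input", and tags left open) can be illustrated with a minimal sketch that uses only the Python standard library. The names `HtmlToXml`, `html_to_xml`, `VOID` and the `document` wrapper element are hypothetical and not from the thread. This is a rough approach under those assumptions, not a full HTML-to-XHTML converter: it does not apply HTML's implied-end-tag rules (a second `<p>` ends up nested inside the first), and unusual tag or attribute names could still yield output an XML parser rejects.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take an end tag (assumed list for this sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Builds an ElementTree from lenient HTML, closing open tags itself."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (<input disabled>) arrive with value None;
        # XML needs a value, so repeat the attribute name.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # a stray end tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child goes into that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    # Elements still open at end of input are closed by serialization.
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    print(html_to_xml('<p>Hello<br>world<img src="a.png"><p>Second'))
```

Serializing through `ElementTree` is what guarantees matched tags and escaped text here; the parser class only decides where each element attaches.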

                Felipe Dalorzo wrote (#7):

                Yep! I found some components, but I couldn't find one that worked all the time... They worked for simple scenarios, but not for complex ones.

                  Ed Poore wrote (#8):

                  HtmlTidy will read in badly formed HTML and can output well-formed XHTML for you. You may be able to use P/Invoke or, if you prefer, the command-line version; I haven't investigated. HtmlTidy is recommended by the W3C for tidying up code, and as far as I know it's the only one that people accept works almost all of the time. In fact, I'm surprised you haven't come across it. ;P
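
As a rough illustration of the command-line route, the sketch below pipes a page through the `tidy` executable from Python. It assumes HTML Tidy is installed and on the PATH. The function names `tidy_command` and `html_to_xhtml` are hypothetical. The flags are Tidy's own command-line options as I understand them, so check them against the installed version.

```python
import shutil
import subprocess


def tidy_command(executable="tidy"):
    """Argument list for: HTML on stdin, XHTML on stdout."""
    return [
        executable,
        "-quiet",                     # suppress non-essential output
        "-asxhtml",                   # convert HTML to well-formed XHTML
        "-utf8",                      # read and write UTF-8
        "--numeric-entities", "yes",  # &#160; rather than &nbsp;
        "--force-output", "yes",      # write output even if errors were found
    ]


def html_to_xhtml(html, executable="tidy"):
    """Run Tidy on a string of HTML and return the XHTML it produces."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    result = subprocess.run(
        tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # Tidy exits with 1 on warnings and 2 on errors
    )
    if not result.stdout:
        message = result.stderr.decode("utf-8", "replace")
        raise RuntimeError(f"tidy produced no output: {message}")
    return result.stdout.decode("utf-8")
```

A spider could call `html_to_xhtml(page_text)` on each fetched page and load the result into an XML DOM. A nonzero exit status is not treated as failure here, because Tidy also reports warnings that way.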



                    Eran Aharonovich wrote (#9):

                    Hi, you can try this: HTML TO XML. It's free. Eran Aharonovich (eran.aharonovich@gmail.com), Noviway
