HTML Parser in C++

    Vickie
    wrote on last edited by
    #1

    Does anyone have an HTML parser written in Visual C++ that I could use for reference? Thank you.

      NormDroid
      wrote on last edited by
      #2

      Vickie, I have been looking for a parser myself; I need to parse raw HTML fetched over a Winsock connection. In the end I wrote my own, but it is limited in that it only parses links. If you get a copy of WebFerret (www.webferret.com) and search for "C++ HTML Parser", you'll find a parser written for Unix, which I'm sure could be ported to Win32 with little effort. Regards, Norm

          Vickie
          wrote on last edited by
          #4

          In my program, I just want to parse the contents, headings, frames and links. However, following your advice, I could only find XML parsers in C/C++ and none for HTML. Can you give me more advice, or show me some code? Thanks! Regards, Vickie

            Tim Deveaux
            wrote on last edited by
            #5

            If you don't need to format text, this might not be too difficult.

            Let's say you can load the whole page into a CString. Work directly with the file stream if you like, but a CString or string might allow less code for a prototype. Every item you're interested in will start with a '<' character. As you travel through the string finding this character, you examine the characters that follow (perhaps using Mid(position, strlen(tag)) and Compare, and remembering to discard initial white space) to see whether they match one of the keywords you're interested in (H1, H2, href etc.) and take action. If the characters after the '<' don't match anything you need, just go back to looking for the next '<'.

            This is very much 'context free' parsing, and it is something that HTML lends itself well to. Writing a browser gets a lot more complicated, but for what you want to do this might be a good start. Probably the only 'context' you have to deal with is remembering that you have seen a <tag> when you encounter a </tag>.

            There are certainly some details to work out as to how you store and parse the tags you're looking for, but start small and you'll probably find a reasonably efficient setup. I bet you could do something quite elegant with the STL containers, if that's your cup of tea. You might also search for some web spider code; there's bound to be some link parsing you can borrow in that type of app.
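
            A minimal sketch of the '<' scan described in this post, assuming the page is already in memory. It uses std::string rather than CString so that it compiles without MFC. The names TagHit and scan_tags are made up for illustration, and the sketch knows nothing about comments, scripts or attribute values, so a '<' inside those would be misread.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One recognised opening or closing tag found in the page (hypothetical type).
struct TagHit {
    std::string name;    // lower-cased tag name, e.g. "h1" or "a"
    bool closing;        // true for </tag>
    std::size_t offset;  // position of the '<' in the input
};

// Look for each '<', read the tag name that follows it, and keep the tag only
// if its name is in `wanted` (names in `wanted` must be lower case).
std::vector<TagHit> scan_tags(const std::string& html,
                              const std::set<std::string>& wanted) {
    std::vector<TagHit> hits;
    std::size_t pos = 0;
    while ((pos = html.find('<', pos)) != std::string::npos) {
        std::size_t i = pos + 1;
        // Discard white space between '<' and the tag name.
        while (i < html.size() &&
               std::isspace(static_cast<unsigned char>(html[i]))) {
            ++i;
        }
        bool closing = false;
        if (i < html.size() && html[i] == '/') {
            closing = true;
            ++i;
        }
        // Collect the tag name, folding it to lower case as we go.
        std::string name;
        while (i < html.size() &&
               std::isalnum(static_cast<unsigned char>(html[i]))) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(html[i])));
            ++i;
        }
        if (!name.empty() && wanted.count(name) != 0) {
            hits.push_back({name, closing, pos});
        }
        ++pos;  // no match or already recorded: go on to the next '<'
    }
    return hits;
}
```

            Calling scan_tags(page, {"h1", "h2", "a", "frame"}) returns the matching tags in document order. Pairing each <tag> with its </tag> could then be done by walking the result with a stack, which is the only 'context' the post mentions.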

              Vickie
              wrote on last edited by
              #6

              Thanks a lot for your detailed advice! I am trying to write the parser now, and I hope to have good news in a few days!

                Tim Deveaux
                wrote on last edited by
                #7

                Just noticed that Andrew Koenig and Barbara Moo's article in January's CUJ has a routine (p48) for finding URLs in a text string. Worth a look if you're interested in an STL solution.
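
                For readers without the magazine to hand, here is one way such a routine might look with the standard algorithms. This is not the code from the article, only a sketch under my own assumptions: a URL is taken to be a run of letters, then "://", then a run of characters from a hand-picked set. The names is_url_char and find_urls are hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// True for characters this sketch allows inside a URL. The set is an
// assumption for illustration, not the full URL grammar.
bool is_url_char(char c) {
    static const std::string extra = "~;/?:@=&$-_.+!*'(),%#";
    return std::isalnum(static_cast<unsigned char>(c)) ||
           extra.find(c) != std::string::npos;
}

// Return every substring of `text` that looks like scheme://rest.
std::vector<std::string> find_urls(const std::string& text) {
    static const std::string sep = "://";
    typedef std::string::const_iterator Iter;
    std::vector<std::string> urls;
    Iter it = text.begin();
    while ((it = std::search(it, text.end(), sep.begin(), sep.end())) !=
           text.end()) {
        // Walk backwards over the letters that form the scheme.
        Iter first = it;
        while (first != text.begin() &&
               std::isalpha(static_cast<unsigned char>(*(first - 1)))) {
            --first;
        }
        // Walk forwards over the characters that may follow "://".
        Iter after = it + static_cast<std::string::difference_type>(sep.size());
        Iter last = std::find_if_not(after, text.end(), is_url_char);
        if (first != it && last != after) {
            urls.push_back(std::string(first, last));
            it = last;   // continue after the URL just found
        } else {
            it = after;  // a bare "://": skip it and keep searching
        }
    }
    return urls;
}
```

                Because punctuation such as ')' and ',' is accepted as part of a URL, a link written in running prose can pick up trailing punctuation. Trimming that is left out to keep the sketch short.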
