HTML Parser in C++
-
Vickie, I've been looking for a parser myself; I need to parse raw HTML fetched from a Winsock connection. In the end I wrote my own, but it is limited in that it only parses links. If you get a copy of WebFerret (www.webferret.com) and search for "C++ HTML Parser", you'll find a parser written for Unix, which I'm sure could be ported to Win32 with little effort. Regards, Norm
-
In my program, I just want to parse the contents, headings, frames and links. However, following your advice, I could only find XML parsers in C/C++, but none for HTML. Can you give me more advice, or show me some code? Thanks! Regards, Vickie
-
If you don't need to format text, this might not be too difficult. Hmmm... let's say you can load the whole page into, say, a CString. Work directly with the file stream if you like, but a CString or string might allow less code for a prototype. Every item you're interested in will start with a '<' character. As you travel through the string, Find-ing this character, you examine the chars that follow (perhaps using Compare and Mid(position, strlen(tag)), and remembering to discard initial white space) to see if they match one of the keywords you're interested in (H1, H2, href etc.) and take action. If the chars after the '<' don't match anything you need, just go back to looking for the next '<'.

This is very much 'context free' parsing, and is something that HTML lends itself well to. Writing a browser gets a lot more complicated, but for what you want to do this might be a good start. Probably the only 'context' you have to deal with is remembering that you have seen a <tag> when you encounter a </tag>. There are certainly some details to work out as to how you store and parse the tags you're looking for, but start small and you'll probably find a reasonably efficient setup. I bet you could do something quite elegant with the STL containers, if that's your cup of tea. You might also search for some web spider code; there's bound to be some link parsing you can borrow in that type of app.
-
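For what it's worth, the scan-for-'<' idea above could be sketched roughly like this with std::string instead of CString. This is only a minimal sketch, not a real HTML parser: the names FoundTag, scanTags and attributeValue are made up for illustration, it assumes the whole page is already in memory, and it will be fooled by a '>' inside an attribute value, by spaces around '=', and by comments or scripts.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// One tag of interest found in the page (hypothetical type for this sketch).
struct FoundTag {
    std::string name;  // lower-cased tag name, e.g. "h1", "a", "frame"
    std::string body;  // raw text between the tag name and the closing '>'
};

// Lower-case a copy of s so names compare case-insensitively.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Find each '<', read the tag name that follows, and keep the tag if its
// name is listed in `wanted` (lower case). Closing tags such as </h1>
// yield an empty name and are therefore skipped.
std::vector<FoundTag> scanTags(const std::string& html,
                               const std::vector<std::string>& wanted) {
    std::vector<FoundTag> found;
    std::string::size_type pos = 0;
    while ((pos = html.find('<', pos)) != std::string::npos) {
        std::string::size_type i = pos + 1;
        // Discard white space after '<'.
        while (i < html.size() && std::isspace(static_cast<unsigned char>(html[i]))) ++i;
        // Collect the tag name (letters and digits only).
        std::string::size_type nameStart = i;
        while (i < html.size() && std::isalnum(static_cast<unsigned char>(html[i]))) ++i;
        std::string name = toLower(html.substr(nameStart, i - nameStart));
        std::string::size_type close = html.find('>', i);
        if (close == std::string::npos) break;  // unterminated tag: stop scanning
        if (std::find(wanted.begin(), wanted.end(), name) != wanted.end())
            found.push_back({name, html.substr(i, close - i)});
        pos = close + 1;  // go back to looking for the next '<'
    }
    return found;
}

// Return the value of attribute `attr` (lower case) from a tag body, or ""
// if absent. Handles attr="v", attr='v' and attr=v, with no spaces around '='.
std::string attributeValue(const std::string& body, const std::string& attr) {
    std::string::size_type p = toLower(body).find(attr + "=");
    if (p == std::string::npos) return "";
    p += attr.size() + 1;
    if (p >= body.size()) return "";
    char q = body[p];
    if (q == '"' || q == '\'') {
        std::string::size_type end = body.find(q, p + 1);
        return body.substr(p + 1, end == std::string::npos ? end : end - p - 1);
    }
    std::string::size_type end = body.find_first_of(" \t\r\n", p);
    return body.substr(p, end == std::string::npos ? end : end - p);
}
```

You would call scanTags with the page text and a list such as "h1", "h2", "a", "frame", then use attributeValue on each result's body to pull out href or src. The text between a heading's opening and closing tags is not captured here; that is where remembering the last <tag> seen would come in.
-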
Thanks a lot for your detailed advice! I am trying to write the parser now, and I hope to have good news in a few days!
Just noticed that Andrew Koenig and Barbara Moo's article in January's CUJ has a routine (p. 48) for finding URLs in a text string. Worth a look if you're interested in an STL solution.
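To give a flavour of what an STL-style URL finder might look like, here is a small sketch. It is not the article's code, just one way the standard algorithms could be put to work: it looks for "://", walks back over the letters of the scheme, and walks forward over characters that may appear in a URL. The function name findUrls and the set of allowed characters are my own assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// True for characters this sketch does NOT accept inside a URL.
// The accepted set is a simplified assumption, not taken from any standard.
static bool notUrlChar(char c) {
    static const std::string extra = "~;/?:@=&$-_.+!*'(),%#";
    return !(std::isalnum(static_cast<unsigned char>(c)) ||
             extra.find(c) != std::string::npos);
}

static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }

// Collect every scheme://rest substring found in text.
std::vector<std::string> findUrls(const std::string& text) {
    static const std::string sep = "://";
    std::vector<std::string> urls;
    std::string::const_iterator b = text.begin(), e = text.end();
    while (b != e) {
        // Locate the next "://".
        std::string::const_iterator s = std::search(b, e, sep.begin(), sep.end());
        if (s == e) break;
        // Walk backwards over the letters of the scheme (e.g. "http").
        std::string::const_iterator start = s;
        while (start != b && isAlpha(start[-1])) --start;
        // Walk forwards to the first character that cannot be part of a URL.
        std::string::const_iterator after = s + sep.size();
        std::string::const_iterator end = std::find_if(after, e, notUrlChar);
        // Require a scheme before "://" and at least one character after it.
        if (start != s && end != after) urls.push_back(std::string(start, end));
        b = end;  // continue after this candidate
    }
    return urls;
}
```

Because it works on plain text rather than tags, it will also pick up URLs outside href attributes. Trailing punctuation such as a full stop is kept as part of the URL, so you may want to trim that afterwards.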