XML parser generators?

Jorgen Sigvardsson

After having read up on XSD, I've come to the conclusion that it should be possible to write "compilers" for XML documents. By compiling I mean reading the XML input and turning it into raw, ready-to-use data. To clarify what I mean:

<people>
  <person>
    <name>Jörgen</name>
    <age>27</age>
  </person>
  <person>
    <name>Mr Who</name>
    <age>123</age>
  </person>
</people>

This particular document should be representable using a C++ structure like this:

#include <list>
#include <string>

struct person {
    std::string name;
    int age;
};
typedef std::list<person> people;  // one element per <person>

This can all be done using techniques such as DOM or SAX plus some handwritten code. However, writing code like this for large and complex XML documents is tedious and quite error-prone. Processing DOM trees is a simple but lazy way to deal with it: DOM trees are bloated and contain a lot of information that is redundant for the application. I am sure it is possible to write a parser generator which

reads XSD definitions
generates C++ structures based on XSD information
generates a SAX based parser/compiler which instantiates the C++ structures and populates them with data from the XML document
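
The three steps above could produce something like the following. This is only a sketch of what generated code might look like, not the output of any existing tool: the class people_handler and its callbacks start_element, characters and end_element are hypothetical names that imitate typical SAX events and are not tied to a real parser library. The helper parse_demo feeds the events by hand, where a real SAX parser would deliver them.

```cpp
#include <list>
#include <string>

struct person {
    std::string name;
    int age = 0;
};
typedef std::list<person> people;

// Hypothetical handler, as a generator might emit it for the <people> schema.
class people_handler {
public:
    void start_element(const std::string& name) {
        text_.clear();                        // start collecting fresh text
        if (name == "person") current_ = person();
    }
    void characters(const std::string& chunk) { text_ += chunk; }
    void end_element(const std::string& name) {
        if (name == "name") current_.name = text_;
        else if (name == "age") current_.age = std::stoi(text_);
        else if (name == "person") result_.push_back(current_);
        text_.clear();
    }
    const people& result() const { return result_; }

private:
    people result_;     // finished records
    person current_;    // record currently being filled in
    std::string text_;  // character data of the current element
};

// Feeds the events for one <person> by hand (an assumption for illustration).
people parse_demo() {
    people_handler h;
    h.start_element("people");
    h.start_element("person");
    h.start_element("name"); h.characters("Mr Who"); h.end_element("name");
    h.start_element("age");  h.characters("123");    h.end_element("age");
    h.end_element("person");
    h.end_element("people");
    return h.result();
}
```

The handler keeps only one record in progress and the finished list, so no element names, parent links or source text survive parsing. The sketch assumes the document has already been validated against the schema, which is why it does no error handling.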

I'm guessing such code would be quite non-bloated and fast (at least compared to DOM). Does anybody know if this has been done? -- Tune your mind, reach inside, peel away Touch, Taste, Feel, Saturation

Michael A Barnhart

Just some comments. I am not sure whether exactly what you ask for has been implemented, but code that creates XSD from XML files and builds entry forms exists, so the basics are there. Now I have a question. You still need to know the structure to program against, so maybe I am just missing some clever implementation, but you still have to write code to work with that specific structure, correct? And since it holds all of the information, it is not drastically different from the DOM model. Given that, code like Kristen Wegner's PUGXML is not that much different from what I hear you asking for. Yes, you still step down through the DOM tree, but you would in your classes too. You are trading the C++ class overhead for the overhead of the XML file in memory. IMHO gaining familiarity with a DOM or SAX implementation would be more profitable than having to relearn each custom generated class. Am I missing your intent?

Jorgen Sigvardsson

Michael A. Barnhart wrote: You still need to know the structure to program against
Yes, this is correct. But if you do have a schema for the data, that is not really a problem, is it? The schema is a beautiful thing: it has structure and it has types - all of them transferable into C++.

Michael A. Barnhart wrote: and since it has all of the information it is not drastically different than the DOM model.
The DOM model must maintain a lot of unnecessary information which I am not interested in, at least if you take into account that I already know the structure and datatypes. If I take a look at IXMLDOMNode in MSXML 4, for instance, I find that most of its properties are totally unnecessary if I already know the structure and type information of the data to begin with! Names, for instance, are maintained by the compiler (names of structs and members); "other node references" such as first, last and parent are not interesting - I already know the structure. Not to mention the underlying XML text! I guess what I want to do is pretty much what template programmers want to do: bind as much information as possible at compile time rather than at run time.

Michael A. Barnhart wrote: You are trading the C++ class overhead for the overhead of the XML file in memory.
Well, compared to a DOM document, I am sure I'd have huge advantages. I'd have a "distilled" collection of data, while the DOM tree would have a very verbose collection of data plus a text copy of the entire XML document in memory.

Michael A. Barnhart wrote: IMhO gaining familiarity with a DOM or SAX implementation would be more profitable than having to relearn each custom generated class. Am I missing your intent?
I'm basically saying that DOM and SAX are crude interfaces for dealing with entire documents if you don't want to do fancy operations such as transformations. If you are only interested in the data, then C++ structures and STL datatypes and containers would give you a very nice interface - it would basically model the data 1:1. A DOM interface models a general tree, which is also untyped. Sure, you've had the DOM tree validated against an XSD, but you'll still have to do conversions etc. In essence, I just want a faster and more static version of DOM. :)
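
To make the typed-versus-untyped point concrete, here is a small sketch. The node struct is a hypothetical stand-in for a generic DOM-like tree and is not the MSXML interface; age_from_tree and age_from_struct are likewise made-up names for illustration.

```cpp
#include <list>
#include <string>
#include <vector>

// Hypothetical generic node: every value is text and every lookup is done
// by name at run time.
struct node {
    std::string name;
    std::string text;
    std::vector<node> children;

    // Returns the first child with the given name, or nullptr if absent.
    const node* child(const std::string& wanted) const {
        for (const node& c : children)
            if (c.name == wanted) return &c;
        return nullptr;
    }
};

struct person {
    std::string name;
    int age = 0;
};

// Generic access: a name lookup, a null check and a string-to-int conversion.
int age_from_tree(const node& person_node) {
    const node* age = person_node.child("age");
    return age ? std::stoi(age->text) : 0;
}

// Typed access: the compiler already knows the member and its type.
int age_from_struct(const person& p) { return p.age; }
```

With the generic tree, a misspelled element name or a non-numeric age only shows up at run time; with the struct, the member name and type are checked when the program is compiled.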

Michael A Barnhart

Jörgen Sigvardsson wrote: In essence, I just want a faster and more static version of DOM.
OK, and I see the benefit of doing this if it is something you run all of the time. Much of what I have to work with is not as stable as one would like, so flexibility is my biggest driver. I do not see the typical developer doing this a lot of the time. You would create a class or struct for the needed data and give it XML import/export functions, SAX event-driven if anything but small files are handled. So you save a few hours when adding a new data structure. How often do you expect to add new data structures? Yes, if one person takes the weeks to build it, many can benefit. Truly covering all of the schema variations could be a bit taxing on someone. :-D
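
For comparison, the hand-written approach described here might have an export half like the sketch below. The function names escape_xml and to_xml are made up for illustration, and the person struct is the one from the original question.

```cpp
#include <list>
#include <sstream>
#include <string>

struct person {
    std::string name;
    int age = 0;
};
typedef std::list<person> people;

// Escapes the characters that are special in XML character data.
std::string escape_xml(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;
        }
    }
    return out;
}

// Serializes the list back into the <people> document shape.
std::string to_xml(const people& ps) {
    std::ostringstream os;
    os << "<people>";
    for (const person& p : ps) {
        os << "<person><name>" << escape_xml(p.name) << "</name>"
           << "<age>" << p.age << "</age></person>";
    }
    os << "</people>";
    return os.str();
}
```

Every new data structure needs its own pair of functions like this, which is the repetitive work a generator would take over.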

Stuart Dootson

If you mean something like xsd.exe, which comes with Visual Studio .NET, or JAXB, you're out of luck currently as far as free tools go (of course, xsd.exe isn't free, but you may well have VS.NET anyway...). However, xsd does say that it'll support anything that implements System.CodeDom.Compiler.CodeDomProvider, which C++ may do when Visual Studio 2003 arrives. Of course, it would still only be Managed C++... Other alternatives (all for Java :-() are Castor, Jaxme and Jibx. Of course, if you've got money to spend, there are C++ options like xmlspy and RogueWave's XML Object Link.
Stuart Dootson 'Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p'

Jorgen Sigvardsson

Thank you very much for this information! I was beginning to sketch an implementation very much like RogueWave's XML Object Link. I think I will look into this CodeDom stuff. Maybe I could write my own classes to generate "pure" C++. Looks like I can forget that. :)

Anonymous

XML data binding tools provide this, but you need to describe your XML in a schema (XSD/XDR/DTD). http://www.rpbourret.com/xml/XMLDataBinding.htm lists the available tools. I use the wizard from Liquid Technologies: http://www.liquid-technologies.com/Products/LXDBWizard.htm