Bottom Up Parsing is Frustrating

honey the codewitch

The way it works is it takes a series of tokens/lexemes from an input stream and uses pattern matching to discern whether they match various rules like: Foobar -> "foo" "bar" Such that when it sees "foo" followed by "bar" in the input stream it knows the result is a Foobar The upside of this is it's got a whole lot more recognizing power than starting from the point of Foobar and then checking the input stream for "foo" followed by "bar" (basically the reverse of the aforementioned process) The downside is it means the tree is built from leaves to the root, meaning bottom up, meaning your recursive tree building is effectively "inside out", building the children first. That's not normally a problem but it has some important ramifications, like if there's an error token in the tree it can't readily resolve the rest of the tree once encountered. The other issue is it's more confusing from the standpoint of the end user of the parser. Sure once it's in the tree, it's all the same, but if you're using a pull parser rather than loading everything into an in memory tree, the pull parser for a bottom up parser is just weird to use. And then the primary issue I have with them is that the algorithms to construct the parse tables are incomprehensible. I've written them before and I still hesitate to say I understand them. I have code that does it that I won't touch because of it. It's not the fault of the code - it's the algorithm that's ugly. The math is atrocious - it feels as though designed while drunk except that it's so bloody clever - too clever to understand by mere mortals. :sigh: I don't like it. Unfortunately it's a very practical way to parse, so it can't really be overlooked.

Real programmers use butterflies

Greg Utas

That's why this[^] uses a recursive descent parser. It's not as efficient as bottom-up, but it's decent enough if the possibilities are tried in probabilistic order, especially in this day and age of code that uses CPU time and memory frivolously. The code is the grammar and is usually easy to fix when a bug is found. No need to get the grammar close to right from the outset, and no need for an inscrutable algorithm to generate parse tables. Skipping over errors can still be difficult, but it depends where they occur.

Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.

honey the codewitch

You get most of those same advantages using LL(1). Besides, if your grammar language is easy enough, grammars aren't so bad. Here's one for JSON:

Json= Object | Array;
Object= "{" [ Field { "," Field } ] "}";
Field= String ":" Value;
Array= "[" [ Value { "," Value } ] "]";
Value= String |
Number |
Object |
Array |
Boolean |
Null ;

Number= '\-?(0|[1-9][0-9]*)(\.[0-9]+)?([Ee][\+\-]?[0-9]+)?';
String = '"([^\0-\x1F"\\]|\\([\\/bfnrt]|u[a-fA-F0-9]{4}))*"';
Boolean = 'true|false';
Null="null";
Lbracket="[";
Rbracket="]";
Lbrace="{";
Rbrace="}";
Colon=":";
Comma=",";
Whitespace='[ \t\r\n\f\v]+';

See? no thing. All the attributes in there are just to prune the parse tree. "collapsed" removes items from the parse tree once they are parsed so you don't have to deal with them in your code.

Real programmers use butterflies

Greg Utas

I'd never seen a JSON grammar, so thanks for that. :thumbsup: But the one I'm dealing with is C++, which is surely a dog's breakfast. :)

Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.

honey the codewitch

Yeah, for C++ for a number of reasons it's best to use a hand written parser.

Real programmers use butterflies