Musings on generalized and context sensitive parsing

honey the codewitch wrote (#1):

Generalized parsing algorithms are usually complicated to implement because they are non-deterministic, but they are very powerful in that they can parse ambiguous grammars. I really like GLR parsing because it's an elegant solution to a very complicated problem. It uses a difficult-to-implement LR parsing mechanism underneath, but the "G" part of it is a relatively simple addition that supercharges the expressive power of the underlying LR parser and allows it to parse ambiguous grammars, like human language. So when I wrote the Glory GLR parser generator I basically jumped the shark. There's little need for something more expressive, since it already accepts any context-free grammar.

The only way I can think to improve on the algorithm is to implement contextful grammars, but there are practical problems with even attempting this. Contextful parsing (see Context-sensitive grammar on Wikipedia) is not exactly a mathematically solved problem, at least not for everyday parsing. The problem is that a computer is a linear bounded automaton, and those aren't powerful enough to compute every CSG you can throw at them in practical time; only a subset of them can be computed. Recursively enumerable languages like English just aren't practically parsable using traditional means.

My prediction is that a learning system will be how we parse such languages. We'd use machine learning to apply contextful meaning instead of taking a classic mathematical approach to break the language down linguistically and formally for a machine. At best, we'll partially parse using context-free means, like with GLR, and then work on each of the multiple trees we get back using some form of ML: context-free parsing to get our basic structures, then ML to apply the context. I believe some "AI" bot systems out there already take that latter approach, like that ill-fated Microsoft Twitter bot that turned into a Nazi, though I can't be sure without looking at the code. Either way, it's a challenging problem, and I'm not sure I'm up for solving it on my own. I don't even have the hardware I'd need to try it on anything real-world, nor the man-hours and expertise to produce a usable English grammar.

    Real programmers use butterflies
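To make the "multiple trees" idea concrete, here is a minimal sketch in Python, assuming the toy ambiguous grammar S -> S '+' S | 'a'. This is a naive memoized chart enumerator, not Glory's GLR machinery, and the parses function is invented for illustration, but the output is the same in spirit: every parse tree for an ambiguous input rather than just one.

from functools import lru_cache

TOKENS = tuple("a+a+a")   # input for the ambiguous grammar S -> S '+' S | 'a'

@lru_cache(maxsize=None)
def parses(lo, hi):
    """Return every parse tree deriving TOKENS[lo:hi] from S."""
    trees = []
    if hi - lo == 1 and TOKENS[lo] == "a":
        trees.append("a")                     # S -> 'a'
    for mid in range(lo + 1, hi - 1):         # S -> S '+' S, split at each '+'
        if TOKENS[mid] == "+":
            for left in parses(lo, mid):
                for right in parses(mid + 1, hi):
                    trees.append(("+", left, right))
    return tuple(trees)

for tree in parses(0, len(TOKENS)):
    print(tree)
# ('+', 'a', ('+', 'a', 'a'))   i.e. a+(a+a)
# ('+', ('+', 'a', 'a'), 'a')   i.e. (a+a)+a

The two trees printed at the end are exactly the ambiguity a deterministic LR parser would be forced to resolve (or reject) up front; a generalized parser keeps both and leaves the choice to a later stage.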


Member 12982558 wrote (#2):

Well, there is of course a long history of trying to parse natural languages, and a variety of extensions to CFGs to handle that, usually in the form of attribute grammars of one kind or another. One form would be to add predicates to the grammar, where the predicate is used to select the direction of the parse (this is basically why one prefers a hand-written recursive descent parser, since you can mix predicates with symbol recognizers).

You might want to have a look at 2VW grammars (two-level van Wijngaarden grammars); they were used in the description of Algol 68. They are not "just" attribute grammars, since in most common attribute grammar descriptions the parse tree is annotated and the attribute grammar is constrained such that the attributes can be computed in a finite (and preferably precomputed) number of passes. 2VW grammars do contain (a kind of) predicates that allow (or forbid) certain derivations.

Since you seem to like exotic parsing techniques, handling some form of generalized attributes during the building of the parse tree seems an exercise to keep you busy for a couple of {weeks | months | years} (select one).
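A minimal sketch of that predicate idea, boiled down to a single decision point in Python (typedefs and parse_statement are invented names, not anything from the thread): the classic C-style statement "a * b;" parses as a pointer declaration or a multiplication depending purely on left context, and the predicate is just a symbol-table lookup.

# "a * b;" is a declaration if 'a' names a type, a multiplication otherwise.
typedefs = {"T"}   # names declared as types earlier in the input (left context)

def parse_statement(tokens):
    """Parse ['name', '*', 'name', ';'] into a labeled tree."""
    name, star, rhs, semi = tokens
    assert star == "*" and semi == ";"
    if name in typedefs:                       # predicate selects the production
        return ("declare-pointer", name, rhs)  # T *b;
    return ("multiply", name, rhs)             # a * b;

print(parse_statement(["T", "*", "b", ";"]))   # ('declare-pointer', 'T', 'b')
print(parse_statement(["a", "*", "b", ";"]))   # ('multiply', 'a', 'b')

In a full recursive descent parser this check would sit at the start of the statement rule, which is exactly why hand-written parsers make mixing predicates with symbol recognizers so easy.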


honey the codewitch wrote (#3):

I allowed for some measure of attributed grammars that could direct the parse in Parsley, but I've not found a practical way to implement two-level van Wijngaarden grammars (with all the rewriting involved). And while they probably were used in the description of Algol 68, I bet it was human hands that implemented that grammar. That's my problem: I'd want to cut out the human step. The GLR parser simply returns multiple trees for the different contexts, and it's up to the consumer of those trees to select the right one (and consequently to apply the context). That's another route to go rather than 2VW, but 2VW does appeal to me in that it maintains formality, so I think it's possible, if not realistic, to compute parsers for those grammars. However, they'd be slow.

        Real programmers use butterflies


Member 12982558 wrote (#4):

Indeed, the A68 description was person-made, not mechanical. It would be interesting, though, to see under what degree of constraints it would be possible to handle it mechanically. The simplest variant is of course an underlying LL(1) grammar with predicates (or general functions) between symbols in the rules, predicates that depend solely on the left context. However, as soon as you are able (which you are) to generate a multitude of parse trees, you can evaluate constraints in a later stage and delete the trees where the predicate results are false. The question is of course not whether it is theoretically possible but whether you can formulate (levels of) constraints to make it practical. For AGs it is pretty well known how to limit them such that the attributes can be computed in a given number of scans over the tree; so given a finite number of trees, and a finite number of scans per tree to see whether the tree is viable, extracting the valid tree is, in principle, solvable.

Assuming that in natural language processing the basic elements to parse and interpret are sentences, that should not give too many problems. I do not know much about natural language processing; I do know, though, that most sentences can be parsed in (many) different ways depending on the context. Context here in a very broad sense! (Due to sloppiness by most speakers, the sentences spoken are inherently ambiguous and require knowledge of the speaker's background to extract the precise intention of the spoken words.) My examples would be in Dutch, so probably not very meaningful to you.

Anyway, good luck with your parser (parsing) development. Although I am not very active in that field anymore, I'll keep an eye on your progress.
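A minimal sketch of that later filtering stage, reusing the toy ('+', left, right) tree shape from the sketch after the first post. The left_associative predicate is an invented stand-in for whatever contextual constraint a real grammar would impose: parse everything first, then delete the trees the predicate rejects.

def left_associative(tree):
    """Reject any tree in which '+' has another '+' as its right child."""
    if tree == "a":
        return True
    _, left, right = tree
    if isinstance(right, tuple):      # right-nested '+': constraint fails
        return False
    return left_associative(left)

forest = [("+", ("+", "a", "a"), "a"),   # (a+a)+a
          ("+", "a", ("+", "a", "a"))]   # a+(a+a)
viable = [t for t in forest if left_associative(t)]
print(viable)                            # only (a+a)+a survives

Because the predicate runs in a bounded number of scans over each tree, the whole stage is finite as long as the forest is, which is the "in principle, solvable" observation above.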


honey the codewitch wrote (#5):

            Member 12982558 wrote:

However, as soon as you are able (which you are) to generate a multitude of parse trees, you can evaluate constraints in a later stage and delete the trees where the predicate results are false.

GLR parsing allows for precisely that, but the trick is determining which tree to use. :-D As I understand it, parsing with 2VW is generally of polynomial time complexity. I'm not sure how that shakes out in the real world, as I haven't implemented 2VW, but my initial thought is that it's probably impractical for large, real-world grammars. I am sure that grammar size is related to performance, given how it works.

            Real programmers use butterflies
