Ruminations on parsing.
-
All this talk of Context-Free Grammars has me wondering if there's such a thing as a context-dependent grammar?
The difficult we do right away... ...the impossible takes slightly longer.
There is. Chomsky type 1 (context-sensitive) and type 0 (unrestricted) languages are context dependent; they're what you'd use to model human language. You can represent them with an Earley grammar, but an Earley grammar is not practical to parse on a real system - it's strictly theoretical. It's also possible to parse context-sensitive languages with a context-free grammar using a GLR parser. Without the context such a grammar is ambiguous, so you will get multiple trees back, and your job is then to decide which tree is valid.
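To make the "multiple trees" point concrete, here's a naive sketch in C# (not a real GLR implementation, just an exhaustive enumerator I wrote for illustration) that produces every parse of an ambiguous expression grammar - the same set a GLR parser would hand back as a forest:

using System;
using System.Collections.Generic;

class AmbiguousParseDemo
{
    // All parses of E ::= E "-" E | ident, rendered as parenthesized strings.
    // A GLR parser would return this same set as a shared parse forest.
    static List<string> Parses(string[] toks, int lo, int hi)
    {
        var results = new List<string>();
        if (lo == hi) { results.Add(toks[lo]); return results; }
        // Try every "-" (odd index) as the top-level operator.
        for (int i = lo + 1; i < hi; i += 2)
            foreach (var l in Parses(toks, lo, i - 1))
                foreach (var r in Parses(toks, i + 1, hi))
                    results.Add($"({l} - {r})");
        return results;
    }

    static void Main()
    {
        var toks = new[] { "a", "-", "b", "-", "c" };
        foreach (var tree in Parses(toks, 0, toks.Length - 1))
            Console.WriteLine(tree); // ((a - b) - c) and (a - (b - c))
    }
}

With real context ("-" is left-associative, say) you'd keep ((a - b) - c) and throw the other tree away.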
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
All this talk of Context-Free Grammars has me wondering if there's such a thing as a context-dependent grammar?
The difficult we do right away... ...the impossible takes slightly longer.
Sure, English is a prime example. Note the "prime" ambiguity that can only be solved in a context-dependent manner. Some might argue that English doesn't really have a grammar per se, but mostly a collection of use cases and exceptions. :laugh:
Mircea
-
Probably a lot of you did this in school: you take a context-free grammar in some variant of BNF or EBNF and use it to generate parse tables you can use to parse structured text, like programming languages. After studying it off and on for years, teaching myself the concepts, building code generators, and using parser generators already out there, I've come to the following conclusions:
1. For any non-LL(1) language of non-trivial complexity - say, a programming language - it is virtually always worth it to hand roll your own parser. The code can be as much as an order of magnitude smaller and more flexible, and that flexibility is pretty much required for anything real-world, since even languages with simplistic syntax like C need symbols introduced into the parse table dynamically, while parsing. (A minimal sketch of the hand-rolled approach follows below.)
2. Even given #1, it may be worth it to write a context-free grammar for the language anyway and use generated code to test the hand-rolled code. A parser built by a generator can check your hand-rolled parser for correctness, and the CFG doubles as documentation.
3. Despite its power, bottom-up parsing is not as elegant as top-down parsing; #1 still applies, and you cannot realistically build a bottom-up parser without generating it.
4. I spent a long time coming up with the above three little points. It was expensive, even as experience goes. I'm still trying to decide whether it was worth it, all told, when I factor in how much computer science I learned in the process. I didn't get saddled financially for it though, so yay for that.
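To give point #1 some shape, here's a minimal recursive descent sketch - a toy grammar I picked for illustration (expr ::= num (('+'|'-') num)*), not any particular real language. A real parser is the same shape, just with many more methods:

using System;

// Minimal recursive descent parser for: expr ::= num (('+'|'-') num)*
// Each grammar rule becomes one method; no tables, no generator.
class Rd
{
    readonly string s; int pos;
    Rd(string text) { s = text; }

    char Peek => pos < s.Length ? s[pos] : '\0';

    int Num()
    {
        int start = pos;
        while (char.IsDigit(Peek)) pos++;
        if (pos == start) throw new Exception($"digit expected at {pos}");
        return int.Parse(s.Substring(start, pos - start));
    }

    int Expr()
    {
        int left = Num();
        while (Peek == '+' || Peek == '-')
        {
            char op = s[pos++];
            int right = Num();
            left = op == '+' ? left + right : left - right;
        }
        return left;
    }

    static void Main() => Console.WriteLine(new Rd("1+2-3").Expr()); // 0
}

Note how adding a special case (context!) is just an if statement in the relevant method, which is exactly what's so painful to do in a generated table.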
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
:thumbsup: To parse C++, I used recursive descent and even wrote the lexical analysis routines from scratch. Fixing bugs is fairly straightforward, but I wouldn't have a clue how to fix them in a bottom-up parser. It's about 13K lines of code, including comments and blanks, and anyone familiar with C++ can look at the code and probably figure it out with relative ease. The "compiler" part, however, which has to understand what the parsed code is doing in order to perform static analysis, is much larger, probably 3x that size. robust-services-core/src/ct/Parser.cpp · GitHub[^]
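A hand-written lexer really is the approachable part. A minimal sketch of the idea in C# (a toy token set of my choosing, nothing like the actual robust-services-core code):

using System;
using System.Collections.Generic;

enum Tok { Ident, Number, Punct, End }

// A tiny hand-rolled lexer: one pass, one branch per character class.
// Real lexers add keywords, comments, and string literals the same way.
class Lexer
{
    readonly string src; int pos;
    public Lexer(string s) { src = s; }

    public (Tok kind, string text) Next()
    {
        while (pos < src.Length && char.IsWhiteSpace(src[pos])) pos++;
        if (pos >= src.Length) return (Tok.End, "");
        int start = pos;
        char c = src[pos];
        if (char.IsLetter(c) || c == '_')
        {
            while (pos < src.Length && (char.IsLetterOrDigit(src[pos]) || src[pos] == '_')) pos++;
            return (Tok.Ident, src.Substring(start, pos - start));
        }
        if (char.IsDigit(c))
        {
            while (pos < src.Length && char.IsDigit(src[pos])) pos++;
            return (Tok.Number, src.Substring(start, pos - start));
        }
        pos++;
        return (Tok.Punct, c.ToString());
    }
}

Calling Next() in a loop until Tok.End turns "x1 + 42" into Ident("x1"), Punct("+"), Number("42").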
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.
-
:thumbsup: To parse C++, I used recursive descent and even wrote the lexical analysis routines from scratch. Fixing bugs is fairly straightforward, but I wouldn't have a clue how to fix them in a bottom-up parser. It's about 13K lines of code, including comments and blanks, and anyone familiar with C++ can look at the code and probably figure it out with relative ease. The "compiler" part, however, which has to understand what the parsed code is doing in order to perform static analysis, is much larger, probably 3x that size. robust-services-core/src/ct/Parser.cpp · GitHub[^]
Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.
Yeah, in my Slang parser the parsing was the easy part. Establishing a CodeDOM-based AST after the fact was not. For example:
class Foo {
    static public int Bar;
}
If I do Foo.Bar, do I create a CodeFieldReferenceExpression or a CodePropertyReferenceExpression, or even a CodeMethodReferenceExpression? I have to crawl the existing code I've already resolved to an AST to determine what Bar is, so I can then make the appropriate object. It's not compiling. It's doing something completely different, but it's very compiler-like.
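A minimal sketch of that decision, using the System.CodeDom types in question. The MemberKind dictionary here is hypothetical, standing in for the real crawl over the already-resolved AST:

using System.CodeDom;
using System.Collections.Generic;

enum MemberKind { Field, Property, Method }

class MemberResolver
{
    // Filled in by crawling the AST built so far, e.g. "Foo.Bar" -> Field.
    readonly Dictionary<string, MemberKind> known = new Dictionary<string, MemberKind>();

    public void Declare(string qualifiedName, MemberKind kind) => known[qualifiedName] = kind;

    // Pick the CodeDOM node that matches what the member actually is.
    public CodeExpression Reference(CodeExpression target, string typeName, string member)
    {
        switch (known[typeName + "." + member])
        {
            case MemberKind.Field: return new CodeFieldReferenceExpression(target, member);
            case MemberKind.Property: return new CodePropertyReferenceExpression(target, member);
            default: return new CodeMethodReferenceExpression(target, member);
        }
    }
}

After Declare("Foo.Bar", MemberKind.Field), a Reference(new CodeTypeReferenceExpression("Foo"), "Foo", "Bar") call yields the field-reference node rather than one of the other two.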
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
jschell wrote:
If you are going to call it a language then you probably really must do that.
Umm, I do? Lexicon:
Context-Free-Grammar/CFG - the document describing the structure of the language.
Language - a Chomsky type 2 language, describable with a CFG.
Parser - a stack-based FA machine (a pushdown automaton) that can, given an input grammar, parse the corresponding language.
Maybe you just didn't understand me or something.
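To make "stack-based FA machine" concrete, here's a minimal table-driven sketch, assuming a toy grammar S ::= '(' S ')' | 'x' with the table hand-encoded rather than generated:

using System;
using System.Collections.Generic;

// Table-driven LL(1) parse of S ::= '(' S ')' | 'x'.
// The "FA plus a stack" is literally a Stack<char> driven by a lookup.
class PdaSketch
{
    static bool Parse(string input)
    {
        var stack = new Stack<char>();
        stack.Push('S');                       // start symbol
        int i = 0;
        while (stack.Count > 0)
        {
            char top = stack.Pop();
            char look = i < input.Length ? input[i] : '$';
            if (top == 'S')                    // nonterminal: expand via table
            {
                if (look == '(') { stack.Push(')'); stack.Push('S'); stack.Push('('); }
                else if (look == 'x') stack.Push('x');
                else return false;             // no table entry: syntax error
            }
            else                               // terminal: must match input
            {
                if (top != look) return false;
                i++;
            }
        }
        return i == input.Length;              // all input consumed
    }

    static void Main()
    {
        Console.WriteLine(Parse("((x))"));     // True
        Console.WriteLine(Parse("((x)"));      // False
    }
}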
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
All this talk of Context-Free Grammars has me wondering if there's such a thing as a context-dependent grammar?
The difficult we do right away... ...the impossible takes slightly longer.
Interesting question. So I went looking. The following programming language claims that at least some of it is context dependent. Chapter 4 - Expressions[^] "Words such as sell, at, and read have different meanings in different contexts. The words are relative expressions -- their meaning is context dependent."
-
honey the codewitch wrote:
Umm, I do?
I meant that as a general comment for all of those out there who create their own language. I suspect that at least some of them don't use a BNF.
Oh I see. I thought you meant I was being inconsistent about calling a language a language. BNF and EBNF are just "file formats" for lack of a better term. They are like:
Foo := Bar "+" Baz;
Bar := "Bar";
Baz := { "Baz" }+;
(That's an inexact representation, since the specs are imprecise, kind of like regex.) It's just a format for a context-free grammar specification. EBNF and BNF are the most well-known formats, which is why I mentioned them.
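For comparison, here's my rendering of the same toy grammar in something closer to ISO-style EBNF - and note that, as I said, the specs really do vary:

foo = bar, "+", baz ;
bar = "Bar" ;
baz = "Baz", { "Baz" } ;   (* one-or-more spelled as one followed by zero-or-more *)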
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
Interesting question. So I went looking. The following programming language claims that at least some of it is context dependent. Chapter 4 - Expressions[^] "Words such as sell, at, and read have different meanings in different contexts. The words are relative expressions -- their meaning is context dependent."
It's indeed possible to apply context to a narrow parse path. In fact, even the C grammar requires this, because introducing a new struct or typedef introduces a new non-terminal into the grammar. It can be had by "hacking" the parser in one particular area so that it can apply a specific and narrow kind of context. That's how the context is represented in that particular case.
However, a generalized mechanism for context is not really feasible. Chomsky type 1 and type 0 languages require context throughout in order to parse. They need something like an Earley grammar. Implementations of Earley grammars that have been proposed write new context-free grammars on the fly during the parse. The problem with that is that it takes a long time to turn a CFG into an actual parser - generating the tables takes a lot of time - so it's simply not practical. There are better ways of doing language processing that don't use this tech at all; see AI speech recognition.
Edit: Looks like someone attacked it a different way as well: https://www.sciencedirect.com/science/article/pii/S2590118422000697[^] (paywalled)
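Here's a minimal sketch of that C typedef hack (hypothetical token names of my choosing; real front ends thread this through the lexer): the parser records each typedef name, and identifier classification consults that set.

using System;
using System.Collections.Generic;

// The classic C "lexer hack": identifier vs type-name is decided by
// consulting a symbol table the parser fills in as it sees typedefs.
class TypedefHack
{
    static readonly HashSet<string> TypeNames = new HashSet<string> { "int" };

    static string Classify(string word) =>
        TypeNames.Contains(word) ? "TYPE_NAME" : "IDENTIFIER";

    static void Main()
    {
        Console.WriteLine(Classify("uint32"));   // IDENTIFIER
        TypeNames.Add("uint32");                 // parser saw: typedef unsigned int uint32;
        Console.WriteLine(Classify("uint32"));   // TYPE_NAME - "uint32 x;" now parses as a declaration
    }
}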
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
It's indeed possible to apply context to a narrow parse path. In fact, even the C grammar requires this, because introducing a new struct or typedef introduces a new non-terminal into the grammar. It can be had by "hacking" the parser in one particular area so that it can apply a specific and narrow kind of context. That's how the context is represented in that particular case.
However, a generalized mechanism for context is not really feasible. Chomsky type 1 and type 0 languages require context throughout in order to parse. They need something like an Earley grammar. Implementations of Earley grammars that have been proposed write new context-free grammars on the fly during the parse. The problem with that is that it takes a long time to turn a CFG into an actual parser - generating the tables takes a lot of time - so it's simply not practical. There are better ways of doing language processing that don't use this tech at all; see AI speech recognition.
Edit: Looks like someone attacked it a different way as well: https://www.sciencedirect.com/science/article/pii/S2590118422000697[^] (paywalled)
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
honey the codewitch wrote:
Looks like someone attacked it a different way as well
Interesting. It goes beyond what you said, pointing out that many languages are a mix. Although perhaps that is not surprising. From the link you posted: "Furthermore, despite the strength of CFGs, some aspects of modern programming languages cannot be modeled with context-free grammars alone as some language constructs depend on the wider context they appear in [12]. Most such cases must be dealt with during or after parsing using more or less formalized techniques, e.g., name resolution, type checking, etc., which are far less formalized than the parser itself."