parsing interactive text
-
Member 14968771 wrote:
what to ask Mrs Google to help me write a C++ code to
Try: "How to learn C++". Gives me about 196,000,000 results.
Mircea
-
Member 14968771 wrote:
is "tokenization" a good search word ??
Fair enough, but that is only the very beginning: a precursor to parsing. Tokenization is chopping the input text into atomic pieces, with no concern for how they are put together. All the tokenizer knows is how to delimit a symbol (token): a word symbol starts with an alphabetic character, continues through alphanumerics, and ends at the first non-alphanumeric. The tokenizer doesn't know or care whether the word is a variable name, a reserved word or something else. If it finds a digit, it devours digits. If the first non-digit is a math operator or a space, it has found an integer token. If it is a decimal point or an E (and the language permits exponents in literals), the token is a (yet incomplete) float value, and so on. The only language-specific thing the tokenizer needs to know is how to identify the end of a token. Once it has chopped the source code into pieces, its job is done.

Parsing is identifying the structures formed by the tokens: blocks, loops, conditional statements etc.

The borderline isn't necessarily razor sharp. Some would say that when the tokenizer finds an integer literal token, it might as well take on the task of converting it to a binary numeric token value, to be handed to the parser. That might be unsuitable in untyped languages where a numeric literal may be treated as a string. After identifying a word symbol, it might search a table of reserved words, possibly delivering it to the parser as a reserved word token. Again, in some languages this is unsuitable (and lots of people would say it goes far beyond a tokenizer's responsibility).

If you want to analyze some input, doing an initial tokenization before starting the actual parsing is a good idea. Most compilers do that.

Curious memory: One of my fellow students was, in his first job after graduation, set to identify bacteria in microscope photos. That was done by parsing: They had BNF grammars for different kinds of bacteria, and the image information was parsed according to the various grammars. If the number of parsing errors was too high, the verdict was 'Nope - it surely isn't that kind of bacteria, let me try another one!' Those grammars with a low error count were handed over to a human expert for confirmation, or possibly for making a choice between viable alternatives, if two or more grammars gave a low error count. This mechanism took a lot of trivial work off t
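For what it's worth, the tokenizer rules described in this post (a word starts with a letter and runs through alphanumerics; a digit starts a number that runs through digits) can be sketched in C++ roughly like this. It is only a minimal sketch: the names `TokenKind`, `Token` and `tokenize` are made up for illustration, exponents (the "E" case) are left out to keep it short, and requiring a digit after the decimal point is an assumption of the sketch, not something the post specifies. Tokens keep their raw text; nothing is converted to a numeric value and no reserved-word table is consulted.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; a real language would need more.
enum class TokenKind { Word, Integer, Float, Symbol };

struct Token {
    TokenKind kind;
    std::string text;  // raw characters only; no conversion to a number
};

// The <cctype> functions need a value representable as unsigned char.
static bool isAlpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool isDigit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
static bool isAlnum(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
static bool isSpace(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }

std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    const std::size_t n = input.size();
    std::size_t i = 0;
    while (i < n) {
        if (isSpace(input[i])) { ++i; continue; }  // whitespace only separates tokens
        const std::size_t start = i;
        TokenKind kind = TokenKind::Symbol;
        if (isAlpha(input[i])) {
            // Word: a letter, then alphanumerics, ending at the first non-alphanumeric.
            while (i < n && isAlnum(input[i])) ++i;
            kind = TokenKind::Word;
        } else if (isDigit(input[i])) {
            // Number: devour digits; so far it is an integer.
            while (i < n && isDigit(input[i])) ++i;
            kind = TokenKind::Integer;
            // Assumption of this sketch: '.' followed by a digit makes it a float.
            if (i + 1 < n && input[i] == '.' && isDigit(input[i + 1])) {
                ++i;  // consume the decimal point
                while (i < n && isDigit(input[i])) ++i;
                kind = TokenKind::Float;
            }
        } else {
            ++i;  // anything else becomes a one-character symbol token
        }
        tokens.push_back({kind, input.substr(start, i - start)});
    }
    return tokens;
}
```

With this sketch, `tokenize("x1 = 42 + 3.14")` gives five tokens: the word `x1`, the symbol `=`, the integer `42`, the symbol `+` and the float `3.14`. Note that `while` would come out as a plain word; deciding that it is a reserved word is left to a later stage, as discussed above.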
-
trønderen wrote:
compiling bacteria!
Is that making bugs from bacteria, or vice versa?
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
-
Thank you very much for such an extensive reply. Very unexpected, considering the other "clown contributions". I hope they, the other replies, are not an indicator of this site turning into social media... I have started my coding, and it looks as if I have to parse out non ASCII alphanumeric characters first.
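If "parse out" here means dropping every character that is not an ASCII letter or digit before tokenizing, a minimal C++ sketch of that first pass might look like the following. The helper names `isAsciiAlnum` and `keepAsciiAlnum` are hypothetical, and collapsing each run of dropped characters into a single space (so that neighbouring words stay separated) is an assumption of the sketch, not a requirement stated in the thread.

```cpp
#include <string>

// True only for the ASCII ranges 0-9, A-Z and a-z; independent of the locale.
bool isAsciiAlnum(char c) {
    return (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

// Hypothetical helper: keeps ASCII letters and digits, and replaces each run
// of any other characters with one space. Leading and trailing runs are dropped.
std::string keepAsciiAlnum(const std::string& input) {
    std::string out;
    out.reserve(input.size());
    bool pendingSpace = false;  // set when characters were skipped since the last kept one
    for (char c : input) {
        if (isAsciiAlnum(c)) {
            if (pendingSpace && !out.empty()) out += ' ';
            out += c;
            pendingSpace = false;
        } else {
            pendingSpace = true;
        }
    }
    return out;
}
```

For example, `keepAsciiAlnum("  abc123  ")` returns `"abc123"`. Bytes of multi-byte UTF-8 characters are simply treated as "not ASCII alphanumeric" and dropped, which may or may not be what the input needs.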
-