Who here knows what a pull parser or pull parsing is?
-
honey the codewitch wrote:
our parsers are fundamentally different
Yes. I suppose the biggest conceptual difference between our parsers is that I needed to write a fairly general loader utility which could read a "script" and perform the tasks, rather than several purpose-built utilities, one for each file to be loaded. The ability to have it support CSV (and XML) as well as JSON was an afterthought.
honey the codewitch wrote:
I parse no numbers, no strings, nothing, unless you actually request it.
Well, mine too. It does have to tokenize so it knows when it finds something you want it to parse, but nothing more than that until it finds a requested array. If the script being run says, "if you find the start of an array named 'Widgets', then do this with it", then the parser has to know "I just found an array named 'Widgets'".
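Roughly, what I mean looks like this; a simplified sketch with made-up names, not the actual loader code, just the "tokenize only far enough to recognize the start of a requested array" part:

#include <cctype>
#include <cstdio>
#include <istream>
#include <string>

// Reads one quoted JSON token, assuming 'in' is positioned on the opening quote.
// Escapes are skipped over but not interpreted; the text stays a raw string.
static std::string readString(std::istream& in)
{
    std::string s;
    in.get();                                   // consume the opening quote
    for (int ch = in.get(); ch != EOF && ch != '"'; ch = in.get())
    {
        if (ch == '\\') { s.push_back((char)ch); ch = in.get(); }
        s.push_back((char)ch);
    }
    return s;
}

// Scans forward until it sees '[' following a property name equal to 'wanted'.
// Nothing is converted along the way; tokens are only recognized.
bool findArray(std::istream& in, const std::string& wanted)
{
    std::string lastString;                     // most recently read string token
    bool colonAfterString = false;              // was that string followed by ':' ?
    for (int ch = in.peek(); ch != EOF; ch = in.peek())
    {
        if (std::isspace((unsigned char)ch)) { in.get(); continue; }
        if (ch == '"')      { lastString = readString(in); colonAfterString = false; }
        else if (ch == ':') { in.get(); colonAfterString = true; }
        else if (ch == '[')
        {
            in.get();
            if (colonAfterString && lastString == wanted) return true;
            colonAfterString = false;
        }
        else                { in.get(); colonAfterString = false; }
    }
    return false;
}

When that returns true, the stream is sitting just inside the array named "Widgets" (or whatever the script asked for), and only then does the script decide what to do with its elements.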
honey the codewitch wrote:
you are normalizing unconditionally at the parser level
Well, I suppose so, insofar as I make values (or names) out of every token, but at that point they're just strings -- name/value pairs with a type -- they're not parsed. I throw only the strings we want at SQL Server, and it handles any conversions to numeric or other types; the loader has no say in that. The loader has no say in data normalization either; it's just passing values as SQL parameters. Again, I want nearly every value in the file to go to the database, so of course I wind up with every value and throw them all at SQL Server. It may be a misunderstanding of terms, but in my opinion no actual "parsing" is done until the (string) values arrive at SQL Server -- that's where the determination of which name/value pairs go where, what SQL datatype they should be, and so on happens. The loader utility has no knowledge of any of that.
I'm using parsing in the traditional CS sense of imposing structure on a lexical stream based on patterns in said stream.
Real programmers use butterflies
-
PIEBALDconsult wrote:
Well, mine too. It does have to tokenize so it knows when it finds something you want it to parse,
I have other ways of finding something. I switch to a fast matching algorithm where I basically look for a quote as if the document were a flat stream of characters rather than a hierarchical, ordered structure of logical JSON elements. That's what I mean by partial parsing, and part of what I mean by denormalized searching/scanning. It ignores swaths of the document until it finds what you want. For example:
reader.skipToField("name",JsonReader::Forward);
This performs the type of flat match that I'm talking about.
reader.skipToField("name",JsonReader::Siblings);
This performs a partially flat and partially structured match, looking for "name" on this level of the object hierarchy.
reader.skipToField("name",JsonReader::Descendants);
This does a nearly flat match, but basically counts '{' and '}' so it knows when to stop searching. I've simplified the explanation of what I've done, but that's the gist. I also don't load strings into memory at all when comparing them. I compare one character at a time straight off the "disk" so I never know the whole field name unless it's the one I'm after.
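Stripped way down, those last two ideas look something like this; it's an illustration rather than the real JsonReader code, and it deliberately ignores escapes and doesn't verify that a ':' follows the matched name:

#include <cstdio>
#include <istream>
#include <string>

// Compares the quoted name at the current position against 'wanted' one
// character at a time, consuming it either way. The candidate name is never
// buffered, so we never "know" it unless it's the one we're after.
static bool matchName(std::istream& in, const std::string& wanted)
{
    in.get();                                   // opening quote
    size_t i = 0;
    bool match = true;
    for (int ch = in.get(); ch != EOF && ch != '"'; ch = in.get())
    {
        if (!match) continue;                   // keep consuming, stop comparing
        if (i >= wanted.size() || (char)ch != wanted[i]) match = false;
        else ++i;
    }
    return match && i == wanted.size();
}

// Descendants-style scan: a nearly flat search for 'wanted' that counts
// '{' and '}' only so it knows when the object it started in has closed.
bool skipToDescendantField(std::istream& in, const std::string& wanted)
{
    int depth = 1;                              // assume we start just inside a '{'
    for (int ch = in.peek(); ch != EOF && depth > 0; ch = in.peek())
    {
        if (ch == '"')
        {
            if (matchName(in, wanted)) return true;
        }
        else
        {
            if (ch == '{') ++depth;
            else if (ch == '}') --depth;
            in.get();
        }
    }
    return false;
}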
Real programmers use butterflies
-
It's usually used in reference to XML parsers, but it's a generic parsing model that can apply to parsing anything. Contrast .NET's XmlTextReader (a pull parser) with a SAX XML parser (a push parser). The reason I ask is that I use the term a lot in my articles lately, and I'm trying to figure out whether it might be worth writing an article about the concept. I don't want to waste time on it if it's something most people have heard of before. It's hard for me to judge because I did a deep dive into parsing for a year and everything is familiar to me now.
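To make the contrast concrete, here's a toy sketch of the two shapes of API. It isn't XmlTextReader or SAX, just the control flow, run over a fake comma-separated "document":

#include <functional>
#include <iostream>
#include <sstream>
#include <string>

// Pull: the consumer owns the loop and asks for the next token whenever it's ready.
class PullReader
{
    std::istringstream in_;
    std::string current_;
public:
    explicit PullReader(std::string doc) : in_(std::move(doc)) {}
    bool read() { return static_cast<bool>(std::getline(in_, current_, ',')); }
    const std::string& current() const { return current_; }
};

// Push: the parser owns the loop and fires the consumer's handler for every
// token from start to finish. Note it's built on top of the pull reader;
// layering push over pull is easy, the reverse is not.
void pushParse(std::string doc, const std::function<void(const std::string&)>& onToken)
{
    PullReader reader(std::move(doc));
    while (reader.read()) onToken(reader.current());
}

int main()
{
    // Pull style: stop the moment we find what we want.
    PullReader reader("alpha,beta,gamma");
    while (reader.read())
    {
        if (reader.current() == "beta") { std::cout << "found beta\n"; break; }
    }
    // Push style: the handler gets everything, wanted or not.
    pushParse("alpha,beta,gamma", [](const std::string& t) { std::cout << t << '\n'; });
    return 0;
}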
Real programmers use butterflies