But it feels so _diiirtyyy_!
-
So I have this method (C#, .NET), let's call it F(i), where i is an integer representing a UTF-16 character. The method determines which one of the following classes the character is a member of:
Control (ASCII control characters)
Delimiter (the caller can specify which characters are delimiters)
EOF (-1)
Escape (\)
Non-ASCII (i > 127)
Normal (ASCII characters which are not members of the other classes)
Quote (")
This is implemented as an array look-up, with a catch for IndexOutOfRangeException which will fire for EOF and non-ASCII characters. This has been working well for a while. The data (JSON files mostly, but not exclusively) is nearly all ASCII characters with only an occasional non-ASCII character -- maybe a few "smart quotes" or similar, which are OK; in many cases I replace those with their ASCII versions anyway.
BUT once in a while we receive a corrupt file which (in the latest case) includes a JSON value containing more than a million non-ASCII characters (in the file they are encoded as three-byte UTF-8). F(i) was not performing well in this case. Apparently having the catch fire occasionally is OK, but having it fire a million times in rapid succession is decidedly not.
Once I tracked the issue to F(i), I could try altering it to add a test for i > 127 and avoid the exception (which I am loath to do on principle). But unit testing did show that it improved the performance considerably for the non-ASCII characters without significantly hindering the performance of ASCII characters (EOF is still handled by a catch). That sounds like a win, except... I just don't like having the extra test, which is essentially needless given that we don't expect any/many non-ASCII characters in most files we receive.
Sooo... I named the original version Fa(i) and the new version Fn(i), and I made F(i) a delegate which starts out pointing to Fa(i), but:
If Fa(i) encounters a non-ASCII character, it re-points F(i) to Fn(i)
If Fn(i) encounters an ASCII character, it re-points F(i) to Fa(i)
Slick as snot. Unit testing shows good performance.
Time required to read the million non-ASCII characters with Fa == 12 seconds
Time required to re...
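In outline, the arrangement looks something like this (a sketch only; the names, class values, and table contents are illustrative, not the actual code):
using System;

[Flags]
public enum CharClass
{
    Normal    = 0x01,
    Control   = 0x02,
    Delimiter = 0x04,
    Escape    = 0x08,
    Quote     = 0x10,
    NonAscii  = 0x20,
    Eof       = 0x40
}

public sealed class Classifier
{
    private readonly CharClass[] table = new CharClass[128];

    // F is the delegate callers go through; it starts out pointing at Fa.
    private Func<int, CharClass> F;

    public Classifier()
    {
        for (int c = 0; c < 128; c++)
            table[c] = (c < 32 || c == 127) ? CharClass.Control : CharClass.Normal;
        table['"']  = CharClass.Quote;
        table['\\'] = CharClass.Escape;
        table[',']  = CharClass.Delimiter;  // delimiters are caller-specified in reality
        F = Fa;
    }

    public CharClass Classify(int i) => F(i);

    // Look-up-only version: EOF (-1) and non-ASCII land in the catch;
    // a non-ASCII character re-points F at Fn.
    private CharClass Fa(int i)
    {
        try
        {
            return table[i];
        }
        catch (IndexOutOfRangeException)
        {
            if (i < 0)
                return CharClass.Eof;
            F = Fn;   // expecting more non-ASCII: switch to the range-tested version
            return CharClass.NonAscii;
        }
    }

    // Range-tested version: cheap during a run of non-ASCII characters;
    // an ASCII character re-points F back at Fa (EOF is still handled by Fa's catch).
    private CharClass Fn(int i)
    {
        if (i > 127)
            return CharClass.NonAscii;
        F = Fa;       // back to plain look-ups
        return Fa(i);
    }
}
Callers only ever see F, so the swapping is invisible to them.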
-
Have you looked at the "standard" implementation of the isdigit, isalpha, etc. C functions? It uses a 128-character table in which different bits represent character classes. Something like:
tab['a'] = CLASS_ALPHA | CLASS_HEXDIGIT | CLASS_LOWERCASE;
For sure, you need to range-limit the input. Just a thought. Disclaimer: I haven't looked at whether recent C runtime library implementations still use that "standard" implementation.
Mircea
-
Of course, but the built-in methods of System.Char (.NET) don't suit my needs, so I rolled my own, as is my wont. I actually have a number of places where I have to roll my own character (or byte) classing solution to meet my requirements. The "delimiter" class in particular depends on the type of file being read -- for JSON, the delimiters are: , ; for CSV they are: , \r \n.
-
I was talking about the method, not the functions themselves. The basic idea is that you can assign up to 8 distinct classes to each character. Fleshing out my idea a bit more, something like:
tab['{'] = CLASS_DELIM_JSON;
tab[','] = CLASS_DELIM_JSON | CLASS_DELIM_CSV;
The table would of course be statically allocated and initialized:
bool is_a (int x, char what)
{
    static char tab[128] = { /* blah, blah: combinations of class bits */ };
    // preliminary tests go here
    return (tab[(char)x] & what) != 0;
}
Mircea
-
Yes, and that's what I'm doing here -- but C#-ishly, using an enum to define the constants and the map.
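Something like this, roughly (the names here are invented for the example, and the real map has more classes and entries):
using System;

[Flags]
enum CharKind : byte
{
    None      = 0,
    Alpha     = 1 << 0,
    HexDigit  = 1 << 1,
    Lowercase = 1 << 2,
    DelimJson = 1 << 3,
    DelimCsv  = 1 << 4
}

static class CharKinds
{
    // 128-entry map: one set of class bits per ASCII character.
    static readonly CharKind[] tab = BuildTable();

    static CharKind[] BuildTable()
    {
        var t = new CharKind[128];
        t['{']  = CharKind.DelimJson;
        t[',']  = CharKind.DelimJson | CharKind.DelimCsv;
        t['\r'] = CharKind.DelimCsv;
        t['\n'] = CharKind.DelimCsv;
        t['a']  = CharKind.Alpha | CharKind.HexDigit | CharKind.Lowercase;
        // ...and so on for the remaining entries
        return t;
    }

    public static bool IsA(int x, CharKind what)
    {
        // Range-limit first: EOF (-1) and non-ASCII fall outside the table.
        return x >= 0 && x < 128 && (tab[x] & what) != 0;
    }
}
IsA(c, CharKind.DelimCsv) then answers "is this a CSV delimiter?" in a single masked look-up.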
-
PIEBALDconsult wrote:
BUT once in a while we receive a corrupt file...but firing a million times in rapid succession is decidedly not
Why does it continue after the first one? If firing once means it is corrupt, and you then expect the rest to be corrupt, why not stop? Or, say, stop after 10 or so?
PIEBALDconsult wrote:
to add a test for i > 127 and avoid the exception (which I am loath to do on principle).
Not sure I understand that statement. Unicode does not fit that range, and you are getting Unicode. So testing for it, rather than assuming it is correct, is what should be happening.
-
The reader has no clue whether what it's reading is corrupt or not; that's determined at a higher level. The reader's job is to simply read the characters and return them.
jschell wrote:
Or say stop after 10 or so?
That's the issue I was running into: I had it throw an Exception after ten seconds, and I needed to keep that from happening.
jschell wrote:
you are getting unicode
Yes, but the reader was taking too long to read the Unicode characters which are outside the ASCII range. It now reads them almost as quickly as Unicode characters which are within the ASCII range. But to do so, it's flipping between two implementations of a method -- one for ASCII and another for non-ASCII -- depending on which characters it encounters.
-
PIEBALDconsult wrote:
what it's reading is corrupt or not; that's determined at a higher level.
That seems idealistic or odd. If it is or was throwing an exception, then it seems like it did know it was corrupted. Obviously the caller continued, since you said it was failing a million times. Was the caller getting usable information? If so, then the architecture indicates that it is not, in general, corrupted, but rather that some of the data is not usable, and that is how it should be treated. Conversely, if I were seeing a million actual failures, I would question the need to attempt to retrieve 'valid' data. Corrupted generally would refer to some random process. That random process might create a character that passes your checks (regardless of how you check) but still represents bad data. As a matter of fact, the more data like this that exists, the more likely that becomes. So "millions" would suggest that some other bad data would end up being accepted as good.
-
jschell wrote:
seems like it did know it was corrupted.
No, the code had a bug whereby it "took too long" to read a value/token. I have fixed the bug, so now the value/token gets read and returned, and then the caller can determine whether or not the value is reasonable.
jschell wrote:
failing a million times
No, not failing a million times, failing once.
jschell wrote:
the architecture indicates that it is not in general corrupted
In general, yes, non-ASCII characters cannot be considered invalid by the reader; only the caller can determine that. A large text value may contain a few non-ASCII characters, such as "smart quotes", and that's fine. The issue was that a value which contains many non-ASCII characters took too long to read -- it also happens that the value was corrupt in this case.
jschell wrote:
Corrupted generally would refer to some random process.
Yes, and we still don't know what is causing the corruption.
jschell wrote:
random process might create a character that passes your checks (regardless of how you check) but still represent bad data
As is always the case, and again the reader can't determine that.
jschell wrote:
some other bad data would end up being accepted as good.
As is always the case, and again the reader can't determine that. In this particular case of corruption, the process will fail when it tries to stuff a million-plus characters into a database column which allows only two-hundred. But yes, even now there may be a file on the way which is corrupt in such a way that the bad value fits and the load won't fail. The reader won't care, the loader won't care, but some other part of the process will (probably) freak out.
-
PIEBALDconsult wrote:
No, not failing a million times, failing once.
You did say the following in the OP, which seemed to me to suggest that it would have failed a million times.
"Apparently having the catch fire occasionally is OK, but firing a million times in rapid succession is decidedly not."
PIEBALDconsult wrote:
In general, yes, non-ASCII characters cannot be considered invalid by the reader, only the caller can determine that
Back to being too idealistic, to me, in terms of your original point of how the code should handle it. The method is only part of this process. It is not part of a multi-use library (presumably). And the problem being solved does involve characters that are not ASCII. So there is nothing wrong with this method specifically dealing with that by using the 'if' solution. You could, idealistically, make the caller deal with it before calling the method in the first place. To me, even spending time on that consideration seems like overkill; it would not be worth my time.
-
Having the catch fire is not a failure.
jschell wrote:
It is not part of a multi-use library
Yes, it is, of course it is.
jschell wrote:
deal with it before calling the method
Uh, what? You can't handle characters before they've been read.