Code Project
But it feels so _diiirtyyy_!

The Weird and The Wonderful
Tags: csharp, data-structures, testing, beta-testing, json
11 Posts · 3 Posters · 58 Views
PIEBALDconsult
#1

So I have this method (C#, .NET); let's call it F(i), where i is an integer representing a UTF-16 character. The method determines which one of the following classes the character is a member of:
  • Control (ASCII control characters)
  • Delimiter (the caller can specify which characters are delimiters)
  • EOF (-1)
  • Escape (\)
  • Non-ASCII (i > 127)
  • Normal (ASCII characters which are not members of another class)
  • Quote (")

This is implemented as an array look-up with a catch for IndexOutOfRangeException, which fires for EOF and for non-ASCII characters. This has been working well for a while. The data (JSON files mostly, but not exclusively) is nearly all ASCII, with only an occasional non-ASCII character -- maybe a few "smart quotes" or similar, which are OK; in many cases I replace those with their ASCII versions anyway.

BUT once in a while we receive a corrupt file which (in the latest case) includes a JSON value containing more than a million non-ASCII characters (in the file they are encoded as three-byte UTF-8). F(i) was not performing well in this case. Apparently having the catch fire occasionally is OK, but firing a million times in rapid succession is decidedly not.

Once I tracked the issue to F(i), I could try altering it to add a test for i > 127 and avoid the exception (which I am loath to do on principle). Unit testing did show that it improved performance considerably for the non-ASCII characters without significantly hindering performance for ASCII characters (EOF is still handled by a catch). That sounds like a win, except... I just don't like having the extra test, which is essentially needless given that we don't expect many non-ASCII characters in most files we receive. Sooo...

I named the original version Fa(i) and the new version Fn(i), and I made F(i) a delegate which starts out pointing to Fa(i), but:
  • If Fa(i) encounters a non-ASCII character, it re-points F(i) to Fn(i)
  • If Fn(i) encounters an ASCII character, it re-points F(i) to Fa(i)

Slick as snot. Unit testing shows good performance.
  • Time required to read the million non-ASCII characters with Fa == 12 seconds
  • Time required to re
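The self-re-pointing trick above can be sketched in C with a function pointer standing in for the C# delegate. This is a minimal illustration, not the actual implementation: all names are made up, the Delimiter class is omitted, and EOF is handled with a test rather than a catch (C has no exceptions).

```c
/* Character classes returned by the classifier (illustrative names). */
enum char_class { CC_NORMAL, CC_CONTROL, CC_QUOTE, CC_ESCAPE, CC_EOF, CC_NONASCII };

static enum char_class classify_ascii(int i);     /* Fa */
static enum char_class classify_nonascii(int i);  /* Fn */

/* F: starts out pointing at the ASCII-optimized version. */
static enum char_class (*classify)(int) = classify_ascii;

/* Fa: fast path that assumes mostly-ASCII input; on a non-ASCII
   character it re-points F at the non-ASCII version. */
static enum char_class classify_ascii(int i)
{
    if (i < 0)
        return CC_EOF;
    if (i > 127) {
        classify = classify_nonascii;   /* switch to the non-ASCII version */
        return CC_NONASCII;
    }
    if (i < 32)    return CC_CONTROL;
    if (i == '"')  return CC_QUOTE;
    if (i == '\\') return CC_ESCAPE;
    return CC_NORMAL;
}

/* Fn: cheap path while inside a run of non-ASCII characters;
   on an ASCII character it re-points F back and delegates to Fa. */
static enum char_class classify_nonascii(int i)
{
    if (i > 127)
        return CC_NONASCII;
    classify = classify_ascii;          /* back to the ASCII-optimized version */
    return classify_ascii(i);
}
```

A long run of non-ASCII input thus pays only one comparison per character, and a single misprediction-style penalty (the swap) at each boundary between ASCII and non-ASCII runs.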


Mircea Neacsu
#2

Have you looked at the "standard" implementation of the isdigit, isalpha, etc. C functions? It uses a 128-character table in which different bits represent character classes. Something like:

      tab['a'] = CLASS_ALPHA | CLASS_HEXDIGIT | CLASS_LOWERCASE;

For sure, you need to range-limit the input. Just a thought. Disclaimer: I haven't checked whether recent C runtime library implementations still use that "standard" approach.
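A runnable sketch of that bitmask-table technique, with made-up class names and only a handful of entries filled in (the real CRT tables are generated, not hand-written):

```c
/* Each ASCII character gets one byte; each bit marks membership
   in one class, so a character can belong to up to 8 classes. */
#define CLASS_ALPHA     0x01
#define CLASS_HEXDIGIT  0x02
#define CLASS_LOWERCASE 0x04
#define CLASS_DIGIT     0x08

static unsigned char tab[128];   /* zero-initialized: no classes */

/* Populate the table (a real implementation would use a static initializer). */
static void init_tab(void)
{
    for (int c = '0'; c <= '9'; c++) tab[c] = CLASS_DIGIT | CLASS_HEXDIGIT;
    for (int c = 'a'; c <= 'z'; c++) tab[c] = CLASS_ALPHA | CLASS_LOWERCASE;
    for (int c = 'A'; c <= 'Z'; c++) tab[c] = CLASS_ALPHA;
    for (int c = 'a'; c <= 'f'; c++) tab[c] |= CLASS_HEXDIGIT;
    for (int c = 'A'; c <= 'F'; c++) tab[c] |= CLASS_HEXDIGIT;
}

/* Membership test: range-limit the input before indexing, as noted above. */
static int is_a(int x, unsigned char what)
{
    if (x < 0 || x > 127)
        return 0;
    return (tab[x] & what) != 0;
}
```

One look-up answers any combination of class queries, which is why the classic ctype implementations are so cheap.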

      Mircea


PIEBALDconsult
#3

Of course, but the built-in methods of System.Char (.NET) don't suit my needs, so I rolled my own, as is my wont. I actually have a number of places where I have to roll my own character (or byte) classing solution to meet my requirements. The "delimiter" class in particular depends on the type of file being read -- for JSON, the delimiters are : and , ; for CSV they are , \r and \n.


Mircea Neacsu
#4

I was talking about the method, not the functions themselves. The basic idea is that you can assign up to 8 distinct classes to each character. Fleshing out my idea a bit more, something like:

          tab['{'] = CLASS_DELIM_JSON;
          tab[','] = CLASS_DELIM_JSON | CLASS_DELIM_CSV;

          The table would of course be statically allocated and initialized:

bool is_a (int x, char what)
{
  static const unsigned char tab[128] = { /* blah, blah combinations of class bits */ };
  if (x < 0 || x > 127)   // the "preliminary test": range-limit before indexing
    return false;
  return (tab[x] & what) != 0;
}

          Mircea


PIEBALDconsult
#5

            Yes, and that's what I'm doing here -- but C#ishly, using an enum to define the constants and the map.


jschell
#6

              PIEBALDconsult wrote:

              BUT once in a while we receive a corrupt file...but firing a million times in rapid succession is decidedly not

              Why does it continue after the first one? If firing once means it is corrupt and then you expect the rest to be corrupt then why not stop? Or say stop after 10 or so?

              PIEBALDconsult wrote:

              to add a test for i > 127 and avoid the exception (which I am loathe to do on principle).

Not sure I understand that statement. Unicode does not fit that range, and you are getting Unicode. So testing for it, rather than assuming it is correct, is what should be happening.


PIEBALDconsult
#7

The reader has no clue whether what it's reading is corrupt or not; that's determined at a higher level. The reader's job is simply to read the characters and return them.

                jschell wrote:

                Or say stop after 10 or so?

That's the issue I was running into: I had it throw an Exception after ten seconds, and I needed to keep that from happening.

                jschell wrote:

                you are getting unicode

Yes, but the reader was taking too long to read the Unicode characters which are outside the ASCII range. It now reads them almost as quickly as Unicode characters which are within the ASCII range. But to do so, it flips between two implementations of a method -- one for ASCII and another for non-ASCII -- depending on which characters it encounters.


jschell
#8

                  PIEBALDconsult wrote:

                  what it's reading is corrupt or not, that's determined at a higher level.

That seems idealistic, or odd. If it is or was throwing an exception, then it seems like it did know it was corrupted. Obviously the caller continued, since you said it was failing a million times. Was the caller getting usable information? If so, then the architecture indicates that it is not, in general, corrupted, but rather that some of the data is not usable, and that is how it should be treated.

Conversely, if I were seeing a million actual failures, I would question the need to attempt to retrieve 'valid' data. Corrupted generally refers to some random process. That random process might create a character that passes your checks (regardless of how you check) but still represents bad data. As a matter of fact, the more data like this that exists, the more likely that becomes. So "millions" would suggest that some other bad data would end up being accepted as good.


PIEBALDconsult
#9

                    jschell wrote:

                    seems like it did know it was corrupted.

                    No, the code had a bug whereby it "took too long" to read a value/token. I have fixed the bug so now the value/token gets read and returned and then the caller can determine whether or not the value is reasonable.

                    jschell wrote:

                    failing a million times

                    No, not failing a million times, failing once.

                    jschell wrote:

                    the architecture indicates that it is not in general corrupted

In general, yes, non-ASCII characters cannot be considered invalid by the reader; only the caller can determine that. A large text value may contain a few non-ASCII characters, such as "smart quotes", and that's fine. The issue was that a value which contains many non-ASCII characters took too long to read -- it also happens that the value was corrupt in this case.

                    jschell wrote:

                    Corrupted generally would refer to some random process.

                    Yes, and we still don't know what is causing the corruption.

                    jschell wrote:

                    random process might create a character that passes your checks (regardless of how you check) but still represent bad data

                    As is always the case, and again the reader can't determine that.

                    jschell wrote:

                    some other bad data would end up being accepted as good.

                    As is always the case, and again the reader can't determine that. In this particular case of corruption, the process will fail when it tries to stuff a million-plus characters into a database column which allows only two-hundred. But yes, even now there may be a file on the way which is corrupt in such a way that the bad value fits and the load won't fail. The reader won't care, the loader won't care, but some other part of the process will (probably) freak out.


jschell
#10

                      PIEBALDconsult wrote:

                      No, not failing a million times, failing once.

                      You did say in the OP the following which seemed to me to suggest that it would have failed a million times.

                      "Apparently having the catch fire occasionally is OK, but firing a million times in rapid succession is decidedly not."

                      PIEBALDconsult wrote:

                      In general, yes, non-ASCII characters cannot be considered invalid by the reader, only the caller can determine that

Back to being too idealistic, to me, in terms of your original point of how the code should handle it. The method is only part of this process; it is not part of a multi-use library (presumably), and the problem being solved does involve characters that are not ASCII. So there's nothing wrong with this method specifically dealing with that by using the 'if' solution. You could, idealistically, make the caller deal with it before calling the method in the first place, but to me even spending time on that consideration seems like overkill. It would not be worth my time to consider.

                      P 1 Reply Last reply
                      0
                      • J jschell

                        PIEBALDconsult wrote:

                        No, not failing a million times, failing once.

                        You did say in the OP the following which seemed to me to suggest that it would have failed a million times.

                        "Apparently having the catch fire occasionally is OK, but firing a million times in rapid succession is decidedly not."

                        PIEBALDconsult wrote:

                        In general, yes, non-ASCII characters cannot be considered invalid by the reader, only the caller can determine that

                        Back to being too idealistic to me in terms of your original point of how the code should handle it. The method is only part of this process. It is not part of a multi-use library (presumably.) And the problem being solved does involve characters that are not ascii. So nothing wrong with this method specifically dealing with that by using the 'if' solution. You could, idealistically, make the caller deal with it before calling the method in the first place. To me even spending time on that consideration seems like overkill. It would not be worth my time for consideration.

PIEBALDconsult
#11

                        Having the catch fire is not a failure.

                        jschell wrote:

                        It is not part of a multi-use library

                        Yes, it is, of course it is.

                        jschell wrote:

                        deal with it before calling the method

                        Uh, what? You can't handle characters before they've been read.
