WTF-8

    PIEBALDconsult
    wrote on last edited by
    #1

    Reviewing a CSV file today (which already has inconsistent quotes and such), I noticed it has a few characters (e.g. non-breaking space) encoded as UTF-8 -- fine, no big deal. But they still look odd after decoding... ah, they're encoded as UTF-8 twice... double-UTF-8. :doh: So now I have to write a recursive UTF-8 decoder... :sigh: Why doesn't .net simply do that to begin with? :mad: <<== That's a rhetorical question. It'll be breakfast at Milliways again.

    Edit: 4/20 -- I have a working recursive UTF-8 decoding algorithm, in a custom decoder for a custom encoding (derived from the built-in UTF-8 encoding, so encoding should be as per normal). What was unexpected was that the GetString method of the encoding didn't call the custom Decoder. I just had a look at the reference source for GetString and I see:

        // Returns a string containing the decoded representation of a range of
        // bytes in a byte array.
        //
        // Internally we override this for performance
        //
        [Pure]
        public virtual String GetString(byte[] bytes, int index, int count)
        {
            return new String(GetChars(bytes, index, count));
        }
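
The observation that GetString bypasses a custom Decoder can be checked with a small probe. The sketch below is only an illustration: `ProbeEncoding` and `RoutedEncoding` are hypothetical names, not the classes from this post. The first counts calls to `GetDecoder()`; the second overrides `GetString` so that decoding goes through whatever `GetDecoder()` returns. If the observation holds, the first count stays at 0.

```csharp
using System;
using System.Text;

// Hypothetical probe: counts how often GetDecoder() is requested.
public class ProbeEncoding : UTF8Encoding
{
    public int GetDecoderCalls { get; private set; }

    public override Decoder GetDecoder()
    {
        GetDecoderCalls++;
        return base.GetDecoder();
    }
}

// Hypothetical fix: route GetString through the decoder explicitly.
public class RoutedEncoding : ProbeEncoding
{
    public override string GetString(byte[] bytes, int index, int count)
    {
        Decoder decoder = GetDecoder();
        char[] chars = new char[GetMaxCharCount(count)];
        // flush: true, because the whole input is supplied in one call.
        int written = decoder.GetChars(bytes, index, count, chars, 0, true);
        return new string(chars, 0, written);
    }
}

public static class Program
{
    public static void Main()
    {
        byte[] bytes = { 0x41, 0xC2, 0xA0, 0x42 }; // "A", non-breaking space, "B"

        var probe = new ProbeEncoding();
        probe.GetString(bytes, 0, bytes.Length);
        Console.WriteLine("GetDecoder calls without override: " + probe.GetDecoderCalls);

        var routed = new RoutedEncoding();
        routed.GetString(bytes, 0, bytes.Length);
        Console.WriteLine("GetDecoder calls with override: " + routed.GetDecoderCalls);
    }
}
```

A custom encoding that wants GetString to honour its decoder would need an override along these lines; overriding `GetDecoder()` alone does not appear to be enough.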
    

    Does that mean that it doesn't actually use my decoder? Shouldn't it call GetDecoder() and use that decoder? (I'm not experienced at reading the reference source.) I'll get back to it on Monday.

    Edit: 4/21 -- Reading some more about UTF-8 on the 'pedia, I see:

    The Unicode Standard requires decoders to
    "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

    and

    The standard also recommends replacing each error with the replacement character "�" (U+FFFD).

    Which I choose not to do...

    These recommendations are not often followed.

    But it makes me think that the few U+FFFD characters I see in the file may have begun as unencoded characters which were errantly read with a UTF-8 decoder. Which means that the file I have is in even worse condition than I thought. Anyway, my current decoder is quite permissive in what it accepts -- preferring not to throw exceptions, but rather pass any errant bytes along to the caller. I will likely alter that next week.

    Edit: 4/22 -- A rough logic diagram of my algorithm.

         ----------------
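
The U+FFFD hypothesis above is easy to reproduce. This sketch assumes the stray characters started as Latin-1 bytes; `ReplacementDemo` and its two helpers are hypothetical names. It shows a lone 0xA0 (a non-breaking space in Latin-1) turning into U+FFFD under the default `Encoding.UTF8`, and the same bytes being rejected by a `UTF8Encoding` constructed with `throwOnInvalidBytes: true`.

```csharp
using System;
using System.Text;

public static class ReplacementDemo
{
    // Default behaviour: ill-formed bytes are replaced with U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict behaviour: report failure instead of substituting.
    public static bool TryDecodeStrict(byte[] bytes, out string text)
    {
        try
        {
            text = new UTF8Encoding(false, true).GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }

    public static void Main()
    {
        // 0xA0 is a lone continuation byte in UTF-8, so it is ill-formed here.
        byte[] latin1Bytes = { 0x41, 0xA0, 0x42 };

        string lenient = DecodeLenient(latin1Bytes);
        Console.WriteLine(lenient == "A\uFFFDB"); // True

        // Once written back out, the original byte is gone for good.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(lenient))); // 41-EF-BF-BD-42

        Console.WriteLine(TryDecodeStrict(latin1Bytes, out _)); // False
    }
}
```

Once the replacement character has been written back to a file, the original byte cannot be recovered, which fits the "even worse condition" reading.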
    
      Peter_in_2780
      wrote on last edited by
      #2

      I got a phishing email the other day. Embedded HTML to imitate a Micro$oft login form (posting credentials to evil.org of course). Inline base-64 encoded, easy peasy. %-encoded inside that. One and a half times.... The outer decode works, but still got %'s in there. Decode again and *barf*, it's broken (but only in some places). Wasn't game to see what a browser would make of it. Given browsers' general tolerance of coding errors, I suspect it might just have worked.

      Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
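
For comparison, the layered decoding described here can be sketched as "unescape until nothing changes". This is only an illustration with a made-up payload; `UnescapeUntilStable` is a hypothetical helper, and the real email's contents are not reproduced.

```csharp
using System;
using System.Text;

public static class LayeredEscapes
{
    // Hypothetical helper: percent-decode repeatedly until the text stops changing.
    public static string UnescapeUntilStable(string text, int maxPasses = 5)
    {
        for (int pass = 0; pass < maxPasses; pass++)
        {
            string next = Uri.UnescapeDataString(text);
            if (next == text) break;
            text = next;
        }
        return text;
    }

    public static void Main()
    {
        // Build a sample: "<form>" percent-encoded twice, then base64-encoded.
        string twiceEscaped = Uri.EscapeDataString(Uri.EscapeDataString("<form>"));
        string payload = Convert.ToBase64String(Encoding.ASCII.GetBytes(twiceEscaped));

        // Outer layer: base64.
        string outer = Encoding.ASCII.GetString(Convert.FromBase64String(payload));
        Console.WriteLine(outer);                      // %253Cform%253E

        // Inner layers: percent-encoding, however many there are.
        Console.WriteLine(UnescapeUntilStable(outer)); // <form>
    }
}
```

Input that is only partly double-encoded is where this goes wrong: a pass that repairs one region can mangle a literal such as `%41` in another, which matches the "broken, but only in some places" result.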

        honey the codewitch
        wrote on last edited by
        #3

        .NET doesn't do that to begin with because it wouldn't make any sense. The problem is your CSV has bad encoding. What you wrote is a workaround for a poorly encoded file. That's not .NET's business, and frankly, if it did that, it would be a Bad Thing(TM)

        Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

          PIEBALDconsult
          wrote on last edited by
          #4

          Oh, I know. Yet why not have an option? As when getting a directory listing, you can specify TopDirectoryOnly or AllDirectories. Not that they would have expected this to be needed twenty years ago. What I have so far will deal with double-encoding, but I can imagine the rabbit hole going deeper. All the way to the turtles I'm sure. Anyway, it's a good exercise.
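
A minimal sketch of what such an opt-in could look like, assuming the extra layer came from UTF-8 bytes being misread as Latin-1 and then encoded again. The `DecodeDepth` enum and `LayeredUtf8.Decode` are hypothetical names modeled on `SearchOption`. This is not the decoder described in the first post: it works on the whole buffer at once, so it will not repair input that mixes singly and doubly encoded text.

```csharp
using System;
using System.Text;

// Hypothetical option, modeled loosely on System.IO.SearchOption.
public enum DecodeDepth { Once, AllLayers }

public static class LayeredUtf8
{
    // Strict UTF-8: no BOM, throw on ill-formed input rather than emit U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    public static string Decode(byte[] bytes, DecodeDepth depth)
    {
        string text = Strict.GetString(bytes); // throws if the outer layer is ill-formed
        if (depth == DecodeDepth.Once) return text;

        const int MaxExtraPasses = 8; // arbitrary guard against runaway input
        for (int pass = 0; pass < MaxExtraPasses; pass++)
        {
            if (!TryToLatin1Bytes(text, out byte[] candidate)) break;
            try { text = Strict.GetString(candidate); }
            catch (DecoderFallbackException) { break; } // not UTF-8 underneath: stop
        }
        return text;
    }

    // Maps each char back to one byte. Fails if any char is above U+00FF,
    // or if the text is pure ASCII and there is nothing left to undo.
    private static bool TryToLatin1Bytes(string text, out byte[] bytes)
    {
        bytes = new byte[text.Length];
        bool sawNonAscii = false;
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] > 0xFF) return false;
            if (text[i] > 0x7F) sawNonAscii = true;
            bytes[i] = (byte)text[i];
        }
        return sawNonAscii;
    }

    public static void Main()
    {
        // "A", doubly encoded non-breaking space, "B".
        byte[] twice = { 0x41, 0xC3, 0x82, 0xC2, 0xA0, 0x42 };
        Console.WriteLine(Decode(twice, DecodeDepth.Once) == "A\u00C2\u00A0B"); // True
        Console.WriteLine(Decode(twice, DecodeDepth.AllLayers) == "A\u00A0B");  // True
    }
}
```

Because the caller has to ask for `AllLayers`, ordinary decoding is unchanged. The cost is false positives: text that legitimately contains a sequence like "Â" followed by a non-breaking space would be collapsed too.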

            honey the codewitch
            wrote on last edited by
            #5

            Because in the decades that UTF-8 has been available, you're the second person to need the feature.

              honey the codewitch
              wrote on last edited by
              #6

              Adding, there's another issue. What if your intent was to embed control characters into UTF-8? .NET cannot do this for you without breaking the UTF-8 spec.

                PIEBALDconsult
                wrote on last edited by
                #7

                Probably the first. The other guy on the team hadn't noticed the issue.

                  PIEBALDconsult
                  wrote on last edited by
                  #8

                  For instance?

                  honey the codewitch wrote:

                  .NET cannot do this for you

                  Bet it can. :cool:

                    honey the codewitch
                    wrote on last edited by
                    #9

                    Yeah, Microsoft could break UTF-8 to make you happy and make everyone else mad. And make .NET broken. I'll get back to you when someone besides you thinks this is a good idea.

                      PIEBALDconsult
                      wrote on last edited by
                      #10

                      But seriously, what are you saying it can't do?

                        honey the codewitch
                        wrote on last edited by
                        #11

                        I'm saying they can't recursively decode UTF-8 without breaking the spec. Edit: I feel like I'm peeing in your Wheaties, but that's not my intent. I'm just saying it's not .NET's place to satisfy your requirement. You could write a NuGet package for it, but it's completely non-standard behavior and would break the spec + potentially break other code.

                          PIEBALDconsult
                          wrote on last edited by
                          #12

                          honey the codewitch wrote:

                          they can't recursively decode UTF-8 without breaking the spec.

                          I don't see how you arrive at that conclusion.

                          honey the codewitch wrote:

                          not .NET's place to satisfy your requirement

                          I agree.

                          honey the codewitch wrote:

                          would break the spec

                          In what way exactly? Particularly if the caller has control over whether or not it does. But you mentioned something about writing control characters in UTF-8 -- which include carriage-return, line-feed, form-feed, etc. -- so I don't understand what you meant by saying it would break UTF-8. Whatever situation you are trying to communicate, I am sure .net can do it already, and it doesn't "break UTF-8".
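
As a quick check of the control-character point, the sketch below round-trips a few C0 controls through a strict UTF-8 encoding. `ControlCharsDemo` is a hypothetical name. It only covers the characters named above and does not try to guess what other situation was meant.

```csharp
using System;
using System.Text;

public static class ControlCharsDemo
{
    // Strict UTF-8: no BOM, throw on anything ill-formed.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    public static byte[] Encode(string text)
    {
        return Strict.GetBytes(text);
    }

    public static string Decode(byte[] bytes)
    {
        return Strict.GetString(bytes);
    }

    public static void Main()
    {
        // Carriage return, line feed, form feed, NUL.
        string controls = "\r\n\f\0";

        byte[] bytes = Encode(controls);
        Console.WriteLine(BitConverter.ToString(bytes)); // 0D-0A-0C-00
        Console.WriteLine(Decode(bytes) == controls);    // True
    }
}
```

Each control character is a single byte below 0x80, the same as in ASCII, so even the strict decoder accepts it without complaint.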
