WTF-8
-
Reviewing a CSV file today (which already has inconsistent quotes and such), I noticed it has a few characters (e.g. non-breaking space) encoded as UTF-8 -- fine, no big deal. But they still look odd after decoding... ah, they're encoded as UTF-8 twice... double-UTF-8. :doh: So now I have to write a recursive UTF-8 decoder... :sigh: Why doesn't .net simply do that to begin with? :mad: <<== That's a rhetorical question. It'll be breakfast at Milliways again.

Edit: 4/20 -- I have a working recursive UTF-8 decoding algorithm, in a custom decoder for a custom encoding (derived from the built-in UTF-8 encoding, so encoding should be as per normal). What was unexpected was that the GetString method of the encoding didn't call the custom Decoder. I just had a look at the reference source for GetString and I see:

    // Returns a string containing the decoded representation of a range of
    // bytes in a byte array.
    //
    // Internally we override this for performance
    //
    [Pure]
    public virtual String GetString(byte[] bytes, int index, int count)
    {
        return new String(GetChars(bytes, index, count));
    }

Does that mean that it doesn't actually use my decoder? Shouldn't it call GetDecoder() and use that decoder? (I'm not experienced at reading the reference source.) I'll get back to it on Monday.

Edit: 4/21 -- Reading some more about UTF-8 on the 'pedia, I see:

The Unicode Standard requires decoders to "... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

and

The standard also recommends replacing each error with the replacement character "�" (U+FFFD).

Which I choose not to do...

These recommendations are not often followed.

But it makes me think that the few U+FFFD characters I see in the file may have begun as unencoded characters which were errantly read with a UTF-8 decoder. Which means that the file I have is in even worse condition than I thought. Anyway, my current decoder is quite permissive in what it accepts -- preferring not to throw exceptions, but rather pass any errant bytes along to the caller. I will likely alter that next week.

Edit: 4/22 -- A rough logic diagram of my algorithm.
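On the GetString question in the 4/20 edit: the quoted base implementation goes through GetChars, and the comment hints that the built-in encodings override GetString themselves, so neither path asks GetDecoder() for anything. A small probe can check this on a given runtime. The sketch below is only an illustration: `ProbeUtf8Encoding` and `ProbeDemo` are hypothetical names, not anything from the post, and the probe merely counts calls.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical probe (not the custom encoding from the post):
// counts how often GetDecoder() is requested.
public class ProbeUtf8Encoding : UTF8Encoding
{
    public int GetDecoderCalls { get; private set; }

    public override Decoder GetDecoder()
    {
        GetDecoderCalls++;
        return base.GetDecoder();
    }
}

public static class ProbeDemo
{
    // Returns the GetDecoder() call count after GetString,
    // and again after reading the same bytes through a StreamReader.
    public static (int afterGetString, int afterStreamReader) Run()
    {
        byte[] bytes = { 0x61, 0xC2, 0xA0, 0x62 }; // "a", U+00A0, "b" in UTF-8
        var enc = new ProbeUtf8Encoding();

        enc.GetString(bytes, 0, bytes.Length);
        int afterGetString = enc.GetDecoderCalls;

        using (var reader = new StreamReader(new MemoryStream(bytes), enc))
        {
            reader.ReadToEnd();
        }

        return (afterGetString, enc.GetDecoderCalls);
    }
}
```

I would expect the first count to stay at zero and the second to be at least one, since StreamReader is the kind of caller that asks the encoding for a Decoder. If that holds, a derived encoding that wants GetString to behave differently probably needs to override GetString (or GetChars and GetCharCount) as well as GetDecoder.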
-
I got a phishing email the other day. Embedded HTML to imitate a Micro$oft login form (posting credentials to evil.org of course). Inline base-64 encoded, easy peasy. %-encoded inside that. One and a half times.... The outer decode works, but still got %'s in there. Decode again and *barf*, it's broken (but only in some places). Wasn't game to see what a browser would make of it. Given browsers' general tolerance of coding errors, I suspect it might just have worked.
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
-
.NET doesn't do that to begin with because it wouldn't make any sense. The problem is your CSV has bad encoding. What you wrote is a workaround for a poorly encoded file. That's not .NET's business, and frankly, if it did that, it would be a Bad Thing(TM)
Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix
-
Oh, I know. Yet why not have an option? As when getting a directory listing, you can specify TopDirectoryOnly or AllDirectories. Not that they would have expected this to be needed twenty years ago. What I have so far will deal with double-encoding, but I can imagine the rabbit hole going deeper. All the way to the turtles I'm sure. Anyway, it's a good exercise.
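For anyone wondering what such an opt-in might look like, here is a minimal sketch. It is not the algorithm from the original post (which was never shown); `DoubleUtf8`, `Repair` and `maxPasses` are made-up names. It assumes the extra layer came from UTF-8 bytes being misread as ISO-8859-1 (each byte becoming the character with the same code point) and then encoded as UTF-8 again. If the misreading was Windows-1252 instead, bytes 0x80 to 0x9F map to other characters and this sketch leaves that text alone.

```csharp
using System;
using System.Text;

// Hypothetical helper, applied to text that has already been decoded once.
public static class DoubleUtf8
{
    // Strict UTF-8: throws on ill-formed input instead of emitting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Undoes up to maxPasses extra layers; the caller decides how deep to go.
    public static string Repair(string text, int maxPasses = 1)
    {
        for (int pass = 0; pass < maxPasses; pass++)
        {
            string next = TryUndoOneLayer(text);
            if (next == null) break; // nothing (more) to undo
            text = next;
        }
        return text;
    }

    // Returns the text with one layer removed, or null if it does not look double-encoded.
    private static string TryUndoOneLayer(string text)
    {
        var bytes = new byte[text.Length];
        bool sawHighByte = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c > 0xFF) return null;         // cannot have come from a single byte
            if (c >= 0x80) sawHighByte = true;
            bytes[i] = (byte)c;                // Latin-1: code point == byte value
        }

        if (!sawHighByte) return null;         // pure ASCII, nothing to repair

        try
        {
            return StrictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            return null;                       // not valid UTF-8, so leave it alone
        }
    }
}
```

With a non-breaking space that went through UTF-8 twice, `DoubleUtf8.Repair("\u00C2\u00A0")` gives back `"\u00A0"`, while plain ASCII and an ordinary `"caf\u00E9"` come back unchanged. Because it treats the whole string as one unit, it is better applied per CSV field than per file, and it can still be fooled by text that legitimately contains something like "Â" followed by a non-breaking space.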
-
Because in the decades that UTF-8 has been available, you're the second person to need the feature.
-
Adding, there's another issue. What if your intent was to embed control characters into UTF-8? .NET cannot do this for you without breaking the UTF-8 spec.
-
honey the codewitch wrote:
you're the second person to need the feature.
Probably the first. The other guy on the team hadn't noticed the issue.
-
honey the codewitch wrote:
What if your intent was to embed control characters into UTF-8?
For instance?
honey the codewitch wrote:
.NET cannot do this for you
Bet it can. :cool:
-
Yeah Microsoft could break UTF8 to make you happy and make everyone else mad. And make .NET broken. I'll get back to you when someone besides you thinks this is a good idea.
-
But seriously, what are you saying it can't do?
-
I'm saying they can't recursively decode UTF-8 without breaking the spec. Edit: I feel like I'm peeing in your Wheaties, but that's not my intent. I'm just saying it's not .NET's place to satisfy your requirement. You could write a NuGet package for it, but it's completely non-standard behavior and would break the spec + potentially break other code.
-
honey the codewitch wrote:
they can't recursively decode UTF-8 without breaking the spec.
I don't see how you arrive at that conclusion.
honey the codewitch wrote:
not .NET's place to satisfy your requirement
I agree.
honey the codewitch wrote:
would break the spec
In what way exactly? Particularly if the caller has control over whether or not it does. But you mentioned something about writing control characters in UTF-8 -- which include carriage-return, line-feed, form-feed, etc. -- so I don't understand what you meant that it would break UTF-8. Whatever situation you are trying to communicate, I am sure .net can do it already, and it doesn't "break UTF-8".
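The control-character part is easy to check. A quick sketch (the class and method names are just for illustration) that pushes text through a strict UTF-8 encoder and decoder, one that throws on anything ill-formed, and compares the result with the input:

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class ControlCharRoundTrip
{
    // True if the text survives a strict UTF-8 encode/decode round trip unchanged.
    public static bool RoundTrips(string text)
    {
        // throwOnInvalidBytes: true makes ill-formed input an exception, not U+FFFD.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        byte[] bytes = strict.GetBytes(text);
        return strict.GetString(bytes) == text;
    }
}
```

`ControlCharRoundTrip.RoundTrips("a\r\n\f\tb\0")` comes back true: carriage-return, line-feed, form-feed, tab and even NUL are ordinary code points that UTF-8 encodes as the same single bytes ASCII uses.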