Encoding rationale
-
(I didn't find a suitable programming forum for this question, as it is not about programming :-))
Years ago, I read arguments for the chosen UTF-8 encoding that I remember as convincing and rational.
Several standards encode variable-length integers in a simpler way: 7 bits in each octet, with the upper bit set in all but the last octet.
Reading/decoding: If the top payload bit (0x40) of the first octet is set, set the destination to -1, otherwise (or if the value is unsigned) to 0. Then repeat: shift the destination value 7 bits left; add the next byte AND 0x7F; until byte AND 0x80 = 0.
Writing/encoding: Zero- or sign-extend the value to a multiple of 7 bits. Starting from the top bits: if the next run of 8 bits (i.e. the 7 that would be stored plus the top bit of the following group) are all identical (all 1 or all 0), skip to the next group. Otherwise, loop over the remaining 7-bit groups: if not the last group, OR with 0x80; store as the next octet. (Both routines are sketched in C below.)
One argument (e.g. in Wikipedia) in favor of the UTF8 encoding is that if you jump right into an octet sequence, finding the start of the character is simple. But it is no more difficult to search for an octet with the top bit clear.
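For concreteness, here is roughly what those two routines could look like in C. This is my own sketch of the signed variant described above (the names vlq_decode and vlq_encode are mine, not from any standard), with no bounds checking:

#include <stdint.h>
#include <stddef.h>

/* Decode one signed value; returns the number of octets consumed.
   Groups are stored most significant first, as described above. */
size_t vlq_decode(const uint8_t *p, int64_t *out)
{
    const uint8_t *start = p;
    /* Sign-extend from the top payload bit (0x40) of the first octet. */
    int64_t v = (*p & 0x40) ? -1 : 0;
    for (;;) {
        v = (v << 7) | (*p & 0x7F);   /* shift in 7 payload bits */
        if ((*p++ & 0x80) == 0)       /* continuation bit clear: done */
            break;
    }
    *out = v;
    return (size_t)(p - start);
}

/* Encode into out[] (caller provides at least 10 octets);
   returns the number of octets written. */
size_t vlq_encode(int64_t v, uint8_t *out)
{
    int n = 10;   /* 64 bits need at most ten 7-bit groups */
    /* Drop leading groups that are pure sign extension: the 7 bits
       plus the top bit of the next group must all be 0 or all be 1. */
    while (n > 1) {
        int64_t top = v >> (7 * (n - 1) - 1);
        if (top != 0 && top != -1)
            break;
        n--;
    }
    size_t len = 0;
    for (int i = n - 1; i >= 0; i--) {
        uint8_t b = (v >> (7 * i)) & 0x7F;
        if (i > 0)
            b |= 0x80;   /* 'there is more' flag */
        out[len++] = b;
    }
    return len;
}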
Using the 0x80 bit as a 'There is more!' flag allows all 21-bit Unicode characters to be encoded in three octets, rather than UTF8's four. Eight times as many character codes would fit in two octets. The format can handle 32, 64, 128 bits and any other integer length.
I vaguely remember (from long ago) convincing arguments making me think that the UTF8 designers did the right thing. I just can't recall those arguments. Can anyone guide me to a place to find them? (Or repeat them here!)
At the moment I am unable to see why the simple 'upper bit means there is more' would not be just as good as, or in some respects better than, the UTF8 encoding.
-
UTF-8 has these advantages as I see it...
- It does not waste memory (at the byte level)
- Backward compatible to ASCII
- Can mix languages freely
His proposal fits points 1 and 2. If I understand it correctly, it allows you to encode 2^14 code points in 2 bytes vs 2^11 in UTF-8. On one side of the balance you have roughly 14,000 code points that would get a shorter encoding, and on the other side, the drawbacks of yet another encoding standard. I think I'd vote for keeping just one standard.
-
A simple "Uppermost bit means 'There is more'" has all these advantages, plus it saves space compared to UTF8: Four times as many characters can be coded in two octets where UTF8 requires three, and all remaining characters can be coded in three octets, where UTF8 requires four octets for any character from 0x10000 and up.
The second point applies to the simpler coding scheme as well.
The third point, about mixing languages freely, is a function of using a larger (21-bit) character code. Any encoding capable of encoding 21-bit values can handle it.
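To put numbers on the space claim, here is a sketch of the per-code-point lengths under each scheme (the function names are mine, for illustration only):

#include <stdint.h>

/* Octets needed for a code point under UTF-8. */
int utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* U+0000  - U+007F   */
    if (cp < 0x800)   return 2;   /* U+0080  - U+07FF   */
    if (cp < 0x10000) return 3;   /* U+0800  - U+FFFF   */
    return 4;                     /* U+10000 - U+10FFFF */
}

/* Octets needed under the 'upper bit means more' scheme
   (7 payload bits per octet, unsigned). */
int vlq_len(uint32_t cp)
{
    if (cp < 0x80)   return 1;    /* up to  7 bits */
    if (cp < 0x4000) return 2;    /* up to 14 bits */
    return 3;                     /* up to 21 bits, i.e. all of Unicode */
}

The two schemes differ exactly for U+0800 through U+3FFF (three octets vs two) and for U+10000 and up (four vs three); nowhere is the simpler scheme longer.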
I still wonder why the UTF8 designers chose such a complex encoding scheme when a much simpler one would satisfy "all" requirements. Or: all requirements that I can see. There must be other requirements that I do not see.
-
@trønderen It's about error handling. If the file or stream you're reading is corrupted, there is a much higher chance that the character will be replaced with the replacement character "�" (U+FFFD) instead of being just wrong (assuming a properly implemented decoder).
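Roughly, the structural checks a strict decoder applies before accepting a sequence look like this (a simplified sketch of my own, not production code; it assumes enough input bytes are available):

#include <stdint.h>

/* Returns the sequence length, or 0 if a strict decoder would
   reject the sequence and emit U+FFFD instead. */
int utf8_check(const uint8_t *p)
{
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    int len;
    uint32_t cp;

    if (p[0] < 0x80) return 1;                              /* ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; }
    else return 0;                /* stray continuation byte, or 0xF8+ */

    for (int i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min_cp[len]) return 0;            /* overlong encoding */
    if (cp > 0x10FFFF) return 0;               /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;/* UTF-16 surrogate */
    return len;
}

A flipped bit in a multi-octet sequence usually violates one of these checks; that is the "much higher chance" referred to above.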
-
Is the robustness against errors based simply on the logic that there are about 4000 invalid codes for every valid one, so a random bit pattern is unlikely to be a 'valid' code point; in other words, that any encoding using only a small fraction of the code space would do? Or is there something particular to the way UTF8 does it?
As long as you stick to ASCII text, i.e. single octet UTF8 encoding, the error must hit the upper bit to make an invalid code; 7 of 8 bit flips will go unnoticed. Even if the upper bit is flipped, the code may end up as a 'valid' multibyte code point.
About 30 years ago (read: when technology was less developed than today), I was at a presentation of Frame Relay. One guy in the audience questioned the end-to-end checksum verification, replacing X.25 hop-by-hop verification. The speaker responded with a grin: In modern fiber networks, there are no transmission errors! (Then he moderated the statement somewhat: The error frequency is so low that even if a frame must be retransmitted all the way when it happens, it is much cheaper than hop-by-hop checking.)
I can't think of any other data type where the choice of encoding is affected by error robustness considerations; take IEEE 754 as an example. The common strategy is not to protect each individual element, but to add robustness by collecting a number of them into a block that is augmented by a (data type independent) checksum or error correction value.
Do you happen to know a link to a UTF8 encoding discussion that might enlighten the error protection argument?
(A comment to @Mircea-Neacsu: Even though I compare UTF8 to a simpler encoding, I did not mean to propose it as a UTF8 replacement; it was just for comparison purposes, to learn the true rationale behind UTF8. I am very much against the proliferation of alternatives we have almost everywhere in the programming world, and would seriously want UTF8 to be the single external coding of all text!)
-
Well, the original draft proposal is a good start: https://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf
And the RFC: https://www.rfc-editor.org/rfc/rfc3629
-
I think I may have found an answer to your question (thanks @Jorgen-Andersson for the link): with UTF-8 the first byte of the encoding indicates the number of expected continuation bytes.
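In code, that property would look something like this (a sketch; the function name is mine):

#include <stdint.h>

/* Total sequence length announced by a lead byte (0 = not a lead byte). */
int utf8_seq_len(uint8_t lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                  /* 10xxxxxx continuation, or invalid */
}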
Your proposed system, I'll call it T-8 for brevity, is also somewhat less robust for initial sync if you are "eavesdropping" on a communication stream. You have to drop all initial bytes with the 8th bit set, up to and including the first byte with the 8th bit clear. Only after that can you be sure that you are in sync.
-
@Mircea-Neacsu: the first byte of the encoding indicates the number of expected continuation bytes.
Well, if it were so ... If a byte is the first of two, three or four bytes: yes. If it is the first of one byte: no. When an argument of 'expected continuation bytes' holds sometimes, but not always, the argument fails. (With Western text, the rule is broken regularly.)
@Mircea-Neacsu : less robust for initial sync if you are "eavesdropping"
Making it simpler for an eavesdropper to break into a communication and synchronize is rarely a primary design criterion for an encoding ...
Besides, it makes little difference whether you drop one, two or three bytes with '10' in the upper bits, or one byte with '1' in the upper bit. If there are two of those, then you know that you have the entire character code (the longest valid one is two bytes with 1 at the top plus the following one with 0 at the top), like in UTF8, where a byte with '110', '1110' or '11110' at the top tells you that you are at the start of the character.
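For comparison, the two resynchronisation loops, sketched side by side (my own illustration; both assume a character boundary occurs within the buffer):

#include <stdint.h>
#include <stddef.h>

/* UTF-8: skip continuation octets (10xxxxxx); the first octet
   that is not one IS the start of a character. */
size_t utf8_sync(const uint8_t *p, size_t n)
{
    size_t i = 0;
    while (i < n && (p[i] & 0xC0) == 0x80)
        i++;    /* at most 3 skips in valid data */
    return i;
}

/* 'Upper bit means more': skip until an octet with the top bit
   clear (the end of the current value); the octet AFTER it
   starts the next value. */
size_t vlq_sync(const uint8_t *p, size_t n)
{
    size_t i = 0;
    while (i < n && (p[i] & 0x80) != 0)
        i++;    /* at most 2 skips for 21-bit codes */
    return i + 1;
}

The practical difference is one octet of lookahead: UTF-8 lands on a start octet, while the simpler scheme lands just past a terminator.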
OK, so I see that there are 'arguments' in favor of the UTF8 design. But I do not accept that 'any argument is a good argument'. I do not see any of the arguments presented as 'good design arguments', whether it is making an eavesdropper's syncing easier, a rule about trailing bytes that holds sometimes but not in the most common case, or error robustness where, in the most common case, 7 out of 8 random bit errors go unnoticed.
There was a (pre-internet) network named 'Bitnet', 'bit' being an acronym for 'Because It's There'. That's a really strong argument in favor of UTF8: It is there, and at least for web pages, it seems to be capable of clearing the ground, getting rid of the umpteen squared competing alternative encodings. Let's hope that it spreads to all computer applications, not just web pages.
Success is not equivalent to design or engineering excellence (just look at the internet ...), but a less-than-perfect design is much to be preferred over complete chaos. UTF8 is a prime example.
I am non-PC in that I want to be well aware of the weaknesses as well as the strengths of the tools I am using. I guess I am well aware of UTF8's weaknesses. My initial question was intended as a search for true strengths that I have overlooked. It seems there is not much to speak of.
Nevertheless, I will continue to advocate UTF8. Because it is there, and that is far better than complete chaos.
Thanks to all for the comments you have made!