Simple Encoding

eggie5

I was trying to explain encoding simply to some people, using pretty absolute terms to limit confusion. Can you guys tell me if there are any huge fallacies?

"There are two major character encodings in use today: ASCII and UTF-8. ASCII is a set of character encodings based upon the English alphabet and is one of the first popular encoding systems of the computer age (a common encoding within the ASCII system is Windows-1252). Being based upon the English alphabet, the ASCII system only has support for about 96 printable characters, which implicitly includes support for French, Spanish, and other Latin-based languages. It does not, however, support characters for non-Latin-based languages such as the CJK family of East Asian scripts; this is where UTF-8 and the Unicode initiative come into play.

UTF-8, the other major character encoding, is a subset of Unicode, which is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers ("UTF-8" and "Unicode" can be used synonymously, just as ASCII can be used synonymously with Windows-1252, etc.). UTF-8 is able to represent virtually any character in the Unicode standard, which includes virtually all characters of every language in the world, yet it is also backward compatible with ASCII. It is for this reason that it is steadily becoming the preferred encoding over ASCII. In fact, the client you are viewing this with is very likely using some sort of Unicode derivation to encode the characters you see.

UTF-8 is the most common encoding encountered on the web, Java and Windows use UTF-16, and UTF-32 is used by various UNIX systems (Unicode.com). ASCII is to Windows-1252 as Unicode is to UTF-8. ASCII and Unicode are character sets, while Windows-1252 and UTF-8 are character encodings."

Jeremy Falcon

Well, as I see it, it addresses the different types of character encodings rather than what an encoding is or what it is good for. Of course, if that's the intent, then it seems fine to me.

eggie5

Uh, can you rephrase that? I don't understand...

Jeremy Falcon

It's explaining the different types of character encodings rather than what a character encoding is.

eggie5

Yeah, I guess explaining what character encoding is would be TOO technical... SO, it's good it's not there. Thanks.

Jorgen Sigvardsson

I'm not so sure about your definitions. The Unicode <-> UTF-8 part I have no problem with. ASCII is a strict definition of 128 characters, encoded using 7 bits. ASCII-8 is an extension of that, and defines exactly 256 characters. Windows-1252 also defines 256 characters. They're all character sets, as they define exactly where each character is in the "alphabet". So from my point of view, "ASCII is to Windows-1252 as Unicode is to UTF-8" doesn't hold.
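To make the 7-bit limit concrete, here is a minimal Python sketch. It is only an illustration: the helper name `try_encode` is made up for this example, and `"cp1252"` is the name Python's standard codecs use for Windows-1252. It checks the highest code point ASCII can encode and how an accented letter fares in each scheme.

```python
def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

# Highest code point that ASCII can encode: 127, i.e. 128 characters in 7 bits.
ascii_limit = max(cp for cp in range(256) if try_encode(chr(cp), "ascii") is not None)

e_acute = "\u00e9"  # the letter e with an acute accent, written as an escape

e_acute_ascii = try_encode(e_acute, "ascii")    # None: not part of ASCII
e_acute_cp1252 = try_encode(e_acute, "cp1252")  # one byte, 0xE9, in Windows-1252
e_acute_utf8 = try_encode(e_acute, "utf-8")     # two bytes in UTF-8

print(ascii_limit, e_acute_ascii, e_acute_cp1252, e_acute_utf8)
```

So French or Spanish text with accents already falls outside strict ASCII; it needs one of the 8-bit extensions or a Unicode encoding.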

eggie5

So ASCII, ASCII-8 and Unicode are all character sets? /\ |_ E X E GG

Jorgen Sigvardsson

I believe so. This list seems to imply that. IIRC, Unicode was the character set designed to kill the need for every other character set, as it's supposed to be large enough.

Jorgen Sigvardsson

ASCII is to the alphabet, as Unicode is to the union of all alphabets (or the sum of all alphabets - to make it easier on the layman's ear). I think that would be an appropriate analogy.

jhaga

I like Joel's article on the subject: http://www.joelonsoftware.com/articles/Unicode.html

eggie5

ASCII is to the English alphabet as Unicode is to all the alphabets in the world? Thanks, I like that.

Jorgen Sigvardsson

Something like that, yes. And to explain UTF-8, use Morse code for an analogy. Morse code conveys the same thing as written language, it's just communicated a bit differently.
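The Morse code analogy can be sketched in Python: one Unicode character (one code point) written out as bytes in three different ways. This is just an illustration; the sample character and the variable names are picked for the example, and the big-endian codec variants are used so that no byte order mark is added.

```python
ch = "\u4e2d"         # a CJK character, code point U+4E2D
code_point = ord(ch)  # the character's number in the Unicode character set

utf8 = ch.encode("utf-8")       # 3 bytes: e4 b8 ad
utf16 = ch.encode("utf-16-be")  # 2 bytes: 4e 2d
utf32 = ch.encode("utf-32-be")  # 4 bytes: 00 00 4e 2d

# Different byte sequences, same character once decoded.
same = (
    utf8.decode("utf-8")
    == utf16.decode("utf-16-be")
    == utf32.decode("utf-32-be")
    == ch
)

# Backward compatibility: plain ASCII text is byte-for-byte identical in UTF-8.
ascii_compatible = "Hello".encode("ascii") == "Hello".encode("utf-8")

print(hex(code_point), utf8.hex(), utf16.hex(), utf32.hex(), same, ascii_compatible)
```

The character set says which number a character has; the encoding says how that number is laid out in bytes.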

eggie5

"What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used." Thought you might be interested in that...
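For a rough feel of what guessing an encoding involves, here is a much simpler Python sketch. It is not the frequency analysis described in the quote and not how any particular browser or editor works; it only looks for a byte order mark, then tries strict UTF-8, then falls back to Windows-1252. The function name `guess_decode` and the fallback choice are assumptions made for this example.

```python
import codecs

def guess_decode(data: bytes):
    """Return (text, encoding_name) using a crude three-step guess."""
    # 1. A byte order mark at the start is a strong hint.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the mark and picks the byte order itself.
        return data.decode("utf-16"), "utf-16"
    # 2. Bytes that are not UTF-8 rarely happen to be valid UTF-8, so try it strictly.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # 3. Fall back to a single-byte encoding (assumed here: Windows-1252).
        return data.decode("cp1252", errors="replace"), "cp1252"

print(guess_decode(b"caf\xc3\xa9"))  # valid UTF-8
print(guess_decode(b"caf\xe9"))      # not valid UTF-8, falls back to cp1252
```

A guess like this can still be wrong, which is why declaring the encoding in the Content-Type header or meta tag is the safer route.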

Michael Dunn

You're mixing up the concepts of character set and encoding. The Joel article mentioned earlier covers this in more detail.

RandomMonkey

James's Catch22 page may also be of assistance.

Jeremy Falcon

eggie5 wrote:

Thought you might be interested in that...

Thanks. Funny thing is, that's exactly what Notepad also does to detect which set the file is in.