Simple Encoding
-
I was trying to explain encoding simply to some people, using fairly absolute terms to limit confusion. Can you guys tell me if there are any huge fallacies? "There are two major character encodings in use today: ASCII and UTF-8. ASCII is a set of character encodings based on the English alphabet and is one of the first popular encoding systems of the computer age (a common encoding within the ASCII system is Windows-1252). Being based on the English alphabet, the ASCII system only supports about 96 printable characters, which implicitly includes support for French, Spanish, and other Latin-based languages. It does not, however, support characters for non-Latin-based languages such as the CJK family of East Asian scripts; this is where UTF-8 and the Unicode initiative come into play. UTF-8, the other major character encoding, is a subset of Unicode, which is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers ("UTF-8" and "Unicode" can be used synonymously, as can "ASCII" and "Windows-1252", etc.). UTF-8 is able to represent virtually any character in the Unicode standard, which includes virtually all characters of every language in the world, yet it is also backward compatible with ASCII. For this reason it is steadily becoming the preferred encoding over ASCII. In fact, the client you are viewing this with is very likely using some sort of Unicode derivation to encode the characters you see. UTF-8 is the most common encoding encountered on the web, Java and Windows use UTF-16, and UTF-32 is used by various UNIX systems (Unicode.com). ASCII is to Windows-1252 as Unicode is to UTF-8. ASCII and Unicode are character sets, while Windows-1252 and UTF-8 are character encodings." /\ |_ E X E GG
I'm not so sure about your definitions. I have no problem with the Unicode <-> UTF-8 part. ASCII is a strict definition of 128 characters, encoded using 7 bits. ASCII-8 is an extension of that and defines exactly 256 characters. Windows-1252 also defines 256 characters. They're all character sets, as they define exactly where each character sits in the "alphabet". So from my point of view, "ASCII is to Windows-1252 as Unicode is to UTF-8" doesn't hold.
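To make the 7-bit versus 8-bit point concrete, here is a minimal Python sketch. The sample strings and the helper name `fits_in_ascii` are my own illustration, not something from the posts above; `cp1252` is Python's codec name for Windows-1252.

```python
# ASCII defines code points 0..127, so every ASCII character fits in 7 bits.
ascii_text = "Hello"
ascii_bytes = ascii_text.encode("ascii")
assert all(b < 128 for b in ascii_bytes)

# Windows-1252 is a single-byte code page whose lower half (0..127)
# is identical to ASCII, so plain English text encodes the same way.
assert ascii_text.encode("cp1252") == ascii_bytes

# 'é' is outside ASCII but has a slot in Windows-1252 (byte 0xE9).
cp1252_bytes = "é".encode("cp1252")


def fits_in_ascii(text: str) -> bool:
    """Return True if every character of text is in the 7-bit ASCII range."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

So French or Spanish text with accented letters needs something beyond plain ASCII, which is one of the places where the quoted explanation goes wrong.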
-
ASCII is to the alphabet, as Unicode is to the union of all alphabets (or the sum of all alphabets - to make it easier on the layman's ear). I think that would be an appropriate analogy.
-
I like Joel's article on the subject: http://www.joelonsoftware.com/articles/Unicode.html jhaga
-
ASCII is to the English alphabet as Unicode is to all the alphabets in the world? Thanks, I like that. /\ |_ E X E GG
Something like that, yes. And to explain UTF-8, use Morse code for an analogy. Morse code conveys the same thing as written language, it's just communicated a bit differently.
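A rough Python sketch of the Morse code analogy, in case it helps: the character's Unicode code point is the "letter", and UTF-8 is just one way of transmitting it as bytes. The helper name `describe` and the sample characters are made up for illustration.

```python
def describe(ch: str) -> str:
    """Show a character's Unicode code point next to its UTF-8 bytes."""
    code_point = ord(ch)          # position in the Unicode character set
    utf8 = ch.encode("utf-8")     # how UTF-8 writes that position as bytes
    return f"U+{code_point:04X} -> {utf8.hex(' ')}"


for ch in ["A", "é", "€", "中"]:
    print(ch, describe(ch))

# The same code point comes out as different bytes under another encoding.
euro_utf16 = "€".encode("utf-16-be")
```

Note that "A" comes out as the single byte 41, the same as in ASCII, which is the backward compatibility mentioned in the original post, while the other characters take two or three bytes each.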
-
It's explaining the different types of character encodings rather than what a character encoding is. Jeremy Falcon
"What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used." Thought you might be interested in that... /\ |_ E X E GG
-
You're mixing up the concepts of character set and encoding. The Joel article mentioned earlier covers this in more detail. --Mike--
-
James's Catch22 page may also be of assistance.
-
eggie5 wrote:
Thought you might be interested in that...
Thanks. Funny thing is, that's exactly what Notepad also does to detect which encoding a file is in. Jeremy Falcon
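For anyone curious what "guessing" can look like in code, here is a deliberately simplified Python sketch. It is not the byte-frequency analysis described in the quote, and I am not claiming it is what Internet Explorer or Notepad actually do; it only checks for a byte-order mark and then tests whether the bytes are valid UTF-8. The function name `guess_encoding` and the Windows-1252 fallback are my own assumptions.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Very rough guess at a text encoding (illustrative only)."""
    # A byte-order mark, when present, is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Random single-byte text is rarely valid UTF-8, so a clean
    # strict decode is decent evidence that the data really is UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: treat anything else as Windows-1252.
        return "cp1252"
```

A real detector would need statistics over many bytes to tell the single-byte code pages apart, since almost any byte sequence decodes "successfully" under most of them.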