Code Project
Simple Encoding

The Lounge
16 Posts 6 Posters
    eggie5
    wrote on last edited by
    #1

    I was trying to explain encoding simply to some people, using fairly absolute terms to limit confusion. Can you guys tell me if there are any huge fallacies?

    "There are two major character encodings in use today: ASCII and UTF-8. ASCII is a set of character encodings based upon the English alphabet and is one of the first popular encoding systems of the computer age (a common encoding within the ASCII system is Windows-1252). Being based upon the English alphabet, the ASCII system only has support for about 96 printable characters, which implicitly includes support for French, Spanish, and other Latin-based languages. It does not, however, support characters for non-Latin-based languages such as the CJK family of East Asian scripts; this is where UTF-8 and the Unicode initiative come into play.

    UTF-8, the other major character encoding, is a subset of Unicode, which is an industry standard designed to allow text and symbols from all languages to be consistently represented and manipulated by computers (“UTF-8” and “Unicode” can be used synonymously, as can ASCII and Windows-1252, etc.). UTF-8 is able to represent virtually any character in the Unicode standard, which includes virtually all characters of every language in the world, yet it is also backward compatible with ASCII. It is for this reason that it is steadily becoming the preferred encoding over ASCII. In fact, the client you are viewing this with is very likely using some sort of Unicode derivation to encode the characters you see. UTF-8 is the most common encoding encountered on the web, Java and Windows use UTF-16, and UTF-32 is used by various UNIX systems (Unicode.com).

    ASCII is to Windows-1252 as Unicode is to UTF-8. ASCII and Unicode are character sets, while Windows-1252 and UTF-8 are character encodings."

      Jeremy Falcon
      wrote on last edited by
      #2

      Well, as I see it, it addresses different types of character encodings rather than what an encoding is or is good for. Of course, if that's the intent, then it seems fine to me.

        eggie5
        wrote on last edited by
        #3

        Uh, can you rephrase that? I don't understand...

          Jeremy Falcon
          wrote on last edited by
          #4

          It's explaining the different types of character encodings rather than what a character encoding is.

            eggie5
            wrote on last edited by
            #5

            Yeah, I guess explaining what character encoding is would be TOO technical... So it's good it's not there. Thanks.

              Jorgen Sigvardsson
              wrote on last edited by
              #6

              I'm not so sure about your definitions. The Unicode <-> UTF-8 part I have no problem with. ASCII is a strict definition of 128 characters, encoded using 7 bits. ASCII-8 is an extension of that, and defines exactly 256 characters. Windows-1252 also defines 256 characters. They're all character sets, as they define exactly where each character is in the "alphabet". So from my point of view, "ASCII is to Windows-1252 as Unicode is to UTF-8" doesn't hold.
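The difference between the 7-bit ASCII range and an 8-bit extension such as Windows-1252 can be checked with a short Python sketch. The sample strings are made up for illustration, and `cp1252` is Python's codec name for Windows-1252.

```python
# ASCII assigns values 0-127 only, so every ASCII byte fits in 7 bits.
plain = "cafe"
ascii_bytes = plain.encode("ascii")
all_seven_bit = all(b < 128 for b in ascii_bytes)

# Windows-1252 uses the full byte, which makes room for accented Latin letters.
accented = "caf\u00e9"  # "café" with a precomposed e-acute
cp1252_bytes = accented.encode("cp1252")  # the e-acute becomes the single byte 0xE9

# Strict ASCII has no slot for the accented letter, so encoding fails.
try:
    accented.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(all_seven_bit, cp1252_bytes, ascii_can_encode)
```

The accented letter lives in the upper half of the byte range (128-255), which plain ASCII does not define at all.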

                eggie5
                wrote on last edited by
                #7

                So ASCII, ASCII-8 and Unicode are all character sets?

                  Jorgen Sigvardsson
                  wrote on last edited by
                  #8

                  I believe so. This list seems to imply that. IIRC, Unicode was the character set designed to kill the need for all other character sets, as it's supposed to be large enough.

                    Jorgen Sigvardsson
                    wrote on last edited by
                    #9

                    ASCII is to the alphabet, as Unicode is to the union of all alphabets (or the sum of all alphabets - to make it easier on the layman's ear). I think that would be an appropriate analogy.
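To make the "union of all alphabets" idea concrete, here is a small Python sketch that looks up the Unicode code point and official name of a few characters from different scripts. The sample characters are an arbitrary choice for illustration; `ord` and `unicodedata.name` are standard-library calls.

```python
import unicodedata

# One character each from Latin, accented Latin, Cyrillic and CJK.
samples = ["A", "\u00e9", "\u0416", "\u4e2d"]

for ch in samples:
    # A character set assigns each character a number (its code point).
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

code_points = [ord(ch) for ch in samples]
```

Note that "A" keeps the same number it has in ASCII (65), while the other characters get numbers well outside the ASCII range. Nothing here says how those numbers are stored as bytes; that is the job of an encoding.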

                      jhaga
                      wrote on last edited by
                      #10

                      I like Joel's article on the subject: http://www.joelonsoftware.com/articles/Unicode.html

                        eggie5
                        wrote on last edited by
                        #11

                        ASCII is to the English alphabet as Unicode is to all alphabets in the world? Thanks, I like that.

                          Jorgen Sigvardsson
                          wrote on last edited by
                          #12

                          Something like that, yes. And to explain UTF-8, use Morse code for an analogy. Morse code conveys the same thing as written language, it's just communicated a bit differently.
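In the spirit of the Morse code analogy, a rough Python sketch can show the same three characters written out in three different encodings. The sample text is made up for illustration; the codec names are the ones Python's standard library uses.

```python
text = "A\u00e9\u4e2d"  # Latin A, e-acute, and one CJK ideograph

# Same characters, three different byte-level "transmissions".
utf8 = text.encode("utf-8")       # 1 + 2 + 3 = 6 bytes
utf16 = text.encode("utf-16-be")  # 2 bytes per character here = 6 bytes
utf32 = text.encode("utf-32-be")  # 4 bytes per character = 12 bytes

for name, raw in [("utf-8", utf8), ("utf-16-be", utf16), ("utf-32-be", utf32)]:
    print(f"{name:10} {len(raw):2} bytes  {raw.hex()}")

# Decoding each byte sequence with the matching codec gives back the same text.
round_trips = all(
    raw.decode(name) == text
    for name, raw in [("utf-8", utf8), ("utf-16-be", utf16), ("utf-32-be", utf32)]
)

# For characters in the ASCII range, UTF-8 produces exactly the ASCII bytes.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```

The message (the sequence of characters) is identical in all three cases; only the way it is signalled as bytes differs.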

                            eggie5
                            wrote on last edited by
                            #13

                            "What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used." Thought you might be interested in that... /\ |_ E X E GG

                              Michael Dunn
                              wrote on last edited by
                              #14

                              You're mixing up the concepts of character set and encoding. The Joel article mentioned earlier covers this in more detail.

                                RandomMonkey
                                wrote on last edited by
                                #15

                                James's Catch22 page may also be of assistance.

                                  Jeremy Falcon
                                  wrote on last edited by
                                  #16

                                  eggie5 wrote:

                                  Thought you might be interested in that...

                                  Thanks. Funny thing is, that's exactly what Notepad does too, to detect which encoding a file is in.
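For a feel of what guessing an encoding involves, here is a deliberately simplified Python sketch. It is not the byte-frequency analysis described in the quote, and it is not how Internet Explorer or Notepad actually work. It swaps in a much cruder technique: look for a byte order mark, then try strict UTF-8, then fall back to Windows-1252. The function name `guess_encoding` is hypothetical.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: BOM check, then strict UTF-8, then a cp1252 fallback."""
    # UTF-32 BOMs are tested before UTF-16 because the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # Random non-UTF-8 bytes rarely form valid UTF-8, so a clean strict
    # decode is reasonable evidence (not proof) that the data is UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is treated as Windows-1252.
        return "cp1252"


print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("cp1252")))
```

Any such guess can be wrong, which is why declaring the encoding explicitly (for example in the Content-Type header) is the safer route.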
