Encoding Question

eggie5

Would it be correct to call these characters, 翻訳と通訳, non-ISO Latin chars? If not, what is the correct name for them? Unicode? I don't think that Japanese/Chinese is Latin...

PJ Arends

I'd call them 'squares' ;P

"You're obviously a superstar." - Christian Graus about me - 12 Feb '03 "Obviously ??? You're definitely a superstar!!!" - mYkel - 21 Jun '04 "There's not enough blatant self-congratulatory backslapping in the world today..." - HumblePie - 21 Jun '05 Within you lies the power for good - Use it!

Josh Martin

I'm dealing heavily with Unicode data right now, and I just refer to them as Unicode characters, as that encompasses the entire range that I'm dealing with. The "non-ISO latin chars" confused me for a sec, because I initially read it as "(non-ISO) latin chars" instead of "(non-ISO latin) chars". Josh Find a penny, pick it up, and all day long you'll have a back-ache...

eggie5

So, Unicode encompasses every character there is, right? Or is that UTF? Anyway, what term can I use to differentiate those symbols from, say, standard English, in encoding lingo? /\ |_ E X E GG

Josh Martin

The Unicode character set contains all of the characters (Latin, Hebrew, Arabic, Chinese, Japanese, etc.) laid out by the Unicode standard. UTF-8 and UTF-16 are encoding schemes for storing a Unicode code point in a binary representation.
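To make the distinction between the character set and its encodings concrete, here is a minimal Python sketch. It is only an illustration, not something from the thread: the sample character is borrowed from the original question, and the variable names are arbitrary.

```python
# A minimal sketch: one character, one code point, two byte encodings.
ch = "翻"  # sample character taken from the question above

# The code point is the abstract number Unicode assigns to the character.
code_point = ord(ch)
print(f"code point: U+{code_point:04X}")

# UTF-8 and UTF-16 serialize that same number into different byte sequences.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte-order mark
print("UTF-8 :", utf8_bytes.hex(" "), f"({len(utf8_bytes)} bytes)")
print("UTF-16:", utf16_bytes.hex(" "), f"({len(utf16_bytes)} bytes)")

# Plain English letters are Unicode characters as well;
# in UTF-8 each of them takes a single byte.
ascii_bytes = "alfdkjsf".encode("utf-8")
print(len(ascii_bytes), "bytes for 'alfdkjsf'")
```

Both byte sequences decode back to the same character, which is the sense in which UTF-8 and UTF-16 are just storage schemes for the same code point.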

eggie5

So what would be the most specific name for these, 三维地图, if I wanted to differentiate them from standard English text?

Josh Martin

I guess it would all depend on your target audience. For me, I'd probably just call them Japanese characters, even though they could just as well be Chinese (I studied some Japanese in school, and know that in some cases the Japanese Kanji are identical to the Chinese characters, but the words are pronounced differently). Since my target audience is mainly my QA department right now, I either just say "Unicode characters" or I identify the specific portion of the character set that I'm referring to at the time (Chinese/Japanese, Hebrew, Arabic, etc.).

eggie5

But if I said "Unicode Characters" wouldn't that mean "alfdkjsf" too?

cmk

Generally you will specify the code page id and description. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81rn.asp

Unicode numbers every character for every language. The Unicode 'number space' is carved into blocks/sets - one for each language/code page. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_6bqr.asp

...cmk Save the whales - collect the whole set
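As a rough illustration of the "carved into blocks" idea, the sketch below looks up which block a character's code point falls in. The table is a small hand-picked subset of block ranges, and `BLOCKS` and `block_of` are made-up names for this example. Note that blocks tend to follow scripts rather than individual languages: the CJK Unified Ideographs block, for instance, is shared by Chinese, Japanese, and Korean text.

```python
# Hand-picked subset of Unicode block ranges (inclusive), for illustration only.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]


def block_of(ch: str) -> str:
    """Return the name of the sample block containing ch, if any."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other (not in this sample table)"


# Characters taken from the thread, plus a plain ASCII letter for contrast.
for ch in "aと翻三":
    print(ch, f"U+{ord(ch):04X}", block_of(ch))
```

With something like this, "which portion of the character set is it?" becomes a range check on the code point, independent of how the text happens to be encoded.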

Shog9

eggie5 wrote:

I don't think that Japanese/Chinese is Latin...

That's for sure. :) I'm no expert, but here's a tool that'll let you get the script and character names (just paste the characters into the UTF-8 box, one at a time): http://isthisthingon.org/unicode/index.phtml
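If the online tool isn't handy, a similar lookup can be sketched with Python's standard `unicodedata` module. This is only an approximation of what such a tool does: the standard library reports the official character name but not a separate script property, so the first words of the name serve as a rough hint of the script. `describe` is a made-up helper name.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (character, code point, official Unicode name) for each character."""
    rows = []
    for ch in text:
        # The second argument is a fallback for characters without a name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows


# The sample string from the original question.
for ch, cp, name in describe("翻訳と通訳"):
    print(ch, cp, name)
```

The kanji come back with names beginning "CJK UNIFIED IDEOGRAPH-", while と is reported as a Hiragana letter, so a single Japanese phrase already mixes two scripts.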

My god, you're a genius! - Jörgen Sigvardsson, The Lounge

code frog

You would refer to those as Kanji. I did this a long time ago in C++. I believe that Unicode is a 4-byte wide-character set. Once upon a time you only used it if necessary, but internationalization has really exploded, such that most applications just use Unicode. So in your case, you are using a Unicode encoding to support Kanji. I believe it's Kanji even if it's Korean, Chinese, Japanese, etc... But for me it's been 4 years since I did this, and memory fades. I think the answer you want, though, is "Kanji".

{EDIT-ADDED} This looks very similar to the document I used way back when: http://www.cl.cam.ac.uk/~mgk25/unicode.html Here's the info on Kanji; do a Ctrl+F on the page to read the whole document with links.

Unicode X11 font names end with -ISO10646-1. This is now the officially registered value for the X Logical Font Descriptor (XLFD) fields CHARSET_REGISTRY and CHARSET_ENCODING for all Unicode and ISO 10646-1 16-bit fonts. The *-ISO10646-1 fonts contain some unspecified subset of the entire Unicode character set, and users have to make sure that whatever font they select covers the subset of characters needed by them. The *-ISO10646-1 fonts usually also specify a DEFAULT_CHAR value that points to a special non-Unicode glyph for representing any character that is not available in the font (usually a dashed box, the size of an H, located at 0x00). This ensures that users at least see clearly that there is an unsupported character.

The smaller fixed-width fonts such as 6x13 etc. for xterm will never be able to cover all of Unicode, because many scripts such as Kanji can only be represented in considerably larger pixel sizes than those widely used by European users. Typical Unicode fonts for European usage will contain only subsets of between 1000 and 3000 characters, such as the CEN MES-3 repertoire.

You might notice that in the *-ISO10646-1 fonts the shapes of the ASCII quotation marks have slightly changed to bring them in line with the standards and practice on other platforms. {END-EDIT-ADDED}

Some assembly required. Code-frog System Architects, Inc.

-- modified at 17:34 Wednesday 30th November, 2005

code frog

eggie5 wrote:

But if I said "Unicode Characters" wouldn't that mean "alfdkjsf" too?

I think you would also call those, specifically, glyphs, which is generally understood to mean symbols used in writing that convey much more meaning than a single written letter. So a glyph by itself might mean one thing, but the same glyph used with other glyphs may not have the same meaning at all. In fact, a totally different story might be told by using 2 glyphs together, for example...


Jorgen Sigvardsson

> I believe it's Kanji even if it's Korean, Chinese, Japanese, etc...

Both the Koreans and the Japanese use kanji (Chinese letters, developed during the Han dynasty if memory serves me right). While the Japanese use a simplified version, and the Chinese use both traditional and simplified versions, I don't know much about the Korean kanji. I believe the Koreans use kanji sparingly, because the only times I remember seeing them are in the context of martial arts. However, Koreans also use Hangul, which is a syllable symbology, although much more complex than our good old alphabet. The Japanese also use Katakana and Hiragana, which are both syllable symbologies. So, anything Korean or Japanese doesn't necessarily have to be kanji. :)

-- Pictures from my Japan trip.

Jorgen Sigvardsson

> The Unicode character set contains all of the characters (Latin, Hebrew, Arabic, Chinese, Japanase, etc) laid out by the Unicode standards.

Which is far from complete. Not even half of the Japanese kanji are in the standard. IIRC, 20000 or so out of 50000 are in the standard. It has even worried some Japanese scholars that the use of computers may cripple the written language! (I think they're on to something, and I don't think that problem is confined to the Japanese language only :sigh:)

code frog

Correct! As usual, Jorgen! :) But as I recall, on the team I worked on that was implementing support for Unicode across all the major languages, we just loosely called it "Kanji" if it was glyph based; that's more the point I was getting at. Although now, many years later, I would use the term "glyph based languages" instead. I might piss off some people if I lumped all glyph based languages into Kanji. So I'd call it "glyph", and then I'd call it Unicode, and after that I'd just call it work. If it's Unicode, elastic collisions, linear algebra or whatever, it's all work, right? :)


code frog

No WAY! Computers will explode into that problem and solve it. That challenge is ripe and begging for someone at MIT to solve over a weekend with a spare 9 volt battery, a slightly used postage stamp, a magazine insert and some copper wire. ;P Seriously though, we'll get that taken care of. I'm just glad I don't have to use a Kanji keyboard. Can you imagine? 18 feet long and 6 feet tall. It would have to be to get all those keys on it. I bet someone will implement a flexible keyboard that will auto-scroll with the motion of the hands. I should shut up now or I'll make someone rich. This idea is copyrighted by me. Don't touch it. It's mine. ;P


Jorgen Sigvardsson

> So I'd call it "glyph" and then I'd call it unicode and after that I'd just call it work. If it's unicode, elastic collisions, linear algebra or whatever it's all work right.

:-D I was screwed up at the University by the use of both Lisp and Scheme, hence I got stuck on symbols. These days I work a lot with barcodes and their associated symbologies, so I'm still stuck on symbols. Glyph sounds nicer though. I remember a class hierarchy in "Design Patterns: Elements of Reusable Object-Oriented Software", which was very sweet. Ever since, the word "glyph" has always had a positive ring in my ears. Whoops. This MDCO (Chief Miscellaneous Department Officer) is supposed to wake up in 7 hrs. :) (Good night, that is.. :-D)

-- modified at 18:41 Wednesday 30th November, 2005

Ryan Roberts

Really good article on Unicode

Ryan

O fools, awake! The rites you sacred hold Are but a cheat contrived by men of old, Who lusted after wealth and gained their lust And died in baseness—and their law is dust. al-Ma'arri (973-1057)

David Stone

Jörgen Sigvardsson wrote:

MDCO (Chief Miscellaneous Department Officer)

Shouldn't that be CMDO? :~

Picture a huge catholic cathedral. In it there's many people, including a gregorian monk choir. You know, those who sing beautifully. Then they start singing, in latin, as they always do: "Ad hominem..." -Jörgen Sigvardsson

-- modified at 19:29 Wednesday 30th November, 2005

Nemanja Trifunovic

Generally, these characters may be encoded with different character sets. Just by looking at them one can't say a thing about how they are encoded. If you want to differentiate them from western scripts, I think your best bet would be to call them "non-western scripts". That says nothing about encoding, and I believe that's what you want.
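A quick way to see that the characters themselves say nothing about their encoding is to encode the same string several ways. The sketch below is only an illustration under my own assumptions: the four encodings are an arbitrary pick from the codecs Python ships with, and `fits_latin1` is a made-up helper that checks whether text fits in ISO-8859-1 (ISO Latin-1).

```python
text = "翻訳と通訳"  # sample string from the original question

# Same characters, four different byte representations.
encoded = {
    name: text.encode(name)
    for name in ("utf-8", "utf-16-le", "shift_jis", "euc_jp")
}
for name, data in encoded.items():
    print(f"{name:10} {len(data):2} bytes  {data.hex(' ')}")


def fits_latin1(s: str) -> bool:
    """Return True if every character of s can be stored in ISO-8859-1."""
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1(text))                # the Japanese sample
print(fits_latin1("standard English"))  # plain ASCII text
```

Every byte sequence decodes back to the identical string, so a name for the characters is better based on their script than on any one encoding; a check like `fits_latin1` only tells you that they fall outside ISO Latin-1.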

My programming blahblahblah blog. If you ever find anything useful here, please let me know to remove it.