Unicode testing

Henry Minute

Every time this sort of topic comes up in The Lounge, one member (I can't remember his name at the moment) pops up and avers that Turkish will trip it up. Where is he now that you need him? Anyway, I'd add Turkish to the list of test languages as well.

Henry Minute Do not read medical books! You could die of a misprint. - Mark Twain Girl: (staring) "Why do you need an icy cucumber?" “I want to report a fraud. The government is lying to us all.”

Johann Gerell

Ä: http://en.wikipedia.org/wiki/Ä
Ö: http://en.wikipedia.org/wiki/Ö

-- Time you enjoy wasting is not wasted time - Bertrand Russell

Dan Neely

Maunder posted it to subtle bugs recently... http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
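For anyone who hasn't hit it yet, here is a rough Python sketch of why Turkish trips up case handling. It is only an illustration of the dotted/dotless i issue, not the code from the linked article; the variable names are made up, and it relies on Python's default (locale-independent) Unicode case mapping.

```python
# Turkish has two distinct letter pairs: dotless (ı / I) and dotted (i / İ).
# Python's str.upper()/str.lower() use the default Unicode mappings,
# which do not follow the Turkish convention.
dotless_i = "\u0131"     # ı LATIN SMALL LETTER DOTLESS I
dotted_cap_i = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing ı gives a plain ASCII I ...
upper_of_dotless = dotless_i.upper()

# ... and lowercasing that gives ASCII i, so the round trip loses the letter.
round_trip = upper_of_dotless.lower()

# Lowercasing İ yields TWO code points: i followed by COMBINING DOT ABOVE.
lower_of_dotted_cap = dotted_cap_i.lower()

print(upper_of_dotless, round_trip == dotless_i, len(lower_of_dotted_cap))
```

So any code that assumes upper-then-lower is a no-op, or that case conversion keeps the string length, is worth testing with Turkish input.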

Today's lesson is brought to you by the word "niggardly". Remember kids, don't attribute to racism what can be explained by Scandinavian language roots. -- Robert Royall

Alan Balkany

Another reason to test for multiple languages is that in some languages, some phrases get much longer than you'd expect, and this can throw off the layout of your GUI. If you're using the UTF-16 encoding, be careful about assuming that every character is a constant two bytes: that only holds for characters in the Basic Multilingual Plane, which has most modern languages and the more common Asian characters. Anything outside the BMP takes a surrogate pair (four bytes). So check whether all the Chinese characters you'll need are in the BMP, and if not, test with the ones that aren't.
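A quick way to check the two-bytes assumption, sketched in Python. The helper name `utf16_code_units` and the two sample characters are just choices for illustration.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units the text needs in UTF-16."""
    # "utf-16-le" encodes without a BOM, so the byte count is exactly 2 per unit.
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u4e2d"          # 中, a common Chinese character inside the BMP
non_bmp_char = "\U00020000"  # a CJK Extension B character, outside the BMP

print(utf16_code_units(bmp_char))      # one code unit
print(utf16_code_units(non_bmp_char))  # a surrogate pair: two code units
```

Code that indexes or truncates strings by 16-bit unit can split the second kind of character in half, which is a good thing to cover in a test string.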

pg az

With a hex editor, for example, you can create a Unicode (UTF-16) file without the BOM, and Notepad at least is smart enough to recognize it as little-endian Unicode, since the odd-numbered bytes are uniformly zero. When you do a "Save As" to Unicode from Notepad, it likes to insert the BOM, which seems inelegant to me since "it's not really a character": if you roll your own parsing routines, they need to know to skip over the BOM. I wonder whether, out there in the wide world, real foreign-language files normally have BOMs or not. Offhand I would guess they do, but that could of course be completely wrong.
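For a roll-your-own reader, skipping the BOM could look roughly like this in Python. The function name `decode_utf16_text` is hypothetical, and falling back to little-endian when there is no BOM is an assumption made for this sketch (it matches the Notepad case above, but is not a general rule).

```python
import codecs

def decode_utf16_text(raw):
    """Decode UTF-16 bytes, skipping a leading BOM if one is present."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: ASSUMPTION for this sketch, treat the data as little-endian.
    return raw.decode("utf-16-le")

with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
without_bom = "hi".encode("utf-16-le")

print(decode_utf16_text(with_bom) == decode_utf16_text(without_bom))

# Decoding with a fixed byte order does NOT strip the BOM;
# it comes through as a leading U+FEFF character.
leaked = with_bom.decode("utf-16-le")
```

Python's plain "utf-16" codec consumes the BOM by itself, so the manual check only matters when you pick the byte order yourself.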

pg--az

Member 96

Yeah, something like that. It bit me in the ass years ago: all the tables in our app have names starting with an A to be distinctive, and the queries threw errors on some computers. We traced it down to table names starting with AA being interpreted as that A with the dots above it.

"Creating your own blog is about as easy as creating your own urine, and you're about as likely to find someone else interested in it." -- Lore Sjöberg

Fabio Franco

Tad McClellan wrote:

That's what happens when non-technical people start making decisions

Oh boy, I know the feeling. And I hate it. Recently, when this special non-techy manager started making some senseless decisions, I struggled not to scream at him: "Why don't you build the f#@$@% system yourself then?" :mad:

Trevortni

I guess my question would be whether it's supposed to support all these different languages, or just be Unicode-compliant. If it's supposed to support different languages, it should be tested under all the languages it's supposed to support (which would actually be a translation issue, not a programming issue); otherwise, well, you already made your point.

bjarneds

You are probably thinking of A with a ring above it (not dots, a.k.a. umlaut): http://en.wikipedia.org/wiki/Å. The A with a ring is a different character, one that is used in several Danish words. The old spelling of these words used the double AA instead of an A with a ring, and many names still use the double AA. Note that this letter (no matter if it is written as an A with a ring or a double AA) is the last character in the Danish alphabet. This means that the result of sorting the strings "AA" and "BB" depends on the current culture. Of course, you shouldn't be required to know details like this when you are coding. Instead, you should assume nothing when it comes to cultures, characters, spelling etc. I think the MSDN article Writing Culture-Safe Managed Code (http://msdn.microsoft.com/en-us/library/ms994325.aspx) may have a few surprises for most developers. So in my opinion, testing with different characters (and cultures) does make sense. Not only to make sure an application is Unicode compliant, but more importantly to catch some of the incorrect assumptions developers make about cultures etc.
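To make the "AA" versus "BB" point concrete, here is a toy Python sketch. It is NOT a real Danish collation: `danish_key` is a made-up helper that only knows the alphabet order A-Z, Æ, Ø, Å and blindly treats every "AA" as "Å". In real code you would let the platform's culture-aware comparison do this.

```python
# Simplified Danish alphabet order: A-Z followed by Æ, Ø, Å.
DANISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

def danish_key(text):
    """Toy sort key: treat 'AA' as 'Å' and rank letters in Danish order."""
    normalized = text.upper().replace("AA", "Å")
    # Characters outside the toy alphabet sort after everything else.
    return [RANK.get(ch, len(DANISH_ALPHABET)) for ch in normalized]

words = ["BB", "AA", "Å", "AB"]

ordinal_order = sorted(words)                 # plain code point comparison
danish_order = sorted(words, key=danish_key)  # toy Danish ordering

print(ordinal_order)
print(danish_order)
```

With code point ordering "AA" sorts before "BB"; with the toy Danish key it sorts after it, which is exactly the kind of surprise that only shows up when you test under another culture.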

bjarneds

pg--az wrote:

"it's not really a character"

Actually, the BOM (byte-order mark) is a real character, known as "zero-width no-break space". This is a good choice, because it does no harm to programs that just need to display the content, even if they don't skip over it (zero-width = invisible, no-break = no undesired wrapping behaviour).
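You can check the character's name yourself; a small Python sketch follows (the sample string "name" is arbitrary). It also shows where an unskipped BOM does cause harm: it is invisible on screen, but not in comparisons.

```python
import unicodedata

bom = "\ufeff"

# The official Unicode name of U+FEFF.
bom_name = unicodedata.name(bom)
print(bom_name)

# Invisible when displayed, but still part of the string:
field = bom + "name"
matches_raw = (field == "name")                     # comparison fails
matches_stripped = (field.lstrip(bom) == "name")    # fine once it is removed
print(matches_raw, matches_stripped)
```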

Member 96

I agree. My code is fine; I've always adhered to Unicode standards. However, this problem was in the FireBird SQL drivers. I worked around it by ensuring that all my dynamic SQL had double CAPITAL A's.

"Creating your own blog is about as easy as creating your own urine, and you're about as likely to find someone else interested in it." -- Lore Sjöberg

petersgyoung

There are cases where Chinese characters work but English characters do not. A Chinese name can safely be represented in 4 Unicode characters, but an English name cannot be represented in 4 characters. Sometimes you need to force the user to enter ASCII in a certain field, e.g. Product Code. Please see my blog for how to test whether the user is entering ASCII or Unicode.
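A minimal sketch of such an ASCII check in Python (this is not the code from the blog; the function name `is_ascii_only` and the sample product codes are invented for illustration, and `str.isascii()` needs Python 3.7 or later):

```python
def is_ascii_only(value):
    """Return True if every character in value is in the ASCII range."""
    return value.isascii()

plain_code = "ABC-123"
# Fullwidth letters look similar on screen but are not ASCII.
fullwidth_code = "\uff21\uff22\uff23-123"
chinese_text = "\u7522\u54c1"  # 產品

for candidate in (plain_code, fullwidth_code, chinese_text):
    print(repr(candidate), is_ascii_only(candidate))
```

The fullwidth case is the one worth testing, since users with an East Asian input method can type it without noticing.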

petersgyoung