Hebrew chars returned by Directory.GetFiles problem

impeham

I'm having the following issue: I am using "Directory.GetFiles" to retrieve all file names from a path. Some of the file names in that path contain Hebrew characters. When I inspect such a file name string in the debugger, I can see that the Hebrew characters are not REAL chars (they are numbers above 1000 - how can that be for a char?). This causes a problem when I write the string to a file: the characters look weird when I open the file later in a text editor, probably because each Hebrew character is written as 2 bytes instead of just one. How can this be solved? Thanks.

Luc Pattyn

Hi, you can specify which character encoding should be used when writing text to a file. If the text ends up being written as ASCII, non-ASCII characters cannot be represented and get replaced (typically with '?'); for a very different script such as Hebrew, no sensible mapping onto ASCII exists. You really want a file encoding that can hold these characters. One way of doing this is by using a StreamWriter: one of its constructor overloads takes an Encoding object, and you should consider Encoding.Unicode (UTF-16) or Encoding.UTF8. BTW: your Hebrew characters are real characters. If Visual Studio shows them as numbers, those are their Unicode code point values, which you can still read if you're unfamiliar with the script and which you can paste into an ASCII file. A source file that contains only ASCII takes one byte per character; as soon as you paste a non-ASCII character into a string literal or so, the file has to be saved as UTF-8 or UTF-16, and tools that expect plain ASCII may no longer read it correctly. :)
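As a rough illustration of the StreamWriter suggestion above, here is a minimal sketch. The class name FileNameWriter, the helper WriteFileNames, and the default output name "filenames.txt" are made up for this example and are not from the original question.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileNameWriter
{
    // Hypothetical helper: writes the name of every file in sourceDir
    // to outputFile, one per line, using an explicit encoding.
    public static void WriteFileNames(string sourceDir, string outputFile)
    {
        string[] files = Directory.GetFiles(sourceDir);

        // 'false' means overwrite instead of append.
        // Encoding.UTF8 also writes a byte order mark at the start of the file.
        using (StreamWriter writer = new StreamWriter(outputFile, false, Encoding.UTF8))
        {
            foreach (string file in files)
            {
                writer.WriteLine(Path.GetFileName(file));
            }
        }
    }

    public static void Main(string[] args)
    {
        // Assumed defaults, for illustration only.
        string sourceDir = args.Length > 0 ? args[0] : ".";
        string outputFile = args.Length > 1 ? args[1] : "filenames.txt";
        WriteFileNames(sourceDir, outputFile);
    }
}
```

Swapping Encoding.UTF8 for Encoding.Unicode would produce a UTF-16 file instead; which one is preferable depends on what the program reading the file expects.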

impeham

Well - using UTF8 with the StreamWriter did the job! Man - thanks a lot! :)

Mike Dimmick

impeham wrote:

they are numbers above 1000 - how can that be for char?

You're clearly an ex-C++ programmer. char in C# is not a byte-size quantity as it is in C++; it represents a single UTF-16 code unit (i.e. it is a 16-bit unsigned value, the same size as ushort). All strings in the .NET Framework are Unicode internally, using UTF-16. Hebrew characters fall in a block between U+0590 and U+05FF, with alef encoded at U+05D0 = 1488. The default encoding for .NET StreamWriter objects is UTF-8. Alef does indeed turn into two bytes in the output, 0xD7 0x90. If you want to use a different encoding, for example Windows codepage 1255 for Hebrew, you need to create a suitable Encoding object and pass it to the StreamWriter's constructor.
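The numbers above can be checked with a small sketch like the following. The class name AlefBytes and the Hex helper are made up for this example.

```csharp
using System;
using System.Text;

public static class AlefBytes
{
    // Hypothetical helper: formats bytes as hex pairs separated by dashes.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        char alef = '\u05D0'; // Hebrew letter alef
        string text = alef.ToString();

        // A C# char is a 16-bit UTF-16 code unit, so its value can exceed 255.
        Console.WriteLine((int)alef);                            // 1488

        // UTF-8 needs two bytes for this character.
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(text)));    // D7-90

        // UTF-16 little-endian (Encoding.Unicode) also needs two bytes.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(text))); // D0-05

        // ASCII cannot represent alef; it is replaced with '?' (0x3F).
        Console.WriteLine(Hex(Encoding.ASCII.GetBytes(text)));   // 3F
    }
}
```

For codepage 1255, Encoding.GetEncoding(1255) is the usual way to obtain the Encoding object on the .NET Framework. On .NET Core and later, this assumes the code pages encoding provider has been registered first.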
