Decoding GBK, .NET 6 edition
-
I have some CSV files encoded in GBK, aka codepage 936, and need to load them as strings (or something sufficiently string-like, whatever) for further processing. In the old days, I could call some function such as `File.ReadAllText` (or read the file line by line etc.) and specify CP936 as the encoding. But in .NET 6, I can't. The only valid options are ASCII, Latin 1, and a couple of flavours of UTF. That sounds unlikely, right? But [here](https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=net-6.0) is the documentation for the Encoding class, and in the big table halfway down the page, you can see that *almost everything is gone*. Almost as if the thinking is now "people should just use UTF-8 or UTF-16 nowadays". If it were up to me, those files would be encoded in UTF-8, but they're just not.

So, right now what I do is this, assuming that I've read the file into an array `byte[] raw` and `int size` bytes were successfully read into it:
char[] buffer = new char[size];
int numberOfChars;
fixed (char* bufferptr = buffer)
fixed (byte* rawptr = raw)
    numberOfChars = MultiByteToWideChar(936, 0, rawptr, size, bufferptr, buffer.Length);
This calls [MultiByteToWideChar](https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar) via a DllImport. Afterwards I can use `numberOfChars` to create a `Span` of the appropriate length. That works, but it seems like a serious step backwards compared to .NET 4. Also there seems to be no way (no reasonable way anyway) to read/convert *chunks* of the file this way, as `MultiByteToWideChar` does not report the leftover bytes at the end of the chunk. Are there any better options?
As far as I can see, code page 936 is called "gb2312" in .NET Framework, but .NET 6 doesn't seem to know about it out of the box. As you said, the `Encoding.GetEncodings` method only lists seven options. It looks like you need to register the code pages provider from the `System.Text.Encoding.CodePages` package, which seems to be included in .NET 6 by default:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding gb2312 = Encoding.GetEncoding(936); // Chinese Simplified (GB2312)
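Once the provider is registered, the chunked-reading concern from the question can probably be handled with a `Decoder` obtained from the encoding, since a decoder keeps any incomplete trailing byte sequence between calls. The following is only a minimal sketch under that assumption: `DecodeInChunks`, the sample text, and the chunk size of 3 are hypothetical and chosen so that a double-byte character gets split across two chunks.

```csharp
using System;
using System.IO;
using System.Text;

// Register the code pages provider once at startup so GetEncoding(936) succeeds.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding gbk = Encoding.GetEncoding(936);

// Hypothetical sample data: encode a short CSV-like line to code page 936 bytes.
string original = "中文,abc";
byte[] raw = gbk.GetBytes(original);

// Decode through a deliberately tiny chunk size so that a double-byte
// character straddles a chunk boundary.
using var stream = new MemoryStream(raw);
string decoded = DecodeInChunks(stream, gbk, chunkSize: 3);

// Compare the chunked result with the original text.
Console.WriteLine(decoded == original);

// Hypothetical helper: reads the stream chunk by chunk and decodes each chunk
// with a single stateful Decoder instance.
static string DecodeInChunks(Stream input, Encoding encoding, int chunkSize)
{
    Decoder decoder = encoding.GetDecoder();
    byte[] bytes = new byte[chunkSize];
    // Worst-case number of chars that one chunk of bytes can produce.
    char[] chars = new char[encoding.GetMaxCharCount(chunkSize)];
    var result = new StringBuilder();

    int read;
    while ((read = input.Read(bytes, 0, bytes.Length)) > 0)
    {
        // flush: false tells the decoder to hold on to an incomplete
        // trailing sequence and finish it with the next chunk.
        int count = decoder.GetChars(bytes, 0, read, chars, 0, flush: false);
        result.Append(chars, 0, count);
    }
    return result.ToString();
}
```

If the file does not need custom chunk handling, passing the same `Encoding` to `File.ReadAllText(path, gbk)` or to `new StreamReader(path, gbk)` should be enough, as `StreamReader` does its own buffered decoding.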
-
Encoding.GetEncoding Method (System.Text) | Microsoft Docs
I do not see any changes from .NET 4.8?