Decoding GBK, .NET 6 edition

Lost User

I have some CSV files encoded in GBK, aka code page 936, and need to load them as strings (or something sufficiently string-like, whatever) for further processing. In the old days, I could call some function such as `File.ReadAllText` (or read the file line by line, etc.) and specify CP936 as the encoding. But in .NET 6, I can't. The only valid options are ASCII, Latin 1, and a couple of flavours of UTF. That sounds unlikely, right? But [here](https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=net-6.0) is the documentation for the Encoding class, and in the big table halfway down the page, you can see that *almost everything is gone*. Almost as if the thinking is now "people should just use UTF-8 or UTF-16 nowadays". If it were up to me, those files would be encoded in UTF-8, but they're just not.

So, right now what I do is this, assuming that I've read the file into an array `byte[] raw` and `int size` bytes were successfully read into it:

char[] buffer = new char[size];
int numberOfChars;
// Pin both arrays so the native call can use raw pointers (requires an unsafe context).
fixed (char* bufferptr = buffer)
fixed (byte* rawptr = raw)
    numberOfChars = MultiByteToWideChar(936, 0, rawptr, size, bufferptr, buffer.Length);

This calls [MultiByteToWideChar](https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar) via a `DllImport`. Afterwards I can use `numberOfChars` to create a `Span` of the appropriate length. That works, but it seems like a serious step backwards compared to .NET 4. Also, there seems to be no way (no reasonable way, anyway) to read/convert *chunks* of the file this way, as `MultiByteToWideChar` does not report the leftover bytes at the end of the chunk. Are there any better options?

lmoelleb

Encoding.GetEncoding Method (System.Text) | Microsoft Docs - I do not see any changes from .NET 4.8?

Richard Deeming

As far as I can see, code page 936 is called "gb2312" in .NET Framework, but .NET 6 doesn't seem to know about it - as you said, the Encoding.GetEncodings method only lists seven options. It looks like you need to register the code pages provider from the System.Text.Encoding.CodePages package, which seems to be included in .NET 6 by default:

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding gb2312 = Encoding.GetEncoding(936); // Chinese Simplified (GB2312)
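
For what it's worth, here is a minimal sketch of how the registered encoding might then be used, including the chunk-wise case raised in the question. The sample bytes, the split position, and the variable names are made up for illustration; the chunked part relies on the `Decoder` returned by `GetDecoder()` keeping an incomplete double-byte sequence between calls, so it is worth verifying against your own files.

```csharp
using System;
using System.IO;
using System.Text;

// Register the code page provider once, before the first lookup.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding gbk = Encoding.GetEncoding(936);

// Hypothetical sample data: two Chinese characters followed by ",CSV", encoded as GBK.
byte[] raw = { 0xC4, 0xE3, 0xBA, 0xC3, 0x2C, 0x43, 0x53, 0x56 };

// 1. Whole buffer at once. For a file on disk, File.ReadAllText(path, gbk)
//    should give the same result.
string text = gbk.GetString(raw);

// 2. StreamReader decodes incrementally and handles chunk boundaries itself.
string viaReader;
using (var reader = new StreamReader(new MemoryStream(raw), gbk))
{
    viaReader = reader.ReadToEnd();
}

// 3. Manual chunks: the split at byte 3 falls in the middle of the second
//    character. The decoder holds on to the dangling lead byte until the
//    next call supplies the trail byte.
Decoder decoder = gbk.GetDecoder();
char[] chars = new char[gbk.GetMaxCharCount(raw.Length)];
var chunked = new StringBuilder();
int split = 3;

int n = decoder.GetChars(raw, 0, split, chars, 0, flush: false);
chunked.Append(chars, 0, n);

n = decoder.GetChars(raw, split, raw.Length - split, chars, 0, flush: true);
chunked.Append(chars, 0, n);

// All three approaches should agree.
Console.WriteLine(text == viaReader);
Console.WriteLine(text == chunked.ToString());
```

In most cases the `StreamReader` route is probably the simplest; the explicit `Decoder` is only needed if you manage the byte buffers yourself.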

lmoelleb

Oops, I misread the docs, so yes - it has changed, and CodePagesEncodingProvider.Instance Property (System.Text) | Microsoft Docs seems to be the best option.

Lost User

Good, that works and isn't weird