Decoding GBK, .NET 6 edition

    Lost User wrote (#1):

    I have some CSV files encoded in GBK, aka code page 936, and need to load them as strings (or something sufficiently string-like, whatever) for further processing. In the old days, I could call a function such as `File.ReadAllText` (or read the file line by line, etc.) and specify CP936 as the encoding. But in .NET 6, I can't. The only valid options are ASCII, Latin 1, and a couple of flavours of UTF. That sounds unlikely, right? But [here](https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=net-6.0) is the documentation for the Encoding class, and in the big table halfway down the page, you can see that *almost everything is gone*. Almost as if the thinking is now "people should just use UTF-8 or UTF-16 nowadays". If it were up to me, those files would be encoded in UTF-8, but they're just not. So, right now what I do is this, assuming that I've read the file into an array `byte[] raw` and that `int size` bytes were successfully read into it:

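    // Needs an unsafe context and a DllImport declaration for MultiByteToWideChar.
    char[] buffer = new char[size];
    int numberOfChars;
    fixed (char* bufferptr = buffer)
    fixed (byte* rawptr = raw)
    {
        // 936 is the code page; `size` is the number of valid bytes in `raw`.
        numberOfChars = MultiByteToWideChar(936, 0, rawptr, size, bufferptr, buffer.Length);
    }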

    Here [MultiByteToWideChar](https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar) is called via a `DllImport`. Afterwards I can use `numberOfChars` to create a `Span` of the appropriate length. That works, but it seems like a serious step backwards compared to .NET 4. Also, there seems to be no reasonable way to read and convert the file in *chunks* like this, as `MultiByteToWideChar` does not report the leftover bytes at the end of a chunk. Are there any better options?

      lmoelleb wrote (#2):

      See Encoding.GetEncoding Method (System.Text) on Microsoft Docs. I do not see any changes from .NET 4.8?

        Richard Deeming wrote (#3):

        As far as I can see, code page 936 is called "gb2312" in .NET Framework, but .NET 6 doesn't seem to know about it out of the box; as you said, the `Encoding.GetEncodings` method only lists seven options. It looks like you need to register the code pages provider from the System.Text.Encoding.CodePages package, which seems to be included in .NET 6 by default:

        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding(936); // Chinese Simplified (GB2312)
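
To tie this back to the original question, here is a minimal sketch of how the registered encoding might be used both for whole-file reads and for decoding in chunks. It assumes the provider is available without an extra package reference, as suggested above. The class name `GbkDemo`, the helper `DecodeInChunks`, and the temp file `gbk-sample.csv` are made up for illustration. The point of the helper is that a `Decoder` obtained from the encoding keeps a trailing lead byte between calls, which is the state `MultiByteToWideChar` does not give you.

```csharp
using System;
using System.IO;
using System.Text;

static class GbkDemo
{
    // Hypothetical helper: decodes `data` in fixed-size chunks.
    // The Decoder remembers an incomplete double-byte character between calls.
    public static string DecodeInChunks(byte[] data, int chunkSize, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        var result = new StringBuilder();
        // One extra byte of headroom for a lead byte carried over from the previous chunk.
        char[] chars = new char[encoding.GetMaxCharCount(chunkSize + 1)];

        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            bool isLast = offset + count >= data.Length;
            // flush: false keeps leftover bytes for the next call; flush on the final chunk.
            int n = decoder.GetChars(data, offset, count, chars, 0, isLast);
            result.Append(chars, 0, n);
        }
        return result.ToString();
    }

    static void Main()
    {
        // Register once at startup, before the first GetEncoding(936) call.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gbk = Encoding.GetEncoding(936);

        // Hypothetical sample file so the sketch is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "gbk-sample.csv");
        File.WriteAllBytes(path, gbk.GetBytes("中文,1\n"));

        // Whole file at once, as in the .NET Framework days.
        string all = File.ReadAllText(path, gbk);
        Console.WriteLine(all.Length);

        // Line by line; StreamReader uses a stateful decoder internally.
        using (var reader = new StreamReader(path, gbk))
        {
            string? line;
            while ((line = reader.ReadLine()) != null)
                Console.WriteLine(line);
        }

        // Manual chunks: a chunk size of 3 splits the second character in half.
        Console.WriteLine(DecodeInChunks(File.ReadAllBytes(path), 3, gbk));
    }
}
```

For ordinary file reading, `StreamReader` or `File.ReadAllText` is probably all you need; the manual `Decoder` loop only matters if you want to control the buffers yourself.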


        "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

          lmoelleb wrote (#4):

          Oops, I misread the docs. So yes, it has changed, and the CodePagesEncodingProvider.Instance property (System.Text, Microsoft Docs) seems to be the best option.

            Lost User wrote (#5):

            Good, that works and isn't weird
