Either I'm missing something or .NET is

    honey the codewitch
    wrote on last edited by
    #1

    So .NET uses 16-bit Unicode (UTF-16) by default, and with the territory comes the possibility of 32-bit code points embedded as surrogate pairs. That's all well and good, even if it complicates things. However, .NET provides no functions for operating on those 32-bit values. No char.IsWhiteSpace(). No GetUnicodeCategory(), nada. So what the hell are you supposed to do with these values? *angery*

    Real programmers use butterflies

      Greg Utas
      wrote on last edited by
      #2

      Telling that there wasn't one for Germany. DoneMark is very close to how it's actually pronounced natively (in Norwegian and Swedish, at any rate; Danish has been referred to as a throat disease rather than a language).

      Robust Services Core: https://github.com/GregUtas/robust-services-core/blob/master/README.md
      The fox knows many things, but the hedgehog knows one big thing.

        Richard Deeming
        wrote on last edited by
        #3

        I don't know enough about UTF-32, but what about:
        Char.IsHighSurrogate Method (System) | Microsoft Docs
        Char.ConvertToUtf32 Method (System) | Microsoft Docs
        Char.ConvertFromUtf32(Int32) Method (System) | Microsoft Docs
        Or maybe Jon Skeet's article will have something useful: Unicode and .NET
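For what it's worth, here is a minimal sketch of how those three methods fit together. The code point U+1F600 is just an arbitrary test value above U+FFFF, and the class and helper names (`SurrogateRoundTrip`, `Encode`, `Decode`) are made up for illustration.

```csharp
using System;

static class SurrogateRoundTrip
{
    // Hypothetical helper: encode one code point as a UTF-16 string.
    public static string Encode(int codePoint) => char.ConvertFromUtf32(codePoint);

    // Hypothetical helper: decode the code point that starts at the given index.
    public static int Decode(string s, int index) => char.ConvertToUtf32(s, index);

    static void Main()
    {
        int codePoint = 0x1F600;          // above U+FFFF, so it needs a surrogate pair
        string s = Encode(codePoint);

        Console.WriteLine(s.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));   // True
        Console.WriteLine(Decode(s, 0) == codePoint);   // True
    }
}
```

So the int and the two-char string are the same value in two representations, and converting between them is lossless.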


        "These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer

          OriginalGriff
          wrote on last edited by
          #4

          Um ... Char.IsWhiteSpace Method (System) | Microsoft Docs and Char.GetUnicodeCategory Method (System) | Microsoft Docs. Or am I missing something?

          "I have no idea what I did, but I'm taking full credit for it." - ThisOldTony AntiTwitter: @DalekDave is now a follower!
          "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt

            Lost User
            wrote on last edited by
            #5

            What about Char.IsWhiteSpace Method (System) | Microsoft Docs?

            It does not solve my Problem, but it answers my question

              honey the codewitch
              wrote on last edited by
              #6

              That only works for 16-bit Unicode values, not for these 32-bit code points encoded as surrogate pairs.

                honey the codewitch
                wrote on last edited by
                #7

                Okay yes, but once I have that converted (either to a 32-bit int or a double-char string), I can't do anything with it. I can't call char.IsWhiteSpace with it. I can't do anything but print its value. Which is stupid.

                  Richard Deeming
                  wrote on last edited by
                  #8

                  ASCII FTW! 127 characters should be enough for anyone! :-D I'm not sure there are any whitespace characters that would be encoded as a surrogate pair: Whitespace character - Wikipedia
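One way to check that hunch rather than trust a table: a rough sketch that walks every code point above U+FFFF and counts those in the three separator categories. It only looks at separator categories (control characters such as tab all sit below U+0080 anyway), the answer depends on the Unicode data shipped with your runtime, and the names `SupplementaryWhitespaceScan` and `CountSeparatorsAboveBmp` are invented for this example. As far as I know it should find none.

```csharp
using System;
using System.Globalization;

static class SupplementaryWhitespaceScan
{
    // Counts code points in U+10000..U+10FFFF whose category is a separator
    // (SpaceSeparator, LineSeparator or ParagraphSeparator).
    public static int CountSeparatorsAboveBmp()
    {
        int count = 0;
        for (int cp = 0x10000; cp <= 0x10FFFF; cp++)
        {
            // Every value in this range becomes a two-char surrogate pair.
            string s = char.ConvertFromUtf32(cp);

            // The (string, int) overload classifies the whole pair, not just s[0].
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(s, 0);

            if (cat == UnicodeCategory.SpaceSeparator ||
                cat == UnicodeCategory.LineSeparator ||
                cat == UnicodeCategory.ParagraphSeparator)
            {
                count++;
            }
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountSeparatorsAboveBmp());
    }
}
```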


                    RugbyLeague
                    wrote on last edited by
                    #9

                    "So what the hell are you supposed to do with these values" hope they never appear.

                      RugbyLeague
                      wrote on last edited by
                      #10

                      And EBCDIC - everybody always forgets EBCDIC :sigh:

                        honey the codewitch
                        wrote on last edited by
                        #11

                        Pretty much! :thumbsup:

                          honey the codewitch
                          wrote on last edited by
                          #12

                          I've encountered a weird one, but I don't remember whether it was a surrogate or not. I just remember it didn't print to the console properly (it cooked it), or didn't save to source as a literal value, or something. It came up as whitespace, but only when I used these huge "not ranges" like [^a-z] (anything but a lowercase letter). That "anything" part created ranges all throughout the 16-bit Unicode spectrum, and that's when I ran into issues with one whitespace character.

                            Greg Utas
                            wrote on last edited by
                            #13

                            And sixbit on DEC's PDP systems!

                              Dr Walt Fair PE
                              wrote on last edited by
                              #14

                              Actually Baudot is sufficient at 5 bits. CQ de W5ALT

                              Walt Fair, Jr.PhD P. E. Comport Computing Specializing in Technical Engineering Software

                                BillWoodruff
                                wrote on last edited by
                                #15

                                Idea: research using Unicode categories in your RegEx?

                                «One day it will have to be officially admitted that what we have christened reality is an even greater illusion than the world of dreams.» Salvador Dali

                                  honey the codewitch
                                  wrote on last edited by
                                  #16

                                  I am using those. The category for surrogates is Surrogate, which is not helpful. Combining a high and a low surrogate gives you a 2-char string, and that 2-char string cannot be queried for its Unicode category in .NET, AFAIK.

                                    BillWoodruff
                                    wrote on last edited by
                                    #17

                                    honey the codewitch wrote:

                                    The 2 char string cannot be queried for its unicode category in .NET AFAIK

                                    It is a mess, but, check this against what you expect, now:

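                                    // Needs: using System; using System.Globalization;
                                    // and using System.Windows.Forms; (for the Keys enum)
                                    public void PrintUniCodeRange(int sc, int ec)
                                    {
                                        string key = "";

                                        for (int i = sc; i <= ec; i++)
                                        {
                                            // Surrogate code values are not code points on their own;
                                            // ConvertFromUtf32 throws on them, so skip that block.
                                            if (i >= 0xD800 && i <= 0xDFFF) continue;

                                            // One char for the BMP, a two-char surrogate pair above U+FFFF.
                                            string ucString = char.ConvertFromUtf32(i);

                                            // Only values below 256 are looked up in the Keys enum.
                                            bool isKey = i < 256;
                                            if (isKey) key = ((Keys)Enum.Parse(typeof(Keys), i.ToString())).ToString();

                                            // The (string, int) overload reads a whole surrogate pair at index 0.
                                            UnicodeCategory cat = Char.GetUnicodeCategory(ucString, 0);

                                            if (cat != UnicodeCategory.OtherNotAssigned)
                                            {
                                                Console.WriteLine($"#{i} | Unicode Category: {cat} {(isKey ? "! Keys Enum: " + key : "")}");
                                            }
                                        }
                                    }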

                                    Calling the above with sc = 8192 and ec = 8233:

                                    #8192 | Unicode Category: SpaceSeparator
                                    #8193 | Unicode Category: SpaceSeparator
                                    #8194 | Unicode Category: SpaceSeparator
                                    #8195 | Unicode Category: SpaceSeparator
                                    #8196 | Unicode Category: SpaceSeparator
                                    #8197 | Unicode Category: SpaceSeparator
                                    #8198 | Unicode Category: SpaceSeparator
                                    #8199 | Unicode Category: SpaceSeparator
                                    #8200 | Unicode Category: SpaceSeparator
                                    #8201 | Unicode Category: SpaceSeparator
                                    #8202 | Unicode Category: SpaceSeparator
                                    #8203 | Unicode Category: Format
                                    #8204 | Unicode Category: Format
                                    #8205 | Unicode Category: Format
                                    #8206 | Unicode Category: Format
                                    #8207 | Unicode Category: Format
                                    #8208 | Unicode Category: DashPunctuation
                                    #8209 | Unicode Category: DashPunctuation
                                    #8210 | Unicode Category: DashPunctuation
                                    #8211 | Unicode Category: DashPunctuation
                                    #8212 | Unicode Category: DashPunctuation
                                    #8213 | Unicode Category: DashPunctuation
                                    #8214 | Unicode Category: OtherPunctuation
                                    #8215 | Unicode Category: OtherPunctuation
                                    #8216 | Unicode Category: InitialQuotePunctuation
                                    #8217 | Unicode Category: FinalQuotePunctuation
                                    #8218 | Unicode Category: OpenPunctuation
                                    #8219 | Unicode Category: InitialQuotePunctuation
                                    #8220 | Unicode Category: InitialQuotePunctuation
                                    #8221 | Unicode Category: FinalQuotePunctuation
                                    #8222 | Unicode Category: OpenPunctuation
                                    #8223 | Unicode Category: InitialQuotePunctuation
                                    #8224 | Unicode Category: OtherPunctuation
                                    #8225 | Unicode Category: OtherPunctuation
                                    #8226 | Unicode Category: OtherPunctuation
                                    #8227 | Unicode Category: OtherPunctuation
                                    #8228 | Unicode Category: OtherPunctuation
                                    #8229 | Unicode Category: OtherPunctuation
                                    #8230 | Unicode Category: Othe
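The 8192 to 8233 run stays inside the BMP, so every one of those code points is a single char. A small sketch of the same call on a code point that really does need a surrogate pair, assuming U+1D400 (MATHEMATICAL BOLD CAPITAL A) as the test value; the names `SupplementaryCategory` and `CategoryOf` are only illustrative.

```csharp
using System;
using System.Globalization;

static class SupplementaryCategory
{
    // Hypothetical helper: category of any code point, BMP or not.
    public static UnicodeCategory CategoryOf(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        // Same overload as in the post above: (string, int), starting at index 0.
        return char.GetUnicodeCategory(s, 0);
    }

    static void Main()
    {
        string pair = char.ConvertFromUtf32(0x1D400);

        // Asking about the pair as a whole gives the real category.
        Console.WriteLine(CategoryOf(0x1D400));                 // UppercaseLetter

        // Asking about one half of the pair only tells you it is a surrogate.
        Console.WriteLine(char.GetUnicodeCategory(pair[0]));    // Surrogate
    }
}
```

The difference between the two lines is just which overload gets called: the single-char one never sees the second half of the pair.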

                                      honey the codewitch
                                      wrote on last edited by
                                      #18

                                      Hmm, I wonder what my test was doing wrong, because I thought GetUnicodeCategory(string, int) was only returning single-char values for me. Maybe I had a bug.

                                        honey the codewitch
                                        wrote on last edited by
                                        #19

                                        Thank you! Turns out there was a bug in my code where I wasn't passing double-char strings in. They ended up as single chars.
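One way to avoid splitting a pair by accident is to walk the string by code point instead of by char. This is only a sketch of that idea, not the code from the project discussed here; `CodePointWalker` and `CodePoints` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class CodePointWalker
{
    // Yields one int per code point, consuming both chars of a surrogate pair together.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                yield return char.ConvertToUtf32(s, i);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // A BMP char, or a lone surrogate passed through as-is.
                yield return s[i];
            }
        }
    }

    static void Main()
    {
        string text = "a" + char.ConvertFromUtf32(0x1D400) + " b";

        // Prints U+0061, U+1D400, U+0020, U+0062 (one per line).
        foreach (int cp in CodePoints(text))
        {
            Console.WriteLine($"U+{cp:X4}");
        }
    }
}
```

The string has five chars but four code points, which is exactly the mismatch that bites when the two halves get handled separately.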

                                          User 2893688
                                          wrote on last edited by
                                          #20

                                          Will someone please think about the children. https://tenor.com/FJmS.gif
