If you can't match UTF32 codepoints, you don't support unicode!

    honey the codewitch
    wrote on last edited by
    #1

    *sigh* - (half serious here) what even is the point of UTF16? You can still run into surrogate characters, which means if you want to support unicode streams you have to handle that possibility anyway. MSSQL supports UTF16, but not UTF32. Consequently, here's me fetching the next character of an NVARCHAR(x) or NTEXT stream, with UTF32 support. Forgive the grotty code, as there are no unsigned data types and no bit shifts, etc.

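    DECLARE @valueEnd INT = DATALENGTH(@value)/2+1 -- one past the last UTF-16 code unit
    DECLARE @index INT = 1
    DECLARE @ch BIGINT
    DECLARE @tch BIGINT
    ...
    SET @ch = UNICODE(SUBSTRING(@value,@index,1))
    -- surrogate test: 0xd800..0xdfff maps to 0..2047
    SET @tch = @ch - 0xd800
    IF @tch < 0 SET @tch = @tch + 2147483648 -- emulate unsigned wraparound
    IF @tch < 2048
    BEGIN
        SET @ch = @ch * 1024 -- stand-in for a left shift by 10
        SET @index = @index + 1
        IF @index >= @valueEnd RETURN -1 -- error: stream ends after the first surrogate
        -- 0x35fdc00 = (0xd800 * 1024) + 0xdc00 - 0x10000
        SET @ch = @ch + UNICODE(SUBSTRING(@value,@index,1)) - 0x35fdc00
    END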

    This is hateful and slow. I haven't even tested it yet. I know it doesn't gracefully handle any and all invalid unicode streams, but making it do that is even worse. To heck with this. Why UTF16 with no functions to convert a surrogate pair to UTF32? It's ridic. Someone asked me the other day why I don't like Microsoft SQL Server. Here's reason #1359. Bad MSSQL! BAD!
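
Since the snippet is untested, the surrogate arithmetic can at least be sanity-checked outside SQL Server. The following is a minimal sketch in Python, not T-SQL, and not the poster's code: the names `MAGIC` and `decode_pair` are made up for illustration. It also adds the lead/trail range checks that the post says it skips, returning -1 for an invalid pair in the same spirit as the `RETURN -1` above.

```python
# Hypothetical helper names; this only mirrors the arithmetic from the post.

# (0xD800 << 10) + 0xDC00 - 0x10000, i.e. the 0x35fdc00 constant in the post
MAGIC = (0xD800 << 10) + 0xDC00 - 0x10000


def decode_pair(hi: int, lo: int) -> int:
    """Combine two UTF-16 code units into one code point, or return -1."""
    if not (0xD800 <= hi <= 0xDBFF):
        return -1  # first unit is not a lead (high) surrogate
    if not (0xDC00 <= lo <= 0xDFFF):
        return -1  # second unit is not a trail (low) surrogate
    # Same shape as the T-SQL: multiply by 1024 instead of shifting left by 10
    return hi * 1024 + lo - MAGIC


if __name__ == "__main__":
    # U+1F600 is encoded in UTF-16 as the pair D83D DE00
    print(hex(decode_pair(0xD83D, 0xDE00)))
```

The posted check `@tch < 2048` accepts anything in 0xd800..0xdfff as the first unit, so a stray trail surrogate would go through the same path; the two range checks in the sketch are one way to reject that case.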

    Real programmers use butterflies

      Lost User
      wrote on last edited by
      #2

      Yep, it is something of a compromise. UTF16 is usually enough, but we also need to handle the unusual :) Surrogate characters are only one beast. Accented characters such as 'é', which can be expressed in different ways, are another. OK, here at least we have the possibility to normalize them (https://stackoverflow.com/questions/7811976/normalize-unicode-string-in-sql-server). :^)
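
To make the 'é' point concrete, here is a small sketch using Python's standard `unicodedata` module rather than SQL Server. The helper name `same_text` is an assumption made up for illustration. It shows the same visible character spelled as one code point and as a base letter plus a combining accent, and how normalizing both sides makes them compare equal.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed == decomposed)           # raw comparison of code points
    print(same_text(composed, decomposed))  # comparison after normalization
```

A matcher that works purely on code points sees these as different sequences unless one of the two forms is chosen up front.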

        honey the codewitch
        wrote on last edited by
        #3

        Isn't an issue for me, as this is a sqlized regex engine. All matching conditions are specified using UTF32 stored as BigInt values (since SQL has no unsigned ints). All incoming characters are resolved using the code I posted. Accent marks will be matched properly. Displaying them is firmly in "Somebody else's problem" territory.

        Real programmers use butterflies

          Dan Neely
          wrote on last edited by
          #4

          UTF-16 is the result of developers underestimating how many characters would be needed for a universal encoding and lingers like the stench of a soiled diaper because MS rushed it into production in the 90s before anyone realized it was a mistake. :doh:

          Did you ever see history portrayed as an old man with a wise brow and pulseless heart, weighing all things in the balance of reason? Is not rather the genius of history like an eternal, imploring maiden, full of fire, with a burning heart and flaming soul, humanly warm and humanly beautiful? --Zachris Topelius
