If you can't match UTF32 codepoints, you don't support Unicode!

honey the codewitch

*sigh* - (half serious here) what even is the point of UTF16? You still have the possibility of encountering surrogate characters, which means if you want to support Unicode streams you have to handle that possibility. MSSQL supports UTF16, but not UTF32. Consequently, here's me fetching the next character of an NVARCHAR(x) or NTEXT stream, with UTF32 support. Forgive the grotty code, as there are no unsigned data types, no bit shifts, etc.

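DECLARE @valueEnd INT = DATALENGTH(@value)/2+1 -- one past the last UTF-16 code unit
DECLARE @index INT = 1
DECLARE @ch BIGINT
DECLARE @tch BIGINT
...
SET @ch = UNICODE(SUBSTRING(@value,@index,1))
-- emulate an unsigned range check for 0xd800 <= @ch <= 0xdfff
SET @tch = @ch - 0xd800
IF @tch < 0 SET @tch = @tch + 2147483648
IF @tch < 2048
BEGIN
    -- surrogate: fold in the next code unit
    SET @ch = @ch * 1024
    SET @index = @index + 1
    IF @index >= @valueEnd RETURN -1 -- error: input ends mid-pair
    -- 0x35fdc00 = 0xd800 * 1024 + 0xdc00 - 0x10000
    SET @ch = @ch + UNICODE(SUBSTRING(@value,@index,1)) - 0x35fdc00
END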

This is hateful and slow. I haven't even tested it yet. I know it doesn't gracefully handle every invalid Unicode stream, but making it do that is even worse. To heck with this. Why UTF16 with no functions to convert a surrogate pair to UTF32? It's ridic. Someone asked me the other day why I don't like Microsoft SQL Server. Here's reason #1359. Bad MSSQL! BAD!
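Since the snippet above is untested, here is a minimal sketch for cross-checking its arithmetic. It is written in Python rather than T-SQL purely so that the language's built-in UTF-16 encoder can act as the oracle. The names `FOLD`, `utf16_units`, `decode_pair` and `check_all_supplementary` are made up for this sketch, and the high/low range validation in `decode_pair` is an assumption that is stricter than the posted code, which only checks that the first unit is some surrogate.

```python
# Cross-check of the folded surrogate formula; Python is only the test oracle here.
FOLD = 0x35FDC00  # same constant as the T-SQL: 0xD800 * 1024 + 0xDC00 - 0x10000


def utf16_units(cp):
    """Return the UTF-16 code units Python's own encoder produces for cp."""
    raw = chr(cp).encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def decode_pair(hi, lo):
    """Combine a surrogate pair with the same arithmetic as the T-SQL above."""
    # Assumption: reject anything that is not a well-formed high/low pair.
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        return -1
    return hi * 1024 + lo - FOLD


def check_all_supplementary():
    """Verify that every code point from U+10000 to U+10FFFF round-trips."""
    for cp in range(0x10000, 0x110000):
        hi, lo = utf16_units(cp)
        if decode_pair(hi, lo) != cp:
            return False
    return True
```

Calling `check_all_supplementary()` walks all 1,048,576 supplementary code points, so it takes a moment; `decode_pair(0xD83D, 0xDE00)` is a quicker spot check for U+1F600.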

Real programmers use butterflies

Lost User

Yep, it is something like a compromise. UTF16 is usually enough, but we also need to do the unusual :) Surrogate characters are only one beast. Accented characters such as 'é', which can be expressed in different ways, are another. OK, here at least we have the possibility to normalize them (https://stackoverflow.com/questions/7811976/normalize-unicode-string-in-sql-server). :^)
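To make the 'é' point concrete, here is a small sketch of the two spellings and what normalization does to them. It uses Python's standard `unicodedata` module only as an illustration; it is not the SQL Server approach from the linked question, and `same_text` is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single precomposed code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings render identically but differ code point by code point, so a plain comparison says they are different while `same_text(composed, decomposed)` treats them as equal.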

honey the codewitch

That isn't an issue for me, as this is a SQL-ized regex engine. All matching conditions are specified using UTF32 stored as BIGINT values (since SQL has no unsigned ints). All incoming characters are resolved using the code I posted. Accent marks will be matched properly. Displaying them is firmly in "Somebody else's problem" territory.

Real programmers use butterflies

Dan Neely

UTF-16 is the result of developers underestimating how many characters would be needed for a universal encoding and lingers like the stench of a soiled diaper because MS rushed it into production in the 90s before anyone realized it was a mistake. :doh:

Did you ever see history portrayed as an old man with a wise brow and pulseless heart, weighing all things in the balance of reason? Is not rather the genius of history like an eternal, imploring maiden, full of fire, with a burning heart and flaming soul, humanly warm and humanly beautiful? --Zachris Topelius