Trigraphs and C++
-
In an application I'm writing, identifiers have the format DDDDDD-DDDD, where D is a digit. In certain cases the identifier is not yet known, so I thought I'd display such a "NULL" instance as ??????-???? instead of just a blank field. I was very surprised to find that the UI displayed it as ????~????. WTF? I started debugging CString et al. to see if I was hitting some hidden/unknown escape sequence. No such thing. I didn't really expect to find anything like that, but I don't like loose ends. Then I tried concatenating a CString into ??????-????, and that worked, so it could not have triggered some kind of escape code in the UI (a BCGSoft list control in this case). Then it dawned on me that the compiler must be the culprit for some reason. I declared
const char* lpsz = "??????-????";
and lo and behold, the debugger displayed lpsz as ????~????. Then I remembered trigraphs, an old feature that I suspect FEW programmers have ever used. It turns out that ??- is the trigraph for ~. My first thought was that this must surely be a compiler bug: why is the compiler messing with my strings? But according to http://en.wikipedia.org/wiki/Digraphs_and_trigraphs, trigraphs are handled at the character-stream level. A trigraph is not a token by itself! It is detected and replaced inline in the text stream, before tokenization, and is therefore picked up in both code AND string literals. Digraphs, it turns out, are recognized only at the token level, so strings are left untouched.
-- Kein Mitleid Für Die Mehrheit
-
They get detected in *comments* too, which can cause much hilarity.
Help me! I'm turning into a grapefruit! Buzzwords!
-
I'm hoping they are removed in the new C++ standard. If you can't type ordinary C++ symbols, you should switch terminals, NOT trade your sanity for typability! (Is that a word? :-D)
-- Kein Mitleid Für Die Mehrheit
-
Cool, I didn't even know about trigraphs. -Saurabh
-
BTW, it's often useful to split string literals into separate quote-delimited parts that the compiler joins at compile time. For example, printf("\xAE" "abracadabra" "\xAF"); will compile, whereas "\xAEabracadabra\xAF" will likely either not compile or else yield a different string, because the hex escape keeps consuming the hex-digit letters that follow it. Since ??" is not a trigraph, splitting string literals after double question marks should avoid trouble.
-
I had come across this issue some years ago when trying to write C code from an IBM 3270 (?) terminal. IBM's keyboard was missing a few of the characters needed so the trigraph trick was the solution. I didn't realise it had actually become part of the language.
It's time for a new signature.