Unicode and codeproject article

bkelly13

Hello Daniel, then that is the way I am going. I found a 2012 article here on CodeProject titled

Quote:

What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR (etc.)?

and will be using it for my first reference and learning tool. Nice quote in your siggie.

Thank you for your time. If you work with telemetry, please check this bulletin board: www.irigbb.com

bkelly13

Additional reading is not yielding a good conclusion. My application is in telemetry. Cutting this to an absolute minimum, I use Excel VBA to build a text-based file containing as many as 100,000 pieces of information. My application uses that to configure itself and determine how to translate the raw input into parameters that another application displays in real time. The application can write copious amounts to text-based log files so I can understand the data better and see how it runs. The only human interaction is to start the app, select a configuration file, and use checkboxes to set logging options. Everything is currently running as Unicode in Visual Studio. The app will never be used by the general public. There is no expectation of translation to other languages. But I do want to write in a style that will be useful in other projects. Am I OK with Unicode and strings such as L"read this"? Do I need to use the UTF-8 options?

Thank you for your time

Daniel Pfeffer

Given your constraints (no public release, no translation to other languages), using Unicode is not necessary. The ANSI functions are a tiny bit slower (they must convert all string data to/from Unicode), but that is not relevant to your case. I still believe that for new programs, Unicode is the correct way to go for UI. Among other reasons, Microsoft is slowly "deprecating" its MBCS (multi-byte character set) support - in recent versions of Visual Studio, the MBCS library was a separate download! As for the data processing, that depends on the input and output formats. If your input is ASCII (alphanumerics, punctuation, CR/LF), and the output is the same, there is no need or reason to convert it to Unicode for processing. Just as a (very) short example, this coding style is perfectly valid:

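#define UNICODE // defined when you set the Windows functions to Unicode-style in VS
#include <stdio.h>
#include <windows.h>

void foo(void)
{
    FILE* fp = fopen( "bar", "rb" );
    int c;

    if ( fp == NULL )
        return; // could not open the input file

    //...

    // note the parentheses: assign first, then compare with EOF
    while ( (c = getc(fp)) != EOF )
    {
        if ( c == '\x42' )
            MessageBox( NULL, L"Telemetry", L"Bad input", MB_OK );
        // further processing here...
    }

    //...

    fclose( fp );
}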

Note that I am using char functions to read the data, but Unicode (wide char) functions for the UI. If you must force a Windows API to be char-based (ANSI), use the name with an 'A' suffix (e.g. MessageBoxA instead of MessageBox). If you must force it to be wide char-based (Unicode), use a 'W' suffix. This, of course, only applies to APIs that have string / character parameters. If you need to convert between Unicode and ASCII (or UTF-8), the best way to do so is using the WideCharToMultiByte() / MultiByteToWideChar() Windows APIs. I hope that this helps.
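As a minimal sketch of the char-to-wide step for the pure ASCII case described above: the helper name widen_ascii is hypothetical (it is not a Windows API), and it only handles 7-bit ASCII, where each byte already equals its Unicode code point. Anything beyond ASCII would need a real conversion such as MultiByteToWideChar with an explicit code page.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of any Windows API: widens 7-bit ASCII
// bytes to wchar_t. ASCII values are identical in Unicode, so copying
// the value is enough. Non-ASCII bytes are rejected by the assert.
std::wstring widen_ascii(const std::string& narrow)
{
    std::wstring wide;
    wide.reserve(narrow.size());
    for (unsigned char byte : narrow)
    {
        assert(byte < 0x80 && "input must be 7-bit ASCII");
        wide.push_back(static_cast<wchar_t>(byte));
    }
    return wide;
}
```

On Windows, the result could then be handed to a wide API, for example MessageBoxW( NULL, widen_ascii(line).c_str(), L"Telemetry", MB_OK ), where line is whatever char-based text was read from the file.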

If you have an important point to make, don't try to be subtle or clever. Use a pile driver. Hit the point once. Then come back and hit it again. Then hit it a third time - a tremendous whack. --Winston Churchill

Lost User

bkelly13 wrote:

Am I OK with Unicode and strings such as L"read this"? Do I need to use the UTF-8 options?

If you make everything Unicode, you should not have any issues, apart from perhaps converting your text files from ANSI to Unicode when you read them. Either way, Unicode is the best choice for the long term, especially as you may decide to move to Windows Forms/C# in the future.

Daniel Pfeffer

Richard MacCutchan wrote:

If you make everything Unicode, you should not have any issues.

The OP is processing real-time telemetry, which is (these days) usually char-based. IMO, there is no good reason to convert the telemetry to Unicode before processing - it slows the processing, doubles the storage requirements, and adds nothing to any processing of numeric data. Similar considerations apply to the output.
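To put a number on the storage point, here is a small sketch that compares the bytes needed for the same ASCII record as char data and as UTF-16 code units. The helper names payload_bytes and growth_factor, and the sample record used in the note below, are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes occupied by the character data of a string (terminator excluded).
template <typename CharT>
std::size_t payload_bytes(const std::basic_string<CharT>& text)
{
    return text.size() * sizeof(CharT);
}

// Widens an ASCII record to UTF-16 code units and returns how many times
// larger the wide copy is than the original char data.
std::size_t growth_factor(const std::string& ascii_record)
{
    assert(!ascii_record.empty());
    const std::u16string wide(ascii_record.begin(), ascii_record.end());
    return payload_bytes(wide) / payload_bytes(ascii_record);
}
```

For a record such as "ALT=1234" the char version takes 8 bytes and the UTF-16 version 16, so growth_factor returns 2.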


Lost User

I am well aware of what he is doing, and I only added that as a "perhaps". At the end of the day it's his choice.

Daniel Pfeffer

I sit corrected. :)


Lost User

I stand in ignorance. ;)

bkelly13

Telemetry data is usually all numbers and all binary. Economy in bandwidth is a primary goal. The only text may be things like software version embedded in some parts. Even then those are treated as binary data and handed off to the display device.

The text part is where I have a "bunch" of Excel code to build configuration files. Some assembly, make that much assembly, is required to translate the vendor telemetry map (which describes all the fields of the data) to something directly usable by my application. When not running in mission mode the app can write copious log files so I can verify what it did and why. Those are all text based for easy reading. Unicode is fine there.

Side note/rant: IRIG (Inter Range Instrumentation Group) defines telemetry standards for all government ranges. A range is a place where things like bombs are dropped and missiles shot. That standard defines bit 1 as being the MSB and bit N being the LSB. It is absolutely backwards, so one of the tasks of my code is to renumber all the bit fields. But the vendors do not follow the standard anyway. In one telemetry map the LSB is sometimes bit 0 and sometimes bit 1. In almost every word that has bit field definitions they have put a note that says the MSB is numbered as bit 0 or bit 1. They just cannot understand that the need to keep putting that note in there is a not so subtle indication that they are doing things wrong. Further, they have at least six different formats for describing those bit fields. With 10,000 parameters in a telemetry stream, that becomes a nightmare for writing code to extract the data needed to process the parameters. End of rant.

It appears that when writing text files, Excel VBA code writes Unicode by default. Since Windows is now Unicode based, it seems much better to go with that. I am mostly there, but have not looked at my tokenizer code lately. (Each parameter is written to a text file, one line per parameter and as many as a dozen pieces of data in each line.) This text file must be in text rather than binary because I must be able to read it myself to check for errors. Other than log files, none of the real time work uses any text operations. I don't care if it takes 10 bytes per character to store the configuration file.

Conclusion: I'll go with Unicode all the way.

Question: What is this deal with this WCHAR in Visual Studio? One of the articles I found said WCHAR is equivalent to wchar_t, then said no more. Ok, but being a guy with sometimes too much self dou
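For the tokenizer mentioned in the post above, a wide-character version might look roughly like this. It is only a sketch: the post does not say how the fields on a line are separated, so the comma delimiter, the function name split_fields, and the sample field values are all assumptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits one configuration line into its fields using std::wstring.
// The comma delimiter is an assumption, not the poster's actual format.
// Empty fields are kept, so "a,,b" yields three entries.
std::vector<std::wstring> split_fields(const std::wstring& line,
                                       wchar_t delimiter = L',')
{
    // One parameter per line, so a line break inside the input is a bug.
    assert(line.find(L'\n') == std::wstring::npos);

    std::vector<std::wstring> fields;
    std::wstring::size_type start = 0;
    for (;;)
    {
        const std::wstring::size_type end = line.find(delimiter, start);
        if (end == std::wstring::npos)
        {
            fields.push_back(line.substr(start)); // last (or only) field
            return fields;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
}
```

A call such as split_fields(L"ALT_RADAR,16,3,0.5") would return four fields, the first being L"ALT_RADAR". Numeric fields could then be converted with std::stoi or std::stod, which accept std::wstring directly.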

Lost User

bkelly13 wrote:

What is this deal with this WCHAR

If you right click your mouse on any of these types in your source code you can then select "Go to definition", which will bring up the include file where it's defined. You can see that WCHAR is defined in winnt.h as equivalent to wchar_t which is a fundamental type known by the compiler. The definition of WCHAR is required for porting to compilers that do not have that fundamental type (or did not in the days before C++). Use whichever type you are more comfortable with, although using WCHAR tends to give more flexibility if you ever need to port your code to some alternative platform.
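A quick way to convince yourself of the equivalence is to let the compiler check it. The sketch below assumes a C++11 compiler. On a build without the Windows SDK it supplies its own stand-in typedef for WCHAR (that stand-in is an assumption for illustration, not something a non-Windows platform provides), and wide_length is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

#ifdef _WIN32
#include <windows.h>     // WCHAR comes from winnt.h
#else
typedef wchar_t WCHAR;   // stand-in for builds without the Windows SDK
#endif

// Fails to compile if WCHAR is ever something other than wchar_t.
static_assert(std::is_same<WCHAR, wchar_t>::value,
              "WCHAR is expected to be an alias of wchar_t");

// Counts the characters in a null-terminated wide string. Because WCHAR
// and wchar_t are the same type, an L"..." literal can be passed directly.
std::size_t wide_length(const WCHAR* text)
{
    assert(text != nullptr);
    std::size_t count = 0;
    while (text[count] != L'\0')
        ++count;
    return count;
}
```

So wide_length(L"read this") gives 9. One thing to keep in mind if the code ever leaves Windows: wchar_t is 2 bytes with Visual C++ but typically 4 bytes with GCC or Clang on Linux, so the size of a wide character is not the same everywhere.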

bkelly13

Re: Use whichever type you are more comfortable with, although using WCHAR tends to give more flexibility if you ever need to port your code to some alternative platform. I have been working with Microsoft VS for a while now and have not gotten out to play with others in a long time. I will go with that and stick with the WCHAR.


Theo Buys

Daniel Pfeffer wrote:

Richard MacCutchan wrote:

If you make everything Unicode, you should not have any issues.

It depends on what you mean by Unicode... The Windows API and UI use UTF-16 (started with Windows-NT 4.0), but if you generate output for SMTP/email/web you must use UTF-8. For UTF-16 you can use CStringW or std::wstring, but for UTF-8 use CStringA or std::string. UTF-8 is a multibyte string format, but it has nothing to do with the old MBCS, which depends on code pages. In this case, using CString, which depends on the UNICODE define to make the code UTF-16 aware, is now outdated and can shoot you in the foot. Conversions between UTF-16 and UTF-8 can be done with the current MultiByteToWideChar and WideCharToMultiByte. But if you write more general software, do it with the STL:

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;

The bad thing is that the current C++ Visual Studio editor can't handle utf-8 string literals. It is a Windows application you know...
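For completeness, here is a hedged sketch of how a wstring_convert converter is typically used for the UTF-16 / UTF-8 round trip. The wrapper names wide_to_utf8 and utf8_to_wide are made up for this example. Note that std::wstring_convert and the <codecvt> header are deprecated since C++17, so treat this as an illustration of the approach rather than a recommendation for new code.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Wide string (UTF-16 code units) to UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.to_bytes(wide);
}

// UTF-8 bytes to wide string (UTF-16 code units).
std::wstring utf8_to_wide(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes(utf8);
}
```

Writing non-ASCII characters as escapes sidesteps the editor issue with literals: wide_to_utf8(L"caf\u00E9") yields the five bytes "caf\xC3\xA9", and utf8_to_wide turns them back into the original wide string. Both functions throw std::range_error on invalid input.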