My codecvt (a.k.a. facet) never gets called.
-
I have been at this all night and I am still no closer to a solution. The problem is that the utf16_codecvt methods never get called and, therefore, the result is wrong. I have searched the net, but all I can find are examples of what is supposed to work. Unfortunately, none of them has worked. I have also seen other posters on the net with the same problem, but no one gave them an answer. I have tested to make sure that it has the facet (utf16_codecvt), and it does. So I see no reason why its virtual methods are never called. Instead it keeps calling the
codecvt<wchar_t, char, mbstate_t>
methods. Any ideas?
class utf16_codecvt : public std::codecvt<char16_t, char16_t, std::mbstate_t>
{
...//
};
void MyTestFunc()
{
... //
std::wifstream myFile;
std::locale myLoc = std::locale(myFile.getloc(), new utf16_codecvt);
myFile.imbue(myLoc);
myFile.open(pFileName, std::ios::in | std::ios::binary);
... //
myFile.read(bom_buffer, 1);
... //
}
The following link gives an example of the kind of thing I am trying to do: April 01, 1999 - Unicode Files - P.J. Plauger http://www.ddj.com/cpp/184403638?pgno=1
Signed, Very <blanking> tired.
INTP "Program testing can be used to show the presence of bugs, but never to show their absence." Edsger Dijkstra
From what I can tell, the C++ stream system presumes that files are sequences of bytes, not characters, even when you use wide streams. The 'wide' part of a wide stream (AFAICT) indicates how the stream object interacts with C++, not with the underlying file or whatever. Thus, your codecvt facet has to take in bytes (char) on the file side. By changing the declaration of your codecvt facet to the one shown below, I was able to get breakpoints in the replacement facet to hit.
#include <cwchar>  // std::mbstate_t
#include <locale>  // std::codecvt

class utf16_codecvt : public std::codecvt<char16_t, char, std::mbstate_t>
{
    typedef std::codecvt<char16_t, char, std::mbstate_t> Base;
    typedef char16_t ElemT;
    typedef char ByteT;

    // Each override below only forwards to the base class; the real conversion goes here.
    virtual result __CLR_OR_THIS_CALL do_in(std::mbstate_t& s,
        const ByteT* _First1, const ByteT* _Last1, const ByteT*& _Mid1,
        ElemT* _First2, ElemT* _Last2, ElemT*& _Mid2) const
    {   // convert bytes [_First1, _Last1) to [_First2, _Last2)
        return Base::do_in(s, _First1, _Last1, _Mid1, _First2, _Last2, _Mid2);
    }

    virtual result __CLR_OR_THIS_CALL do_out(std::mbstate_t& s,
        const ElemT* _First1, const ElemT* _Last1, const ElemT*& _Mid1,
        ByteT* _First2, ByteT* _Last2, ByteT*& _Mid2) const
    {   // convert [_First1, _Last1) to bytes [_First2, _Last2)
        return Base::do_out(s, _First1, _Last1, _Mid1, _First2, _Last2, _Mid2);
    }

    virtual result __CLR_OR_THIS_CALL do_unshift(std::mbstate_t& s,
        ByteT* _First2, ByteT* _Last2, ByteT*& _Mid2) const
    {   // generate bytes to return to default shift state
        return Base::do_unshift(s, _First2, _Last2, _Mid2);
    }

    virtual int __CLR_OR_THIS_CALL do_length(const std::mbstate_t& s, const ByteT* _First1,
        const ByteT* _Last1, size_t _Count) const
    {   // return min(_Count, converted length of bytes [_First1, _Last1))
        return Base::do_length(s, _First1, _Last1, _Count);
    }
};
So, your replacement facet will have to know that it needs to read two bytes for every character (and write two bytes for every character on output, obviously). The best reference for that sort of information is probably Standard C++ IOStreams and Locales by Angelika Langer and Klaus Kreft, but even then, locales and facets are heavy going in C++ :-(
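To make the "two bytes per character" point concrete, here is a minimal sketch of what the input side of such a facet might look like. It rests on some assumptions that are not in the thread: the facet derives from std::codecvt<wchar_t, char, std::mbstate_t> (the specialization a standard std::wifstream looks up, whereas the facet above uses char16_t, which worked on the poster's compiler), the file is little-endian, and surrogate pairs and the BOM are ignored. The names utf16le_codecvt and read_utf16le_file are made up for illustration.

```cpp
#include <cwchar>   // std::mbstate_t, std::size_t
#include <fstream>
#include <locale>
#include <string>

// Sketch only: turns each pair of little-endian bytes into one wide character.
// No surrogate-pair handling and no BOM detection (a leading FF FE comes out as 0xFEFF).
class utf16le_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    // Convert bytes [from, from_end) into wide characters [to, to_end).
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 wchar_t* to, wchar_t* to_end, wchar_t*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_end - from_next >= 2 && to_next != to_end) {
            const unsigned lo = static_cast<unsigned char>(from_next[0]);
            const unsigned hi = static_cast<unsigned char>(from_next[1]);
            *to_next++ = static_cast<wchar_t>(lo | (hi << 8));
            from_next += 2;
        }
        // Leftover input (an odd trailing byte or a full output buffer) is reported as partial.
        return from_next == from_end ? ok : partial;
    }

    // Number of bytes do_in would consume to produce at most max characters.
    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override
    {
        const std::size_t available = static_cast<std::size_t>(from_end - from) / 2;
        return static_cast<int>(2 * (available < max ? available : max));
    }

    bool do_always_noconv() const noexcept override { return false; } // conversion is required
    int do_encoding() const noexcept override { return 2; }           // fixed two bytes per character
    int do_max_length() const noexcept override { return 2; }
};

// Hypothetical helper: read a whole file through the facet above.
std::wstring read_utf16le_file(const char* path)
{
    std::wifstream in;
    // Imbue before open(), as in the original post; the locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new utf16le_codecvt));
    in.open(path, std::ios::in | std::ios::binary);

    std::wstring text;
    wchar_t ch;
    while (in.get(ch))
        text.push_back(ch);
    return text;
}
```

With a file containing the four bytes 68 00 69 00, read_utf16le_file should hand back the two wide characters 'h' and 'i'. Only the input direction is overridden here; do_out and do_unshift are left as inherited, so this sketch is not suitable for writing.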
Java, Basic, who cares
Thanks, Stuart. That worked great; I expected the problem was something like that. Something else I discovered is that the second template parameter has to be ‘char’ or it will not work. That is, even ‘unsigned char’ will not work as the second parameter. I need to dig up a copy of the standard to see whether this is compliant and makes sense, because having a template parameter that can only be a single integral type seems illogical.
INTP "Program testing can be used to show the presence of bugs, but never to show their absence." Edsger Dijkstra
John R. Shaw wrote:
Something else I discovered is that the second template parameter has to be ‘char’ or it will not work. That is, even ‘unsigned char’ will not work as the second parameter. I need to dig up a copy of the standard to see whether this is compliant and makes sense, because having a template parameter that can only be a single integral type seems illogical.
I think there are two pertinent ideas here. Firstly, files are streams of bytes (that's the basic concept underlying file streams in C++), which is why file streams always convert to and from bytes. Secondly, codecvt facets can be used on their own, without streams. So, say you'd read in a file, converting from a byte stream to (say) UCS-2, and then you want to write the UTF-32 equivalent to a file. You could use a codecvt facet that converts from UCS-2 to UTF-32. The example code in the codecvt::in documentation on MSDN shows this sort of scenario.
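As a rough illustration of using a facet without any stream, the sketch below pulls the standard codecvt<wchar_t, char, mbstate_t> facet out of the classic locale and calls its in() member directly. It is a simpler conversion than the UCS-2 to UTF-32 case mentioned above (plain bytes to wchar_t), and the helper name widen_with_facet is made up for the example.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: widen a byte string by calling a codecvt facet directly.
std::wstring widen_with_facet(const std::string& bytes)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    std::wstring wide(bytes.size(), L'\0'); // at most one wide character per input byte
    if (bytes.empty())
        return wide;

    std::mbstate_t state = std::mbstate_t(); // initial conversion state
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const Facet::result r = facet.in(state,
                                     bytes.data(), bytes.data() + bytes.size(), from_next,
                                     &wide[0], &wide[0] + wide.size(), to_next);
    if (r == Facet::error)
        return std::wstring();               // give up on undecodable input

    wide.resize(static_cast<std::size_t>(to_next - &wide[0])); // keep only what was converted
    return wide;
}
```

For plain ASCII input such as "abc", this should return the wide string with the same three characters. A custom facet can be called the same way: construct it and call in() or out() on it, with no stream involved.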
Java, Basic, who cares - it's all a bunch of tree-hugging hippy cr*p