On Dec 11, 2007, at 9:15 PM, David Dunham wrote:

I've inherited some C++ code that uses L"foo" to specify strings. I'm well aware that this has some problems, but it's what I've got. In the process of trying to fix this and make it actually cross-platform and Unicode-savvy, I have a question: what encoding does gcc use for the resulting string? I'm guessing this is not actually UTF-16 like the authors think (assuming they know about encodings).
The GNU documentation of gcc wide character encoding is here:
Basically, it's ISO 10646 stored as UCS-4, whose code point values agree with UTF-16 code units for characters in the Basic Multilingual Plane. For promoting 7-bit ASCII characters, which is what you'd usually find in source, it will be more than adequate.
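If you want to see what actually lands in a wide literal, a quick check like this works (just a sketch, assuming gcc on a platform with a 4-byte wchar_t, such as Mac OS X or Linux):

    // Check what the compiler stores for a wide string literal.
    // (Assumes gcc with a 4-byte wchar_t; MSVC would report 2.)
    #include <cstddef>
    #include <cstdio>

    int main() {
        const wchar_t *s = L"foo";
        std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
        for (std::size_t i = 0; s[i] != L'\0'; ++i) {
            // Each 7-bit ASCII character is promoted to the same code point value.
            std::printf("s[%zu] = U+%04lX\n", i, (unsigned long)s[i]);
        }
        return 0;
    }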
But it is not UTF-16: UTF-16 encodes some characters as variable-length runs of code units (surrogate pairs), which is incompatible with the fixed-width notion of a wchar_t. Prior to Xcode 3.0, gcc was generally unreliable when passed UTF-8 input; it's better in Leopard, and ISO 10646 characters passed in L"foo" string literals will in most cases be expanded correctly, but many, many features of Unicode (combining sequences, normalization, and so on) go beyond what a flat run of UCS-4 code points gives you.
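To make the fixed-width versus surrogate-pair point concrete, here's a small sketch (not from the original code; it assumes a gcc that accepts \U universal character names and has a 4-byte wchar_t):

    #include <cstdio>
    #include <cwchar>

    int main() {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic
        // Multilingual Plane: one wchar_t element under UCS-4, but it
        // would need a surrogate pair (0xD834 0xDD1E) in UTF-16.
        const wchar_t *clef = L"\U0001D11E";
        std::printf("elements: %zu\n", std::wcslen(clef));          // 1, not 2
        std::printf("value:    U+%05lX\n", (unsigned long)clef[0]); // U+1D11E
        return 0;
    }

On a compiler with a 16-bit wchar_t (MSVC, for example), the same literal would occupy two elements, which is exactly the variable-width behavior a fixed-width wchar_t is meant to avoid.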
Chris