Re: Possible Clang Bug in Initializing Wide String From String Literal?
- Subject: Re: Possible Clang Bug in Initializing Wide String From String Literal?
- From: Andreas Grosam <email@hidden>
- Date: Fri, 20 May 2011 11:16:17 +0200
Thank you all for your replies. I still think there is an issue.
On May 19, 2011, at 11:46 PM, Sean McBride wrote:
> Did you pass -finput-charset= ? Also, use %zu for size_t, though I'm
> sure that's not the problem.
No, just the default settings from an Xcode 4 target template for a C console application.
For gcc and clang, -finput-charset is not set.
On May 19, 2011, at 11:17 PM, Mark Wagner wrote:
> Your source code isn't ASCII. How the compiler handles the non-ASCII
> bits (the string literal) is implementation-defined.
>
> --
> Mark Wagner
Yes, this is implementation-defined (C99, 5.1.1.2 (5)). Nonetheless, since it is ultimately *defined*, the behavior should be deterministic:
From Apple's gcc documentation:
-fexec-charset=charset
Set the execution character set, used for string and character constants. The default is UTF-8. charset can be any encoding supported by the system's iconv library routine.
-fwide-exec-charset=charset
Set the wide execution character set, used for wide string and character constants. The default is UTF-32 or UTF-16, whichever corresponds to the width of wchar_t. As with -fexec-charset, charset can be any encoding supported by the system's iconv library routine; however, you will have problems with encodings that do not fit exactly in wchar_t.
-finput-charset=charset
Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's iconv library routine.
My locale is:
$ locale
LANG=
...
LC_CTYPE="UTF-8"
...
The source is written in UTF-8, so no misinterpretation should be possible. In my current environment, in Xcode, and for this build, gcc appears to use UTF-8 as the input character set.
I couldn't find any particular documentation for clang, but it should be compatible with gcc. The results are not, though. It seems clang uses a different input charset than gcc does, and possibly a different execution charset as well.
On May 19, 2011, at 11:46 PM, Chris Hanson wrote:
> This includes normalization, decomposition, etc. So GCC may have compiled the wide string literal as one composed "ü" character, while clang may have compiled it as one "u" character and one "¨" combining mark. Both are legal representations and perfectly valid UTF-8.
The source file is encoded in UTF-8. In the given environment, GCC uses UTF-8 as input source charset, and UTF-32 as execution charset for sequences of wchar_t. The result should be completely deterministic, and gcc actually does what one would expect.
Both arrays were initialized as if I had written:
char s[] = "\xC3\xBC"; // = "\xC3\xBC" - represents 'ü' in UTF-8
wchar_t ws[] = L"\xC3\xBC";
The two bytes form a valid UTF-8 multibyte sequence -- assuming the input source is encoded in UTF-8.
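To compare what each compiler actually stored, one can dump both arrays element by element. The following is only a minimal sketch; the helper names `dump_wide` and `dump_bytes` are hypothetical and were not part of my original test program:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: format the elements of ws (excluding the
   terminating null) as space-separated hex values into buf.
   Returns the element count, i.e. wcslen(ws). */
size_t dump_wide(const wchar_t *ws, char *buf, size_t bufsize)
{
    assert(ws != NULL && buf != NULL && bufsize > 0);
    size_t len = wcslen(ws);
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < len; ++i) {
        int w = snprintf(buf + used, bufsize - used, "%s%lX",
                         i ? " " : "", (unsigned long)ws[i]);
        if (w < 0 || (size_t)w >= bufsize - used)
            break;                  /* buffer too small; stop formatting */
        used += (size_t)w;
    }
    return len;
}

/* Same idea for a narrow string: one two-digit hex value per byte. */
size_t dump_bytes(const char *s, char *buf, size_t bufsize)
{
    assert(s != NULL && buf != NULL && bufsize > 0);
    size_t len = strlen(s);
    size_t used = 0;
    buf[0] = '\0';
    for (size_t i = 0; i < len; ++i) {
        int w = snprintf(buf + used, bufsize - used, "%s%02X",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        if (w < 0 || (size_t)w >= bufsize - used)
            break;                  /* buffer too small; stop formatting */
        used += (size_t)w;
    }
    return len;
}
```

Printing the returned count with `%zu` alongside the buffer shows the result at a glance. For the wide array, a single element `FC` would mean the literal was decoded as UTF-8, while two elements `C3 BC` would mean each source byte ended up in its own wchar_t.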
I think the difference is that gcc treats the input source as UTF-8 and concludes that this is one valid character which can be represented in one wchar_t. It then initializes the wchar_t array accordingly, as defined by the mbstowcs function with an implementation-defined locale (see 6.4.5).
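The interpretation I attribute to gcc can be sketched as a plain UTF-8 decode of the two source bytes into one code point. This is an illustration only, not gcc's implementation (the documentation quoted above refers to the system's iconv library); the function name `decode_utf8_2byte` is hypothetical, and the sketch handles only two-byte sequences:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point.
   Returns 1 on success, 0 if the bytes are not a valid two-byte sequence. */
int decode_utf8_2byte(const unsigned char *p, wchar_t *out)
{
    assert(p != NULL && out != NULL);
    if ((p[0] & 0xE0) != 0xC0)
        return 0;                   /* lead byte must look like 110xxxxx */
    if ((p[1] & 0xC0) != 0x80)
        return 0;                   /* continuation must look like 10xxxxxx */
    unsigned long cp = ((unsigned long)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    if (cp < 0x80)
        return 0;                   /* reject overlong encodings */
    *out = (wchar_t)cp;
    return 1;
}
```

For the bytes 0xC3 0xBC this yields the single value 0xFC, which is U+00FC ('ü'), so the wide array would hold one character plus the terminating null.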
In clang (and I'm guessing now), the default source charset seems to be ASCII. For clang, these two bytes are individual bytes, and *each byte* is converted individually to a wchar_t the same way.
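The behavior I am guessing at for clang would correspond to widening every byte on its own, ignoring any multibyte structure. Again, this is only a sketch of my assumption, not clang's actual code; `widen_bytewise` is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Widen each of the n bytes in src to its own wchar_t and append a
   terminating null. dst must have room for n + 1 elements.
   Returns the number of wide characters written, excluding the null. */
size_t widen_bytewise(const char *src, size_t n, wchar_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = (wchar_t)(unsigned char)src[i];  /* no UTF-8 decoding */
    dst[n] = L'\0';
    return n;
}
```

Applied to the bytes 0xC3 0xBC, this produces two wide characters, 0xC3 and 0xBC, instead of the single 0xFC that a UTF-8 decode would give.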
>
> It's even possible the (de)composition of the source file changed if you opened and saved it in an editor between your two compilations.
Of course, the source file will not be changed when compiling with different compilers. However, the individual bytes of the source file can be interpreted differently - regardless of the actual encoding of the source file. For sure, gcc treats it as UTF-8 - and it IS UTF-8.
>
> -- Chris
So I still believe there is an undesired difference in behavior, although clang may not misbehave according to the standard, if only because this behavior is nowhere documented. This lack of documentation is, IMHO, itself a documentation bug, since the behavior shall be "implementation defined" but it isn't defined.
Andreas