Re: iconv (libiconv.dylib) broken
- Subject: Re: iconv (libiconv.dylib) broken
- From: Andreas Grosam <email@hidden>
- Date: Wed, 10 Feb 2010 00:44:44 +0100
On Feb 9, 2010, at 11:24 PM, Jonas Maebe wrote:
>
> On 09 Feb 2010, at 22:52, Andreas Grosam wrote:
>
>> On Feb 9, 2010, at 9:50 PM, Jonas Maebe wrote:
>>
>>>
>>> On 09 Feb 2010, at 20:58, Andreas Grosam wrote:
>>>
>>>> after experimenting with the iconv library it seems that it is broken on Mac OS X.
>>>
>>> It works fine, but you have to call setlocale(LC_ALL,"") before calling any iconv routines.
>> Unfortunately, this does not work for me.
>> setlocale(LC_ALL,"") will set the global "C" locale on my system, since the corresponding environment variables (LC_* and LANG) are not set.
>
> Are you perhaps on 10.5? It works for me on 10.6 if I include the setlocale call, even with a C locale (including explicitly setting LC_CTYPE to C, as it's set to UTF-8 even if you export LANG=C). On 10.5, it indeed only works if the locale is UTF-8 (or if you explicitly pass a UTF-8 locale as second argument to setlocale, such as en_US.UTF-8).
I'm on Mac OS X version 10.6.2
The LC_* and LANG environment strings are not set. setlocale(LC_ALL,"") returns "C".
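A minimal sketch of how this can be checked, assuming a POSIX-like C library. The helper names adopt_environment_locale and print_locale_environment are made up for illustration and are not part of my test program:

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: adopt the locale named by the environment and
   return its name. With no LC_* or LANG variables set, this is
   normally "C". */
const char *adopt_environment_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    return name ? name : "(setlocale failed)";
}

/* Hypothetical helper: print the main variables that
   setlocale(LC_ALL, "") consults (other LC_* categories exist too). */
void print_locale_environment(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; ++i) {
        const char *value = getenv(vars[i]);
        printf("%s=%s\n", vars[i], value ? value : "(unset)");
    }
}
```

Calling print_locale_environment() and then printing the result of adopt_environment_locale() shows whether the process really ends up in the "C" locale.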
I've inserted the setlocale call like you suggested:
int main (int argc, const char * argv[])
{
    char* current_locale = setlocale(LC_ALL,"");
    ...
current_locale yields "C".
But it didn't make a difference when converting:
wtest= convertToWstring("T\xC3\xBCT", 4, "UTF-8", "WCHAR_T"); // "TüT"
check(L"T\u00fcT", wtest);
output in the debug console (commented):
--- iconv start ---
in buffer: 54 c3 bc 54 // this is the utf-8 string "TüT" as input
out buffer: 54 00 00 00 c3 00 00 00 54 00 00 00 // this is what iconv returns as byte sequence
wchar buffer: 00000054 000000c3 00000054 // the same as wchar_t sequence
--- iconv end ---
conversion failed:
ought to: 00000054 000000fc 00000054 // comparison with expected result and
test: 00000054 000000c3 00000054 // actual result as wchar_t sequence.
Interesting:
When I change the charset name from "WCHAR_T" to "UCS-4-INTERNAL", the conversion is correct (in my limited test case). WCHAR_T and UCS-4-INTERNAL should be effectively identical, since GCC uses UCS-4 as the internal wchar_t encoding.
wtest= convertToWstring("T\xC3\xBCT", 4, "UTF-8", "UCS-4-INTERNAL"); // "TüT"
check(L"T\u00fcT", wtest);
-> Conversion is correct!
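For reference, a conversion of this kind can be sketched roughly as follows. This is not the actual convertToWstring from my test program; utf8_to_wchars is a hypothetical stand-in that assumes the POSIX iconv interface with a char ** input argument:

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of UTF-8 into wchar_t units
   using the given target charset name (for example "WCHAR_T" or
   "UCS-4-INTERNAL"). Returns the number of wchar_t units written, or
   (size_t)-1 on error. */
size_t utf8_to_wchars(const char *in, size_t in_len,
                      const char *to_charset,
                      wchar_t *out, size_t out_capacity)
{
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;            /* iconv takes a non-const char ** */
    size_t in_left = in_len;
    char *out_ptr = (char *)out;
    size_t out_bytes = out_capacity * sizeof(wchar_t);
    size_t out_left = out_bytes;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Bytes consumed in the output buffer, expressed in wchar_t units. */
    return (out_bytes - out_left) / sizeof(wchar_t);
}
```

Calling it once with "WCHAR_T" and once with "UCS-4-INTERNAL" for the same input makes the difference visible. Note that "UCS-4-INTERNAL" is a GNU libiconv name and may not be known to other iconv implementations, and that on Mac OS X the program probably has to be linked with -liconv.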
>
> I've included my modified version of your program below. If I comment out the setlocale call, it fails. With the call, it works (on 10.6, or on 10.5 with a UTF-8 locale -- and it always works on both if I pass "en_US.UTF-8" as the second parameter to setlocale).
To be frank, this sounds strange - if not buggy ;)
I dug into the sources of iconv and found no hint that setlocale will make a difference either. It will be called, but as far as I can see only to determine the current locale for getting defaults. Nonetheless, I may be wrong there. It could also be a nasty side effect, I don't know.
But I also found this:
...
if (ap->encoding_index == ei_local_wchar_t) {
  /* On systems which define __STDC_ISO_10646__, wchar_t is Unicode.
     This is also the case on native Woe32 systems. */
#if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
  if (sizeof(wchar_t) == 4) {
    index = ei_ucs4internal;
    break;
  }
  if (sizeof(wchar_t) == 2) {
    index = ei_ucs2internal;
    break;
  }
  if (sizeof(wchar_t) == 1) {
    index = ei_iso8859_1;
    break;
  }
#endif
}
index = ap->encoding_index;
...
This requires an explanation:
(ap->encoding_index == ei_local_wchar_t) will be true if the charset is given as "WCHAR_T".
So, if the following condition
#if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
evaluates to true, the charset effectively becomes ei_ucs4internal (with a 4-byte wchar_t).
And, as already tested, ei_ucs4internal means "UCS-4-INTERNAL", and this works!
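As a possible workaround, the same sizeof() cascade can be applied on the caller's side, so that an explicit name is passed to iconv_open instead of "WCHAR_T". A minimal sketch, assuming wchar_t values really are Unicode code points and that the library is GNU libiconv (the *-INTERNAL names are libiconv names); explicit_wchar_charset is a hypothetical helper, not part of any library:

```c
#include <stddef.h>

/* Hypothetical helper: name an explicit libiconv encoding matching the
   width of wchar_t, mirroring the sizeof() cascade in the quoted source.
   Assumes wchar_t holds Unicode code points. */
const char *explicit_wchar_charset(void)
{
    switch (sizeof(wchar_t)) {
    case 4:  return "UCS-4-INTERNAL";
    case 2:  return "UCS-2-INTERNAL";
    case 1:  return "ISO-8859-1";
    default: return NULL;   /* unexpected width: the caller must decide */
    }
}
```

The returned name would then be used wherever "WCHAR_T" was passed before. This only sidesteps the problem and says nothing about why the ei_local_wchar_t path misbehaves.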
OK, it seems the condition DOESN'T evaluate to true. In fact, __STDC_ISO_10646__ is not defined. Usually, this is ALWAYS defined with GCC, since all the internal wchar_t number crunching conforms to UCS-4. If I'm right, this is a GCC built-in definition, set when the compiler itself is built (not 100% sure, though).
But on __APPLE__, __STDC_ISO_10646__ is for some reason not defined. If this is actually correct and intended, it would mean that Apple deliberately uses a different internal wchar_t encoding, making all of the GNU libraries that deal with wchar_t incompatible. And that is a lot: effectively everything dealing with encodings and charsets.
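Whether the macro is defined on a given system can be checked with a few lines. A minimal sketch; has_stdc_iso_10646 and report_wchar_facts are hypothetical names used only here:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the implementation defines
   __STDC_ISO_10646__, else 0. */
int has_stdc_iso_10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print the two facts the quoted #if depends on. */
void report_wchar_facts(void)
{
    printf("sizeof(wchar_t)    = %zu\n", sizeof(wchar_t));
#ifdef __STDC_ISO_10646__
    printf("__STDC_ISO_10646__ = %ldL\n", (long)__STDC_ISO_10646__);
#else
    printf("__STDC_ISO_10646__ is not defined\n");
#endif
}
```

Running report_wchar_facts() on both systems would show directly whether the #if branch in the quoted source can be taken at all.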
So, this may explain why I get an incorrect conversion: for some reason, the obvious conversion type (ei_ucs4internal) is not chosen when the charset is specified as "WCHAR_T". It stays ei_local_wchar_t, and I have not yet been able to figure out why exactly that then fails. But it fails.
It does not, however, explain why you get a correct conversion. Could you please triple check whether you really get a correct conversion on your system? :) I honestly cannot imagine why the global locale should affect the result, even more so when it is set to "C".
If the conversion is incorrect you get:
// TüT:
--- iconv start ---
in buffer: 54 c3 bc 54
out buffer: 54 00 00 00 c3 00 00 00 54 00 00 00
wchar buffer: 00000054 000000c3 00000054
--- iconv end ---
conversion failed:
ought to: 00000054 000000fc 00000054
test: 00000054 000000c3 00000054
otherwise:
// TüT
--- iconv start ---
in buffer: 54 c3 bc 54
out buffer: 54 00 00 00 fc 00 00 00 54 00 00 00
wchar buffer: 00000054 000000fc 00000054
--- iconv end ---
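To compare results on your side, something like the following could produce the same kind of hex line and do the check. A minimal sketch, not the code from my test program; format_wchars and same_wchars are hypothetical names:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format n wchar_t units as space-separated
   8-digit hex words. Returns the number of characters written
   (excluding the terminator), or -1 if the buffer is too small. */
int format_wchars(const wchar_t *units, size_t n, char *buf, size_t buf_size)
{
    size_t used = 0;
    if (buf_size == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < n; ++i) {
        int w = snprintf(buf + used, buf_size - used, "%s%08lx",
                         i ? " " : "", (unsigned long)units[i]);
        if (w < 0 || (size_t)w >= buf_size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}

/* Hypothetical helper: return 1 if both sequences have the same
   length and contents, else 0. */
int same_wchars(const wchar_t *a, size_t a_len,
                const wchar_t *b, size_t b_len)
{
    if (a_len != b_len)
        return 0;
    for (size_t i = 0; i < a_len; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

A wrong result shows up as 000000c3 in the middle word (the first UTF-8 byte taken as a character of its own), a correct one as 000000fc.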
>
>> Anyway, the POSIX global locale shouldn't have any effect in iconv. Iconv converts character strings from one specified locale to another specified locale and is thus independent of the POSIX locale, which is always global for the process.
>
>
> You're right. I was misremembering the situation in which I ran into a similar problem (in which case I was calling nl_langinfo to determine the target locale for iconv calls, where you do have to call setlocale first since otherwise nl_langinfo may simply return info about the C locale)
>
>> I should also mention, that I convert from the "WCHAR_T" locale to "UTF-8", in which case the conversion fails.
>
> It seems that this conversion is somehow bound to setlocale, even though it indeed shouldn't be.
Yep, this is strange. I smell a bug ;)
Thank you very much for your help! Much appreciated! :)
(btw, didn't get your modified sources, maybe they are stripped by the mail-system)
>
>
> Jonas
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> Xcode-users mailing list (email@hidden)
> Help/Unsubscribe/Update your Subscription:
>
> This email sent to email@hidden