Re: wstring

Subject: Re: wstring
From: Alastair Houghton <email@hidden>
Date: Mon, 22 Mar 2004 23:28:31 +0000

On 22 Mar 2004, at 16:34, Kevin Hoyt wrote:

> Now you've got me curious...
>
> Doesn't Mac OS X support the "C-UTF-8" locale?
> If the locale is set to "C-UTF-8", what does it mean to do mbstowcs(),
> thus converting a UTF-8 string to wchar_t?

All it means is that you've converted the string to wide character
format.  The point is that the C standard doesn't say what a wchar_t
contains; indeed, AFAIK it doesn't specify what the locale strings mean
either, with two exceptions, namely:

	"C" specifies the minimal environment for C translation, and

	"" specifies the locale-specific native environment

Anything else is implementation defined.
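
As a rough sketch of what that means in practice, the helper below simply asks setlocale() whether a given name can be selected. The function name try_select_locale is hypothetical (it is not part of any library); only setlocale() and LC_ALL come from <locale.h>.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try to select the named locale for all categories.
 * Returns 1 if setlocale() accepted the name, 0 if it returned NULL.
 * On success the named locale stays selected; nothing is restored. */
int try_select_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}
```

Only try_select_locale("C") is certain to succeed everywhere. The name "" selects whatever the environment asks for, and whether a name such as "C-UTF-8" is accepted, and what it then means, is up to the implementation.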

All it says about wchar_t itself is that it is

	"an integer type whose range of values can represent distinct codes
	for all members of the largest extended character set specified among
	the supported locales; the null character shall have the code value
	zero and each member of the basic character set defined in 5.2.1 shall
	have a code value equal to its value when used as the lone character
	in an integer character constant"

It doesn't place many restrictions on the extended character set, other
than that it is a superset of the basic character set, which must
include the alphanumeric and punctuation characters necessary for any C
program.
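
The one portable guarantee in that wording can be illustrated with a few comparisons. This is a minimal sketch, assuming an implementation that follows the quoted text; basic_chars_match is a hypothetical name, and the characters tested are an arbitrary sample of the basic character set.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical check: for members of the basic character set, the wide
 * character constant and the plain character constant have the same value,
 * and the wide null character has the value zero. */
int basic_chars_match(void)
{
    return L'\0' == 0
        && L'A' == 'A'
        && L'z' == 'z'
        && L'0' == '0'
        && L'{' == '{';
}
```

Nothing comparable can be written for characters outside the basic set, because the standard does not say which values they get.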

> The only way I know of to properly load up a wchar_t string is to use
> mbstowcs().  Unless mbstowcs() converts UTF-8 into a specific codepage,
> it must be a form of Unicode, we just don't know what that encoding is.

Not necessarily true (on any given platform, I mean).  There are
character encodings that are a superset of Unicode (or at least a
superset of UCS-2); Emacs, for instance, uses a 19-bit encoding.

It's also quite possible to come up with other encodings without
"features" like CJK unification, or where there are multiple code
points for different variants of the same character (e.g. with and
without swashes).

> Is there any reason Apple can not say what encoding is used for wchar_t
> strings?  This information is sometimes useful, if for no other reason,
> so we can talk about this with other people and not be ignorant :-)

I just checked the libc sources and it looks to me like the data in a
wchar_t varies depending on the selected multibyte encoding... there
certainly aren't any tables visible that map Kanji, EUC or Big-5 into
Unicode, so relying on wchar_t containing Unicode values is probably
not a good idea.  Or, put more simply, *yes*, there is a reason that
Apple cannot say what encoding is used for wchar_t, because it depends
on the locale value.  (Note that nothing the C standard says implies
that the wchar_t values are valid across different locales, so this
behaviour doesn't violate the standard.)
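
To make the locale dependence concrete, here is a minimal sketch of a conversion wrapper around mbstowcs(). The name to_wide and its buffer convention are assumptions for illustration. The conversion uses whatever locale is current when it is called, so the numbers that end up in the output buffer are only meaningful under that locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert the multibyte string mb into out, which has
 * room for cap wide characters including the terminator.
 * Returns the number of wide characters written (excluding the terminator),
 * or (size_t)-1 if cap is 0 or mb is not valid in the current locale. */
size_t to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;

    /* Leave one slot free so the result can always be terminated. */
    n = mbstowcs(out, mb, cap - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = L'\0';
    return n;
}
```

For plain ASCII input in the "C" locale the stored values match the basic character set. For anything else, treat them as opaque unless the implementation documents otherwise.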

Having said that, if __STDC_ISO_10646__ is defined by the
implementation, then you can assume that wchar_t values are ISO 10646
(equivalent to Unicode) character values.  Microsoft notwithstanding,
of course (I bet they erroneously define it, or if they don't now, I
bet they will at some point in the future).
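
A minimal sketch of that test, wrapped in a function so it can be called at run time. The macro __STDC_ISO_10646__ is the standard one named above; the function name wchar_is_iso10646 is hypothetical.

```c
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
 * wchar_t values are ISO 10646 code values, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

A result of 0 does not prove that wchar_t holds something other than ISO 10646 values; it only means the implementation has not committed to it.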

Kind regards,

Alastair.

--
http://www.alastairs-place.net
