Re: printf functions fail with non-ascii characters
- Subject: Re: printf functions fail with non-ascii characters
- From: Alastair Houghton <email@hidden>
- Date: Wed, 5 Sep 2007 21:53:12 +0100
On 5 Sep 2007, at 19:22, Peter Mulholland wrote:
> The ANSI standard has missed a big opportunity here - basically, if
> you use any encoding other than the format that the standard library
> expects, forget it - it breaks.
The wide character API is a legacy part of the C standard; it was
never designed to be used with Unicode. This isn't really ANSI's
fault, because (AFAIK) it predates Unicode, and in particular the
decision that Unicode really would need more than 65536 code points
(which creates problems if you use UTF-16 wchar_ts).
ANSI *could* adopt a Unicode API for the C language, but given the
complexity of doing so and the widespread availability of existing
library code for dealing with Unicode, it would be rather redundant.
> For Apple this is UTF-32, which is
> particularly moronic of Apple when everyone else, even their own Core
> Foundation stuff, uses UTF-16!
That isn't true. If by "everyone else" you mean Microsoft (which
seems to be a very common "everyone else", particularly on Apple
mailing lists), then you should be aware that some Microsoft software
only supports UCS-2, not UTF-16 (i.e. they still don't support
surrogates everywhere; SQL Server, for instance, doesn't handle
surrogate pairs correctly in every case). If by "everyone else" you
were including other Unixen or Linux (and you should), then it's also
not true, because on most of those platforms wchar_t is 32 bits in
size. And
that's completely ignoring the fact that Far Eastern versions of many
operating systems have in the past used totally different wide
character sets.
Certainly it's preferable to use UTF-16, because it takes up no more
space than UTF-32, even with characters outside of the BMP, and it
usually takes up half the space. Plus it's often no slower to deal
with UTF-16 properly than it is to deal with UTF-32 properly because
of the existence of combining characters, not to mention other
Unicode complications. However, all of this belongs outside of the C
standard library's wchar routines, because those fulfil a specific
purpose, namely ensuring that a single "multi-byte character" has a
one-to-one mapping with a "wide character". If you make wchar_t
contain UTF-16 code units, then you break that, because in that case
it's possible that you would need two wchar_t values for a single
"multi-byte character". That has lots of unpleasant consequences
because the wide character API was never designed to work that way.
Using the wchar routines is often a mistake anyway. You can't
guarantee that a wchar_t will necessarily contain a Unicode code
point value, because some systems (particularly in the Far East)
provide other wide character systems (e.g. some variant of JIS or
Big5). The wide character support in the C library is also missing a
large amount of functionality that you *need* if you're going to use
Unicode. For example, there's no way to tell the library when
comparing two wide character strings that you want to compare their
canonical representations, or that you'd like to compare them using
phone book ordering.
In practice, if you want good handling of Unicode characters, you
either (a) code it yourself, referring copiously to the Unicode book,
or (b) use a library like CoreFoundation or ICU. (b) is an *awful
lot* less effort...
Kind regards,
Alastair.
--
http://alastairs-place.net
_______________________________________________
Xcode-users mailing list (email@hidden)