Re: printf functions fail with non-ascii characters
- Subject: Re: printf functions fail with non-ascii characters
- From: Alastair Houghton <email@hidden>
- Date: Thu, 6 Sep 2007 12:59:03 +0100
On 6 Sep 2007, at 09:01, Peter Mulholland wrote:
Wednesday, September 5, 2007, 9:53:12 PM, you wrote:
The wide character API is a legacy part of the C standard; it was
never designed to be used with Unicode. This isn't really ANSI's
fault, because (AFAIK) it predates Unicode, and in particular the
decision that Unicode really would need more than 65536 code points
(which creates problems if you use UTF-16 wchar_ts).
True, it would be a problem where wchar_t wouldn't necessarily be able
to hold 1 character, but they COULD have made it work well enough for
90% of cases.
And in the process they would have broken any existing code that
relied on the semantics that they had already defined in the C standard.
Standards bodies generally don't break things that they've already
specified, for very good and hopefully obvious reasons.
It's a shame something like libunicode didn't take off.
You *have* looked at ICU, right? It's pretty much *the* canonical
implementation of the Unicode spec; indeed, while they only say on
the ICU pages that ICU "closely tracks the Unicode standard", the
fact is that there is an awful lot of interplay between the ICU
project and the Unicode standard itself. I'm pretty certain it's
used as a reference implementation in many cases.
An earlier version of the ICU code also formed the basis of the Java
Unicode support, I believe, though the Java version of ICU has
substantially extended that since.
ANSI *could* adopt a Unicode API for the C language, but given the
complexity of doing so and the widespread availability of existing
library code for dealing with Unicode, it would be rather redundant.
They should do so if they want Unicode to be adopted in a portable
fashion. Currently, there is no portable way to do it.
ICU is portable, including to Windows. Parts of CoreFoundation are
portable too. See:
<http://icu-project.org/>
<http://developer.apple.com/opensource/cflite.html>.
That isn't true. If by "everyone else" you mean Microsoft (which
seems to be a very common "everyone else", particularly on Apple
mailing lists), then you should be aware that some Microsoft software
only supports UCS-2, not UTF-16 (i.e. they still don't support
surrogates everywhere; SQL Server, for instance, doesn't work right
in every instance).
I'm not talking about software, and Microsoft have had Unicode in
their API since Windows 2000. They were also smart enough to make
their C widechar stuff work with the Win32 Unicode stuff
This is something of a historical accident, rather than a conscious
decision. At the time Microsoft was implementing Windows NT, Unicode
only supported 65536 code points. As a result, at the time they
wouldn't have been breaking the C library by using a 16-bit Unicode
wchar_t (and could legitimately claim that NT had Unicode support,
since at that point surrogate pairs didn't exist). This is also the
reason that a large amount of Microsoft software *still* only
supports UCS-2.
Once Microsoft had released Windows NT with the C library this way,
they couldn't then (easily) change things without breaking binary
compatibility, with obvious consequences.
- unlike Apple. I have to do EXTRA work if I want to mix the two.. not
so on Windows.
You *don't* want to mix C wide characters with other APIs. Not if
you want to be portable, anyway. And if you don't want to be
portable, why do you care anyway?
Using the wchar routines is often a mistake anyway. You can't
guarantee that a wchar_t will necessarily contain a Unicode code
point value, because some systems (particularly in the Far East)
provide other wide character systems (e.g. some variant of JIS or
Big5).
Very rarely do you want to parse just ONE char though, usually you are
printing strings. For this, the wchar routines COULD be workable.
Sure, you couldn't guarantee that wchar_t would hold one character,
but for most cases it would work just fine - a lot better than it does
now!
The wide character routines are a legacy solution to a somewhat
different problem. The fact that they use a 32-bit wchar_t on most
Unicode-enabled Unix systems means that a lot of legacy code from the
Far East (where these kinds of issues are *far* more important than
they ever have been for English speaking countries in particular)
will work without changes (provided you only use pre-composed
characters, which isn't too much of a problem for Chinese or
Japanese, or---if I remember correctly---Korean).
On Windows, such code will only work assuming you stick to UCS-2 (and
that will make a lot of people in China and Japan quite cross because
many people's names require characters outside the BMP... to get some
idea of how annoying that is, imagine if everyone called you "Potor"
because they couldn't type a letter "e").
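To make the UCS-2 limitation concrete, here is a minimal sketch of UTF-16 encoding in C. The function name utf16_encode is made up for illustration and is not part of any library mentioned in this thread. It shows that a code point outside the BMP needs two 16-bit units (a surrogate pair), which is exactly what code assuming "one 16-bit wchar_t == one character" cannot represent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 16-bit units into out[] and returns how many were
 * written, or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                   /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                   /* beyond the Unicode range */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;      /* BMP: a single unit, same as UCS-2 */
        return 1;
    }

    /* Supplementary plane: split the 20-bit offset into two halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}
```

For example, U+00E9 (e with acute) fits in one unit, whereas U+20BB7, a CJK ideograph outside the BMP, comes out as the pair 0xD842 0xDFB7.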
In practice, if you want good handling of Unicode characters, you
either (a) code it yourself, referring copiously to the Unicode book,
or (b) use a library like CoreFoundation or ICU. (b) is an *awful
lot* less effort...
Basically what you're saying is, a) do a lot of hard work, or b) make
your code non-portable.
You seem to be under the wholly mistaken impression that the C
library's wide character routines could ever have been a portable way
to manipulate Unicode. They can't, because of (a) legacy use
[including with other character sets] and (b) the design of the wide
character API. The only thing they can portably be used for is the C
standard's "wide characters", which are of unspecified character
set... the only things you can portably do with wide characters are
those things specified in the C standard (i.e. converting to/from an
unspecified multi-byte character set and using them with the C
library wide character APIs).
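As a rough sketch of what that portable subset looks like in practice: the two helper names below (wchar_is_iso10646 and to_wide) are invented for this example. The first relies on the C99 macro __STDC_ISO_10646__, which an implementation defines only if it promises that wchar_t values are ISO 10646 code points. The second converts a multibyte string to wide characters in whatever encoding the current locale uses.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if the implementation promises that
 * wchar_t values are ISO 10646 (Unicode) code points, 0 if it makes
 * no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: convert a multibyte string (in the current
 * locale's encoding) to a null-terminated wide string in out[], which
 * has room for cap elements. Returns the number of wide characters
 * written, not counting the terminator, or (size_t)-1 on an invalid
 * multibyte sequence or a zero-sized buffer. Input that does not fit
 * is truncated. */
size_t to_wide(const char *mbs, wchar_t *out, size_t cap)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;

    n = mbstowcs(out, mbs, cap - 1);
    if (n == (size_t)-1)
        return n;

    out[n] = L'\0';     /* mbstowcs does not terminate a full buffer */
    return n;
}
```

A program would normally call setlocale(LC_ALL, "") first so that the user's encoding applies. Even then, when wchar_is_iso10646() returns 0, the values that to_wide() produces should be treated as opaque wide characters rather than as Unicode code points.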
Besides, ICU is portable. If you need Unicode support and need to
support multiple platforms, ICU is probably what you want.
Kind regards,
Alastair.
--
http://alastairs-place.net
_______________________________________________
Xcode-users mailing list (email@hidden)