Re: printf functions fail with non-ascii characters

Subject: Re: printf functions fail with non-ascii characters
From: Alastair Houghton <email@hidden>
Date: Thu, 6 Sep 2007 12:59:03 +0100

On 6 Sep 2007, at 09:01, Peter Mulholland wrote:

Wednesday, September 5, 2007, 9:53:12 PM, you wrote:

The wide character API is a legacy part of the C standard; it was
never designed to be used with Unicode.  This isn't really ANSI's
fault, because (AFAIK) it predates Unicode, and in particular the
decision that Unicode really would need more than 65536 code points
(which creates problems if you use UTF-16 wchar_ts).


True, it would be a problem where wchar_t wouldn't necessarily be able
to hold 1 character, but they COULD have made it work well enough for
90% of cases.

And in the process they would have broken any existing code that relied on the semantics that they had already defined in the C standard.

Standards bodies generally don't break things that they've already specified, for very good and hopefully obvious reasons.

It's a shame something like libunicode didn't take off.

You *have* looked at ICU, right? It's pretty much *the* canonical implementation of the Unicode spec; indeed, while they only say on the ICU pages that ICU "closely tracks the Unicode standard", the fact is that there is an awful lot of interplay between the ICU project and the Unicode standard itself. I'm pretty certain it's used as a reference implementation in many cases.

An earlier version of the ICU code also formed the basis of the Java Unicode support, I believe, though the Java version of ICU has substantially extended that since.

ANSI *could* adopt a Unicode API for the C language, but given the
complexity of doing so and the widespread availability of existing
library code for dealing with Unicode, it would be rather redundant.


They should do so if they want Unicode to be adopted in a portable
fashion. Currently, there is no portable way to do it.

ICU is portable, including to Windows. Parts of CoreFoundation are portable too. See:

  <http://icu-project.org/>
  <http://developer.apple.com/opensource/cflite.html>.

That isn't true.  If by "everyone else" you mean Microsoft (which
seems to be a very common "everyone else", particularly on Apple
mailing lists), then you should be aware that some Microsoft software
only supports UCS-2, not UTF-16 (i.e. they still don't support
surrogates everywhere; SQL Server, for instance, doesn't work right
in every instance).

I'm not talking about software, and Microsoft have had Unicode in their API since Windows 2000. They were also smart enough to make their C widechar stuff work with the Win32 Unicode stuff

This is something of a historical accident, rather than a conscious decision. At the time Microsoft was implementing Windows NT, Unicode only supported 65536 code points. As a result, at the time they wouldn't have been breaking the C library by using a 16-bit Unicode wchar_t (and could legitimately claim that NT had Unicode support, since at that point surrogate pairs didn't exist). This is also the reason that a large amount of Microsoft software *still* only supports UCS-2.

Once Microsoft had released Windows NT with the C library this way, they couldn't then (easily) change things without breaking binary compatibility, with obvious consequences.

- unlike Apple. I have to do EXTRA work if I want to mix the two... not so on Windows.

You *don't* want to mix C wide characters with other APIs. Not if you want to be portable, anyway. And if you don't want to be portable, why do you care anyway?

Using the wchar routines is often a mistake anyway.  You can't
guarantee that a wchar_t will necessarily contain a Unicode code
point value, because some systems (particularly in the Far East)
provide other wide character systems (e.g. some variant of JIS or
Big5).


Very rarely do you want to parse just ONE char though, usually you are
printing strings. For this, the wchar routines COULD be workable.
Sure, you couldn't guarantee that wchar_t would hold one character,
but for most cases it would work just fine - a lot better than it does
now!

The wide character routines are a legacy solution to a somewhat different problem. The fact that they use a 32-bit wchar_t on most Unicode-enabled Unix systems means that a lot of legacy code from the Far East (where these kinds of issues are *far* more important than they ever have been for English speaking countries in particular) will work without changes (provided you only use pre-composed characters, which isn't too much of a problem for Chinese or Japanese, or, if I remember correctly, Korean).

On Windows, such code will only work assuming you stick to UCS-2 (and that will make a lot of people in China and Japan quite cross because many people's names require characters outside the BMP... to get some idea of how annoying that is, imagine if everyone called you "Potor" because they couldn't type a letter "e").

In practice, if you want good handling of Unicode characters, you
either (a) code it yourself, referring copiously to the Unicode book,
or (b) use a library like CoreFoundation or ICU.  (b) is an *awful
lot* less effort...


Basically what you're saying is, a) do a lot of hard work, or b) make
your code non-portable.

You seem to be under the wholly mistaken impression that the C library's wide character routines could ever have been a portable way to manipulate Unicode. They can't, because of (a) legacy use [including with other character sets] and (b) the design of the wide character API. The only thing they can portably be used for is the C standard's "wide characters", which are of unspecified character set... the only things you can portably do with wide characters are those things specified in the C standard (i.e. converting to/from an unspecified multi-byte character set and using them with the C library wide character APIs).

Besides, ICU is portable. If you need Unicode support and need to support multiple platforms, ICU is probably what you want.

Kind regards,

Alastair.

--
http://alastairs-place.net


