Re: printf functions fail with non-ascii characters
- Subject: Re: printf functions fail with non-ascii characters
- From: Alastair Houghton <email@hidden>
- Date: Wed, 5 Sep 2007 21:53:12 +0100
On 5 Sep 2007, at 19:22, Peter Mulholland wrote:
> The ANSI standard has missed a big opportunity here - basically, if
> you use any encoding other than the format that the standard library
> expects, forget it - it breaks.
The wide character API is a legacy part of the C standard; it was
never designed to be used with Unicode. This isn't really ANSI's
fault, because (AFAIK) it predates Unicode, and in particular the
decision that Unicode really would need more than 65536 code points
(which creates problems if you use UTF-16 wchar_ts).
ANSI *could* adopt a Unicode API for the C language, but given the
complexity of doing so and the widespread availability of existing
library code for dealing with Unicode, it would be rather redundant.
> For Apple this is UTF-32, which is
> particularly moronic of Apple when everyone else, even their own Core
> Foundation stuff, uses UTF-16!
That isn't true. If by "everyone else" you mean Microsoft (which
seems to be a very common "everyone else", particularly on Apple
mailing lists), then you should be aware that some Microsoft software
only supports UCS-2, not UTF-16 (i.e. they still don't support
surrogates everywhere; SQL Server, for instance, doesn't handle
surrogate pairs correctly in every case). If by "everyone else" you
were including other Unixen or Linux (and you should), then it's also
not true, because on most of those platforms wchar_t is 32 bits in
size. And
that's completely ignoring the fact that Far Eastern versions of many
operating systems have in the past used totally different wide
character sets.
Certainly it's preferable to use UTF-16, because it takes up no more
space than UTF-32, even with characters outside of the BMP, and it
usually takes up half the space. Plus it's often no slower to deal
with UTF-16 properly than it is to deal with UTF-32 properly because
of the existence of combining characters, not to mention other
Unicode complications. However, all of this belongs outside of the C
standard library's wchar routines, because those fulfil a specific
purpose, namely ensuring that a single "multi-byte character" has a
one-to-one mapping with a "wide character". If you make wchar_t
contain UTF-16 code units, then you break that, because in that case
it's possible that you would need two wchar_t values for a single
"multi-byte character". That has lots of unpleasant consequences
because the wide character API was never designed to work that way.
Using the wchar routines is often a mistake anyway. You can't
guarantee that a wchar_t will necessarily contain a Unicode code
point value, because some systems (particularly in the Far East)
provide other wide character systems (e.g. some variant of JIS or
Big5). The wide character support in the C library is also missing a
large amount of functionality that you *need* if you're going to use
Unicode. For example, there's no way to tell the library when
comparing two wide character strings that you want to compare their
canonical representations, or that you'd like to compare them using
phone book ordering.
In practice, if you want good handling of Unicode characters, you
either (a) code it yourself, referring copiously to the Unicode book,
or (b) use a library like CoreFoundation or ICU. (b) is an *awful
lot* less effort...
Kind regards,
Alastair.
--
http://alastairs-place.net
_______________________________________________
Xcode-users mailing list (email@hidden)