Re: iso-8859-1 over UTF8 (was: Re: cString deprecated!)
- Subject: Re: iso-8859-1 over UTF8 (was: Re: cString deprecated!)
- From: Chris Hanson <email@hidden>
- Date: Wed, 4 Sep 2002 02:44:05 -0500
At 8:45 AM +0200 9/4/02, Allan Odgaard wrote:
> As I said then, I (only) use it when I need a "char *", e.g. for
> sscanf(), regex functions and similar.
> I think you'd be pretty screwed here with a UTF-8 string consisting
> of Japanese characters, because the functions look for control codes
> and similar and are not multi-byte aware, so they might easily
> mistake a multi-byte sequence for one or more control sequences, or
> part of a multi-byte sequence for the "argument" of a control code,
> etc.
No, it won't. Control codes are low ASCII values. Every byte in a
multi-byte UTF-8 sequence has its high bit set and is thus 128 or
above, so no software should mistake parts of a multi-byte sequence
for control codes. (This is part of why UTF-8 can take up to 3 bytes
to represent a single 2-byte Unicode character.) And UTF-8 never
produces a zero byte except for the NUL character itself (ASCII 0),
so UTF-8 strings are safe to use with a NUL terminator and with the
Standard C string functions, which simply treat them as bytes.
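The high-bit property is easy to check for yourself. Here is a minimal sketch in C; the helper names (count_control_bytes, count_code_points, parse_key_value) and the sample string are made up for illustration and are not part of any library. It assumes the input is already valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* "日本語" spelled out as its nine UTF-8 bytes (three bytes per character). */
static const char kJapanese[] = "\xE6\x97\xA5" "\xE6\x9C\xAC" "\xE8\xAA\x9E";

/* Count bytes that are ASCII control codes (0x01-0x1F or 0x7F).
   Bytes of a multi-byte UTF-8 sequence are all >= 0x80, so they never match. */
static size_t count_control_bytes(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if (*p < 0x20 || *p == 0x7F)
            n++;
    }
    return n;
}

/* Count characters (code points): every byte that is not a continuation
   byte (bit pattern 10xxxxxx) starts a new one. */
static size_t count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Parse "key=number" with plain sscanf(). The scanset works byte by byte,
   and no byte of a multi-byte sequence can equal '='. Returns 1 on success. */
static int parse_key_value(const char *line, char key[32], int *value)
{
    assert(line != NULL && key != NULL && value != NULL);
    return sscanf(line, "%31[^=]=%d", key, value) == 2;
}
```

With these, count_control_bytes(kJapanese) finds nothing to complain about, and parse_key_value() splits a line whose key is Japanese text just as it would an ASCII key. The one thing to keep in mind is that strlen(kJapanese) reports bytes, not characters, which is what count_code_points() is for.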
The people that designed UTF-8 put quite a bit of thought into it.
Unless you have a *very* good reason *not* to use it in a *specific*
case, you should use it.
> However, my statement was not meant to criticize the Mac, but merely
> to state that for stuff that may cross platforms (e.g. some network
> protocols don't allow you to specify an encoding scheme),
> iso-8859-1 is a rather safe bet.
As time goes on, UTF-8 is becoming the encoding of choice. I believe
all new Internet protocols, for instance, are required not only to be
8-bit clean but also to support UTF-8 as an encoding. And I
believe all modern platforms either use Unicode strings natively (and
thus support UTF-8 encoding) or have the ability to translate UTF-8
strings into their native encoding.
> Sorry, I really meant many of the tools accompanying the OS -- not
> the kernel itself.
The tools tend not to care about encoding. However, Terminal.app
defaults to using UTF-8 encoding as of Jaguar. So your best bet is
to use UTF-8 encoding everywhere except in the specific cases where
you know you *must* use another encoding.
-- Chris
--
Chris Hanson | Email: email@hidden
bDistributed.com, Inc. | Phone: +1-847-372-3955
Making Business Distributed | Fax: +1-847-589-3738
http://bdistributed.com/ | Personal Email: email@hidden
_______________________________________________
cocoa-dev mailing list | email@hidden