Unicode mapping of strings: worth the effort?
- Subject: Unicode mapping of strings: worth the effort?
- From: James Quick <email@hidden>
- Date: Sun, 20 Jul 2003 19:44:52 -0400
Does anyone out there know about Unicode and the mapping functions?
I am designing some functions for processing the content of strings.
Some of my processing requires the use of regular expressions, and I may
want to use functions in the printf family as well.
It seems that NSNonLossyASCIIStringEncoding and NSUTF8StringEncoding are
the only encodings which guarantee that a 7-bit ASCII character value
will not occur anywhere inside a multi-byte sequence, and which won't
destroy any information along the way. So before calling standard
library string functions I obviously need to retrieve the string
contents in one of those forms.
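For example, something like this is roughly what I have in mind (just a
sketch; the helper function and the ':' search are only illustrations):

#import <Foundation/Foundation.h>
#include <string.h>

// Sketch only: hand an NSString's contents to a C library routine.
// With UTF-8, no byte of a multi-byte sequence falls in the 7-bit
// ASCII range, so a byte-wise search for ':' can only hit a real
// U+003A, never the middle of some other character.
static BOOL containsColon(NSString *s)
{
    const char *bytes = [s UTF8String];   // autoreleased UTF-8 buffer
    return (bytes != NULL && strchr(bytes, ':') != NULL);
}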
However, there is apparently still the possibility that two distinct
copies of a string which are visually identical may have two distinct
binary representations. To be sure that the strings are byte-wise the
same, they need to be converted to a standard form. Unfortunately,
there are four such standard forms, and I am still not clear which to
use, or whether it's worth using them at all.
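If I understand the headers correctly, NSString already exposes these
forms as methods, so a byte-wise comparison would presumably look
something like this (just a sketch; the helper name is mine):

#import <Foundation/Foundation.h>

// Sketch: reduce both strings to the same standard form (canonical,
// precomposed) before comparing, so that "e" + combining acute and
// the single precomposed e-acute character compare as equal.
static BOOL canonicallyEqual(NSString *a, NSString *b)
{
    NSString *na = [a precomposedStringWithCanonicalMapping];
    NSString *nb = [b precomposedStringWithCanonicalMapping];
    return [na isEqualToString:nb];
}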
I read the NSString documentation, a number of FAQs on the net, and the
technical report "UAX #15: Unicode Normalization Forms":
http://www.unicode.org/reports/tr15/tr15-23.html
There are two versions of canonical mapping and two versions of
compatibility mapping, and I am not sure I get the point. It seems that
the canonical forms preserve how a string looks, whereas the
compatibility forms also fold together characters that merely share the
same intent.
From the above spec:
"For example, the half-width and full-width katakana characters will
have the same compatibility decomposition and are thus compatibility
equivalents; however, they are not canonical equivalents."
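So to catch that katakana case as well, I would apparently have to use
the compatibility variant instead (same caveat, just a sketch):

// Sketch: the compatibility mapping additionally folds half-width vs.
// full-width katakana, ligatures, and the like, which the canonical
// mapping deliberately leaves distinct.
static BOOL compatibilityEqual(NSString *a, NSString *b)
{
    NSString *na = [a precomposedStringWithCompatibilityMapping];
    NSString *nb = [b precomposedStringWithCompatibilityMapping];
    return [na isEqualToString:nb];
}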
Now, I've always been an ASCII guy; I don't know jack about Unicode,
and even less about how common this kind of thing is likely to be. It
may very well be that I never do any mapping, but in that case I should
at least be able to tell my users: sorry, no dice, just make sure that
strings which have multiple composed or decomposed representations use
the same Unicode representation and you should be fine.
For now, I've documented the hole in the code, so that I know what's
missing and what to tell users, but I am curious how likely a Dutch,
German, or Japanese user is to stumble on it. I can add it in a later
version, but first I need to know whether it is likely to be worth the
cost. I'd rather document the constraints on acceptable input than slow
things down for everybody if the problem is unlikely to be an issue.
Does anyone know whether mapping is likely to be an issue?
If it is, would you go for appearance or compatibility?
My hunch is to go for canonical if I ever plan to redisplay the string,
and compatibility otherwise, but I'm not sure.
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.