Re: Unicode mapping of strings: worth the effort?
- Subject: Re: Unicode mapping of strings: worth the effort?
- From: Pete French <email@hidden>
- Date: Mon, 21 Jul 2003 18:23:43 +0100
> Does anyone out there know about Unicode and the mapping functions?
In general, yes.
> along the way. So, before calling std library string functions I
> obviously need to retrieve the string contents in one of those forms,
You want to use UTF-8 - it's the most Unix-compatible of the encoding forms
and is generally what people use. It's also the default character set for XML.
> There are two versions of canonical mapping, and two versions of compatible
> mapping, and I am not sure I get the point. It seems that Canonical
> mappings would
Basically it comes down to you needing to define what 'the same' means for
your application. What you probably want to do is use the decomposed
compatible forms of all the strings. What I do is hold all my strings in this
form internally, and I only make the conversion on anything which the user
types in.
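A rough sketch of that convert-at-the-boundary approach, using Python's
unicodedata module to stand in for whatever mapping call your framework
provides (on Cocoa that would be NSString's decomposition methods):

```python
import unicodedata

def canonicalize(user_input: str) -> str:
    """Normalize text once, at the point it enters the program.

    NFKD = compatibility decomposition: ligatures and superscripts
    are replaced by their plain equivalents, and accented letters
    are split into base letter + combining mark.
    """
    return unicodedata.normalize("NFKD", user_input)

# Two visually identical inputs become byte-identical after normalizing:
typed  = "caf\u00e9"    # precomposed e-acute (U+00E9)
pasted = "cafe\u0301"   # 'e' followed by combining acute (U+0301)
assert typed != pasted
assert canonicalize(typed) == canonicalize(pasted)
```

Everything stored internally is then already in one known form, so plain
byte comparison works.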
Using the decomposed form won't lose anything - what you get is that if
a single character is typed for something like an A with a circle over the
top, then that will be decomposed to an English capital A followed by the
character for the circle. Similarly with things like an accented 'e' at the
end of 'cafe'. One of the advantages of that is that you can crudely strip
out the 7-bit ASCII for printing and you just lose the accents off the top of
the letters rather than whole letters.
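That decompose-then-strip trick can be sketched like this (again in
Python's unicodedata, as an illustration rather than the exact Cocoa calls):

```python
import unicodedata

def strip_to_ascii(s: str) -> str:
    """Crudely drop accents: decompose, then keep only 7-bit ASCII.

    After canonical decomposition (NFD) the accents live in separate
    combining characters, so discarding non-ASCII code points removes
    the accent but keeps the base letter.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if ord(ch) < 128)

# U+00C5 (A with ring above) decomposes to 'A' + U+030A (combining ring):
assert unicodedata.normalize("NFD", "\u00c5") == "A\u030a"
# The accent is dropped but the letter survives:
assert strip_to_ascii("caf\u00e9") == "cafe"
```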
The compatibility form takes single characters which represent pairs of
letters and turns them into the component letters. It also replaces letters
with equivalent letters. For example, a superscripted '2' just becomes a
normal character '2'. Whether this is what you want rather depends on whether
you expect your users to be typing in ligatures and superscripted characters,
and what you would then do with the text under those circumstances (i.e.
whether losing the superscript matters). I always map the compatibility
characters so that if someone cuts and pastes the word 'fine' from a document
where the 'fi' is a single ligatured character, then it comes out as the
characters 'f','i','n','e' in my code.
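A small Python illustration of the difference - canonical decomposition
leaves the ligature and superscript alone, while compatibility
decomposition folds them into plain characters:

```python
import unicodedata

# Canonical decomposition (NFD) leaves compatibility characters alone;
# compatibility decomposition (NFKD) maps them to plain equivalents.
word = "\ufb01ne"   # 'fi' ligature (U+FB01) followed by 'n', 'e'
assert unicodedata.normalize("NFD", word) == "\ufb01ne"   # unchanged
assert unicodedata.normalize("NFKD", word) == "fine"      # f, i, n, e

# A superscript '2' (U+00B2) likewise becomes an ordinary '2':
assert unicodedata.normalize("NFKD", "x\u00b2") == "x2"
```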
> Now I've always been an ascii guy, I don't know jack about Unicode, and
> even less about how common this kind of thing is likely to be. It may very
Pretty common - there are more accents in non-US English than you might
think, never mind any of the other languages. But whether those cases are
going to be common in your specific case rather depends on the app and the
uses to which it is put. If you are doing anything where a customer is going
to type in their name then you are going to end up with lots of accented
characters though, and people get very irritated if their names are spelled
wrongly :-)
> My hunch is go for Canonical if I ever plan to redisplay it,
> and go for Compatible otherwise, but I'm not sure.
Sounds reasonable - though depending on the data you don't really lose
much by using compatible all the time. If someone did type a superscripted
character into an input box, would they really be justified in expecting that
superscripting to be carried all the way through the code?
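Put another way, here is roughly how the two choices behave when used as
equality tests (a Python sketch of the same idea):

```python
import unicodedata

def same_canonical(a: str, b: str) -> bool:
    # Canonical: treats precomposed/decomposed spellings as equal,
    # but keeps a superscript distinct from a plain digit.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def same_compatible(a: str, b: str) -> bool:
    # Compatible: additionally folds ligatures and superscripts away.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

assert same_canonical("caf\u00e9", "cafe\u0301")   # same word, two spellings
assert not same_canonical("x\u00b2", "x2")         # superscript preserved
assert same_compatible("x\u00b2", "x2")            # superscript folded away
```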
-bat.
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.