Re: Problem with rangeOfString and Umlauts
- Subject: Re: Problem with rangeOfString and Umlauts
- From: Gabriel Zachmann via Cocoa-dev <email@hidden>
- Date: Mon, 14 Mar 2022 17:04:28 +0100
Thanks a lot for your insights!
(I'm cc'ing the mailing list, just in case someone else later stumbles across
this.)
Best, G.
> On 14. Mar 2022, at 14:57, Aandi Inston <email@hidden> wrote:
>
> This is largely from memory, so details might be wrong.
> Normalisation is an insufficiently known thing to consider when working with
> Unicode. (We all know that Unicode is a list of code points, i.e. integers.)
>
> Here are some Unicode points for this discussion:
> U+0065 "e" Latin Small Letter E
> U+00E9 "é" Latin Small Letter E with Acute
> U+0301 "◌́" Combining Acute Accent (this may not display as expected)
> Many languages have accents that change letters, so we have "e" plus "acute"
> to get "e acute". In Unicode there are two ways to get "e acute". One is the
> single Unicode point U+00E9. The other is the TWO characters "e" and
> "combining acute accent", so U+0065 followed by U+0301. U+0301 does not take
> any space for itself, but puts an acute accent over the character before
> it. (Not all accented letters have two representations like this, and some
> have more than two.)
> So, what's the difference between U+00E9 versus U+0065 followed by U+0301?
> They will look the same, but a string with the second form will be one code
> point longer, and the offsets of all characters after it will shift.
> Are they equal? Well, no, not in simple terms, because they are different
> lists of code points.
> Do we get both? YES. HFS+, the older Mac file system, stores the long form. On a French
> keyboard, if you type e acute, you get the short form. If you copy and paste,
> it could be either. This can be bad. For example, if you get a list of the files
> in a folder, and allow the user to type a name to choose the file, there
> might not be a match, even though the user can see one.
> To get over this, we have normalisation forms. Unicode defines four of them:
> C, D, KC and KD. precomposedStringWithCanonicalMapping converts to form C. The
> details of form C don't matter much here; what matters is that if you run all
> your strings through precomposedStringWithCanonicalMapping, you will get more
> expected results when comparing strings.
>
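For anyone who finds this thread later: here is a minimal sketch of the points
above, written in Python rather than Cocoa so it can be run anywhere. It relies
only on Python's standard unicodedata module; the file name and the typed text
are made-up examples, and nfc() is just a helper name chosen for this sketch.
Form C is the normalisation form that precomposedStringWithCanonicalMapping is
described as producing above.

```python
import unicodedata

# Two ways to write "e acute" (see the code points listed above).
composed = "\u00e9"      # single code point U+00E9
decomposed = "e\u0301"   # U+0065 followed by combining U+0301

def nfc(text: str) -> str:
    """Return text converted to Unicode normalisation form C."""
    return unicodedata.normalize("NFC", text)

# Hypothetical example: a file name stored in the long (decomposed) form,
# and the short (composed) form as a user might type it.
filename = "cafe\u0301.txt"
typed = "caf\u00e9"

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False: different code point lists
print(filename.find(".txt"))            # 5: offsets shift by one
print(nfc(filename).find(".txt"))       # 4
print(typed in filename)                # False: no match, though it looks the same
print(nfc(typed) in nfc(filename))      # True: match after normalising both
```

In Cocoa, the corresponding step would be to run both strings through
precomposedStringWithCanonicalMapping before comparing or searching.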