Re: Problem with rangeOfString and Umlauts

  • Subject: Re: Problem with rangeOfString and Umlauts
  • From: Gabriel Zachmann via Cocoa-dev <email@hidden>
  • Date: Mon, 14 Mar 2022 17:04:28 +0100

Thanks a lot for your insights!

(I'm cc'ing the mailing list, just in case someone else stumbles across this
later.)

Best, G.


> On 14. Mar 2022, at 14:57, Aandi Inston <email@hidden> wrote:
>
> This is largely from memory, so details might be wrong.
> Normalisation is an insufficiently known thing to consider when working with
> Unicode. (We all know that Unicode is a list of code points, i.e. integers.)
>
> Here are some Unicode points for this discussion:
> U+0065 "e" Latin Small Letter E
> U+00E9 "é" Latin Small Letter E with Acute
> U+0301 "◌́" Combining Acute Accent (this may not display as expected)
> Many languages have accents that change letters, so we have "e" plus "acute"
> to get "e acute". In Unicode there are two ways to get "e acute". One is the
> single Unicode point U+00E9. The other is the TWO characters "e" and
> "combining acute accent", so U+0065 followed by U+0301. U+0301 does not take
> any space for itself, but puts an acute accent over the character before it.
> (Not all accented letters have two representations like this, and some have
> more than two.)
> So, what's the difference between U+00E9 versus U+0065 followed by U+0301?
> They will look the same, but a string with the second form will be 1
> character longer, and the offsets of all characters after it will be shifted.
> Are they equal? Well, no, not in simple terms, because they are different
> lists of characters.
> Do we get both? YES. File names read back from a Mac file system can be in
> the long form (HFS+ always stores them that way). On a French keyboard, if
> you type e acute, you get the short form. If you copy and paste, it could be
> either. This can be bad. For example, if you get a list of the files in a
> folder, and allow the user to type a name to choose the file, there might
> not be a match, even though the user can see one.
> To get over this, we have "normalization" forms. There are four of them:
> C, D, KC and KD. precomposedStringWithCanonicalMapping converts to form C.
> The details of form C don't really matter here; what matters is that if you
> run all your strings through precomposedStringWithCanonicalMapping before
> comparing or searching, both spellings of "e acute" end up as the same code
> points, and you get the results you expect.
>
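
For anyone who finds this thread later, here is a minimal sketch of the effect
described above. It is written in Python rather than Cocoa, and it assumes that
unicodedata.normalize("NFC", ...) plays the same role as
precomposedStringWithCanonicalMapping (both produce form C). The file name, the
query and the helper contains_normalized are made up for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as the single code point U+00E9
decomposed = "e\u0301"   # "e" followed by U+0301 (combining acute accent)


def nfc(s):
    """Return s in Unicode normalization form C (precomposed)."""
    return unicodedata.normalize("NFC", s)


def contains_normalized(haystack, needle):
    """Substring test after converting both strings to form C."""
    return nfc(needle) in nfc(haystack)


# The two spellings look the same but differ in length and are not equal.
print(len(precomposed), len(decomposed))    # 1 2
print(precomposed == decomposed)            # False

# After normalization they compare equal.
print(nfc(precomposed) == nfc(decomposed))  # True

# Hypothetical example: a file name in the long form, a typed query in the
# short form. A plain search misses; a normalized search finds it.
filename = "cafe\u0301.txt"
query = "caf\u00e9"
print(query in filename)                     # False
print(contains_normalized(filename, query))  # True
```

The same idea should carry over to rangeOfString: normalize both the string
being searched and the search term to the same form first. Keep in mind that
offsets found in the normalized string refer to the normalized string, not to
the original one.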
