Re: Problem with rangeOfString and Umlauts
- Subject: Re: Problem with rangeOfString and Umlauts
- From: Aandi Inston via Cocoa-dev <email@hidden>
- Date: Mon, 14 Mar 2022 13:59:04 +0000
This is largely from memory, so details might be wrong.
Normalisation is something too few people know to consider when working
with Unicode. (We all know that Unicode text is a list of code points,
i.e. integers.)
Here are some Unicode code points for this discussion:
U+0065 "e" Latin Small Letter E
U+00E9 "é" Latin Small Letter E with Acute
U+0301 "◌́" Combining Acute Accent - this may not display as expected on its
own; some software draws it over a dotted "ghost" circle that is not part
of the text.
Many languages have accents that change letters, so we have "e" plus
"acute" to get "e acute". In Unicode there are two ways to get "e acute".
One is the single code point U+00E9. The other is the TWO code points "e"
and "combining acute accent", so U+0065 followed by U+0301. U+0301 does not
take any space for itself, but puts an acute accent over the character
before it. (Not all accented letters have two representations like this, and
some have more than two.)
So, what's the difference between U+00E9 and U+0065 followed by U+0301?
They will look the same, but a string with the second form will be 1
character longer, and the offsets of all characters after it will change.
Are they equal? Well, no, not in simple terms, because they are different
lists of code points.
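
To make the two forms concrete, here is a small sketch. It is also from
memory, so please try it before relying on it. The helper names are made up
for this mail and are not part of any API. Note that NSString's length
counts UTF-16 code units, and isEqualToString: is a literal comparison.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper names, made up for this illustration.

// "e acute" as the single code point U+00E9 (the short form).
static NSString *ShortEAcute(void) {
    return @"\u00E9";
}

// "e acute" as U+0065 followed by the combining accent U+0301 (the long form).
static NSString *LongEAcute(void) {
    return @"e\u0301";
}

// Lists the UTF-16 code units of a string as text, separated by spaces.
static NSString *CodeUnitDump(NSString *s) {
    NSMutableArray<NSString *> *parts = [NSMutableArray array];
    for (NSUInteger i = 0; i < s.length; i++) {
        unsigned int unit = [s characterAtIndex:i];
        [parts addObject:[NSString stringWithFormat:@"U+%04X", unit]];
    }
    return [parts componentsJoinedByString:@" "];
}
```

With these, ShortEAcute().length is 1 and LongEAcute().length is 2,
CodeUnitDump(LongEAcute()) gives "U+0065 U+0301", and
[ShortEAcute() isEqualToString: LongEAcute()] is NO even though both draw
the same letter.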
Do we get both? YES. Mac file systems have traditionally stored the long
form (HFS+ decomposes file names), so paths coming from the system often
use it. On a French keyboard, if you type e acute, you get the short form.
If you copy and paste, it could be either. This can be bad. For example, if
you get a list of the files in a folder and let the user type a name to
choose a file, there might not be a match, even though the user can see one.
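
A sketch of how that mismatch can show up. The listing and the helper names
are invented for this mail; the only real API used is NSArray's
indexOfObject:, which compares with isEqual: and so, for strings, compares
literally.

```objc
#import <Foundation/Foundation.h>

// Made-up sample listing; the second name spells "e acute" the long way,
// as U+0065 followed by U+0301.
static NSArray<NSString *> *SampleListing(void) {
    return @[ @"notes.txt", @"cafe\u0301.txt" ];
}

// Hypothetical helper: looks up what the user typed using plain object
// equality. Returns NSNotFound when no entry is literally equal.
static NSUInteger LiteralIndexOfName(NSArray<NSString *> *listing,
                                     NSString *typed) {
    return [listing indexOfObject:typed];
}
```

If the user types the name with the short form (U+00E9),
LiteralIndexOfName(SampleListing(), typed) returns NSNotFound, although the
two names are indistinguishable on screen.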
To get over this, we have "normalisation" forms. There are four of them:
C, D, KC and KD. precomposedStringWithCanonicalMapping converts to form C,
which prefers the short (precomposed) form wherever one exists.
It doesn't really matter which form you pick, as long as both strings use
the same one: if you run all your strings through
precomposedStringWithCanonicalMapping, then you will get more
expected results when comparing strings.
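
As a rough sketch of "run everything through the same form" for the
rangeOfString case: the helper name is my own invention, and I am assuming
both arguments are non-nil. It normalises both sides to form C and then
searches literally.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: brings both strings to form C before searching.
// The returned range refers to the NORMALISED haystack, so its offsets may
// differ from offsets in the original string.
static NSRange NormalizedRangeOfString(NSString *haystack, NSString *needle) {
    NSString *h = haystack.precomposedStringWithCanonicalMapping;
    NSString *n = needle.precomposedStringWithCanonicalMapping;
    return [h rangeOfString:n options:NSLiteralSearch];
}
```

For example, searching the long-form string @"Ba\u0308r_01" for the
short-form @"B\u00E4r" finds a match at location 0 with length 3. If you
need offsets into the original string, normalise the original once and keep
working with the normalised copy.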
On Mon, 14 Mar 2022 at 13:05, Gabriel Zachmann via Cocoa-dev <
email@hidden> wrote:
> >
> > It’s hard to tell from the above snippet, but I suspect your strings are
> different in normalization.
>
> I suspected that, too, but I have no expertise in normalization.
>
> > Specifically, I suspect that file_basename uses two Unicode codepoints
> for the ä, and info_item uses only one.
> >
> > As an experiment, try running both strings through
> -precomposedStringWithCanonicalMapping before doing
>
> Thanks a lot!
> That seems to solve the problem.
> I have looked up the documentation of the method, but I don't really
> understand it.
> Could you explain it to me?
>
>
> To gain more insights (I usually think it's better to understand in depth
> what one is doing),
> here is some more background info:
>
> One of the strings is derived from a path, like so:
>
> query_ = [[NSMetadataQuery alloc] init];
> NSPredicate * predicate = [NSPredicate predicateWithFormat:
>     @"(kMDItemContentTypeTree = 'public.image')" ];
> [query_ setSearchScopes: [NSArray arrayWithObject: mainDirectoryLocation_]];
> [query_ setPredicate: predicate];
> [query_ startQuery];
> .....
> NSString * img_filename = [[query_ resultAtIndex: i] valueForAttribute: @"kMDItemPath"];
> file_basename = [[img_filename lastPathComponent] stringByDeletingPathExtension];
>
>
>
> The other string is derived from EXIF data, like so:
>
> CFDictionaryRef tiff_dict;
> CFDictionaryGetValueIfPresent( fileProps, kCGImagePropertyTIFFDictionary,
>                                (const void **) & tiff_dict );
>
> CFDictionaryGetValueIfPresent( tiff_dict, kCGImagePropertyTIFFImageDescription,
>                                (const void **) & caption );
> info_item = [[NSString alloc] initWithString: (__bridge NSString * _Nonnull)(caption)];
>
>
> Could that explain the funny handling of umlauts I am seeing?
>
> In particular, in the debugger, I can see that the type of file_basename
> is actually an (NSPathStore2 *).
> Could that be a problem?
>
>
>
> Thanks a lot in advance.
>
> Best regards, Gabriel
>
>
> _______________________________________________
>
> Cocoa-dev mailing list (email@hidden)
>
> Please do not post admin requests or moderator comments to the list.
> Contact the moderators at cocoa-dev-admins(at)lists.apple.com
>
> Help/Unsubscribe/Update your Subscription:
>
> This email sent to email@hidden
>