Re: encoding of file names
- Subject: Re: encoding of file names
- From: Quincey Morris <email@hidden>
- Date: Tue, 24 May 2011 21:09:44 -0700
On May 24, 2011, at 17:33, Ken Thomases wrote:
>> I am sure this becomes more difficult with Arabic, Hebrew and Thai and other writing systems that have highly composed forms. (not sure if that's the right term)
>
> Not really.
There *is* another level, described briefly here:
http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html
As I understand things, there are at least 3 levels, informally at least:
1. Codepoints. Each Unicode codepoint is represented by 1, 2 or more 8, 16 or 32 bit values (UTF-8, UTF-16, etc). I don't know if the individual 8, 16 or 32 bit components have an official name. I call them "components".
2. Characters. Some Unicode characters consist of a base codepoint and one or more combining marks (accents). Some characters may be representable as either a single codepoint (precomposed) or multiple codepoints (decomposed), and there are various normalization rule sets that specify the order and composition for various contexts.
3. Grapheme clusters. Some written units in some languages (such as Arabic, Hebrew and Thai) are made up of multiple characters.
This means that, in general, a single grapheme cluster may consist of a variable number of characters, which may each consist of a variable number of Unicode codepoints, which may each consist of a variable number of components.
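To make the first two levels concrete, here is a small sketch in Python (not Cocoa; I'm using it only because its strings expose codepoints directly). The helper names 'utf16_units' and 'utf8_units' are my own, not part of any library. It counts codepoints and components for a precomposed "Ö", a decomposed "Ö", and a codepoint outside the 16-bit range:

```python
import unicodedata

def utf16_units(s: str) -> int:
    """Count the 16-bit components in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-be")) // 2

def utf8_units(s: str) -> int:
    """Count the 8-bit components in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

precomposed = "\u00D6"   # LATIN CAPITAL LETTER O WITH DIAERESIS, one codepoint
decomposed = "O\u0308"   # "O" followed by COMBINING DIAERESIS, two codepoints
g_clef = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the 16-bit range

for name, s in [("precomposed", precomposed),
                ("decomposed", decomposed),
                ("g_clef", g_clef)]:
    # len() on a Python str counts codepoints, not components.
    print(name, "codepoints:", len(s),
          "UTF-16:", utf16_units(s), "UTF-8:", utf8_units(s))

# Normalization maps between the two spellings of the same character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

The same character ("Ö") comes out as one or two codepoints depending on composition, and a single codepoint (the G clef) needs two 16-bit components in UTF-16.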
Within Cocoa, the "native" string capabilities happen to be implemented in terms of 16-bit components whose type is 'unichar'. (Specifically, 'unichar' is *not* a Unicode character type, nor even a Unicode codepoint type. It's a raw component value. This is in spite of the fact that NSString methods that access these components refer to them, incorrectly, as "characters".)
In class NSString, though, except when you specifically access individual components or use methods and options specifically relating to composition, strings are treated as *character* sequences, meaning that composition and normalization are generally handled transparently.
NSString only deals with grapheme clusters in a limited way ('rangeOfComposedCharacterSequence...'). For more sophisticated capabilities, you need to move up to the Text system.
The document I linked to above also talks about a fourth level, which is related to text transformations such as upper- and lower-casing, which add another level of length variability in representation (the number of grapheme clusters in upper and lower case representation of the same text may be different).
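As a rough illustration of that length variability, again in Python rather than Cocoa, and assuming Python's built-in 'str.upper()' (which applies Unicode's full case mappings) is a fair stand-in for what a text system does:

```python
# Two lowercase strings whose uppercase forms are longer.
samples = [
    "stra\u00DFe",   # contains LATIN SMALL LETTER SHARP S
    "\uFB01n",       # starts with LATIN SMALL LIGATURE FI
]

for lower in samples:
    upper = lower.upper()
    # Each codepoint here is its own grapheme cluster, so the
    # codepoint counts also show the change in cluster count.
    print(repr(lower), len(lower), "->", repr(upper), len(upper))
```

In both cases one written unit in the lowercase text becomes two in the uppercase text, so lengths and ranges computed before a case change can't be assumed to hold after it.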
AFAIK the file system operates at level 2, which means that composition and normalization are *not* significant in file name comparisons, though file names *are* stored with a canonical composition and normalization.
Ken, is that a correct statement of how it works?
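What I mean by "not significant" can be sketched like this. It's Python, and 'same_file_name' is a hypothetical helper of my own; I'm assuming standard NFD as the canonical form purely for illustration, and the real file system may use its own variant of decomposition and may also ignore case:

```python
import unicodedata

def same_file_name(a: str, b: str) -> bool:
    """Hypothetical: compare two names after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "R\u00E9sum\u00E9.txt"       # precomposed e-acute, twice
decomposed = "Re\u0301sume\u0301.txt"   # "e" + COMBINING ACUTE ACCENT, twice

print(composed == decomposed)                 # literal comparison
print(same_file_name(composed, decomposed))   # normalization-insensitive
```

The two spellings differ literally but name the same file under a level 2 comparison.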
> You just need to be aware of the semantics of the operations you're performing so you can pick the right one -- i.e. isEqual: and isEqualToString: perform literal comparison, while -compare: does not, and the -compare:options:... methods let you choose that as well as case-sensitivity, diacritic-sensitivity, and width-sensitivity.
And "literal" means component by component. The NSString class reference describes 'NSLiteralSearch' like this:
> Exact character-by-character equivalence.
I've always understood this to mean unichar by unichar, i.e. component by component, since the NSString documentation generally refers to components as "characters".
Here's what the NSString class reference says about 'isEqualToString:':
> The comparison uses the canonical representation of strings, which for a particular string is the length of the string plus the Unicode characters that make up the string. When this method compares two strings, if the individual Unicodes are the same, then the strings are equal, regardless of the backing store. “Literal” when applied to string comparison means that various Unicode decomposition rules are not applied and Unicode characters are individually compared. So, for instance, “Ö” represented as the composed character sequence “O” and umlaut would not compare equal to “Ö” represented as one Unicode character.
This makes absolutely no sense unless the word "character" is here understood to mean "component".
Under this interpretation, NSString has no real codepoint by codepoint comparison. However, I believe that each codepoint is represented by a *unique* UTF-16 component sequence, so a literal comparison amounts to the same thing as a codepoint by codepoint comparison.
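A quick way to check that belief on a few samples, sketched in Python rather than with NSString itself. The function names are my own, and this only models a component by component comparison; it doesn't call any Cocoa API:

```python
def utf16_components(s: str) -> list:
    """Return the UTF-16 encoding of s as a list of 16-bit values."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def literal_equal(a: str, b: str) -> bool:
    """Component by component comparison (my model of a 'literal' compare)."""
    return utf16_components(a) == utf16_components(b)

def codepoint_equal(a: str, b: str) -> bool:
    """Codepoint by codepoint comparison."""
    return [ord(c) for c in a] == [ord(c) for c in b]

samples = ["\u00D6", "O\u0308", "\U0001D11E", "O"]
for a in samples:
    for b in samples:
        # The two comparisons agree on every pair of samples.
        assert literal_equal(a, b) == codepoint_equal(a, b)

# Composed vs decomposed "Ö" differ under both comparisons.
print(literal_equal("\u00D6", "O\u0308"))   # False
```

The G clef sample encodes as a pair of components (a surrogate pair), and the two comparisons still agree on it, which is what I'd expect if the encoding is unique.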
Am I still on track here?
_______________________________________________
Cocoa-dev mailing list (email@hidden)