Re: encoding of file names
- Subject: Re: encoding of file names
- From: Ken Thomases <email@hidden>
- Date: Wed, 25 May 2011 00:12:58 -0500
On May 24, 2011, at 11:09 PM, Quincey Morris wrote:
> On May 24, 2011, at 17:33, Ken Thomases wrote:
>
>>> I am sure this becomes more difficult with Arabic, Hebrew and Thai and other writing systems that have highly composed forms. (not sure if that's the right term)
>>
>> Not really.
>
> There *is* another level, described briefly here:
>
> http://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html
>
> As I understand things, there are at least 3 levels, informally at least:
>
> 1. Codepoints. Each Unicode codepoint is represented by 1, 2 or more 8, 16 or 32 bit values (UTF-8, UTF-16, etc). I don't know if the individual 8, 16 or 32 bit components have an official name. I call them "components".
>
> 2. Characters. Some Unicode characters consist of a base codepoint and one or more combining marks (accents). Some characters may be representable as either a single codepoint (precomposed) or multiple codepoints (decomposed), and there are various normalization rule sets that specify the order and composition for various contexts.
>
> 3. Grapheme clusters. Some written units in some languages (such as Arabic, Hebrew and Thai) are made up of multiple characters.
>
> This means that, in general, a single grapheme cluster may consist of a variable number of characters, which may each consist of a variable number of Unicode codepoints, which may each consist of a variable number of components.
This is all correct, but seems to me to introduce stuff that's irrelevant to the current discussion, which was about comparing strings and, in particular, file paths. Grapheme clusters and surrogate pairs really only come into play when one is splitting strings or identifying indexes or sub-ranges corresponding to what users think of as characters. They don't affect comparison of strings for equality, although they may affect comparison for sorting for display.
Also, I wouldn't say that codepoints "may each consist of a variable number of components". They may be _encoded_ to a variable number of components, but don't "consist" of them.
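As a rough illustration of the "encoded to" versus "consist of" distinction, here is a minimal Python sketch (Python rather than Cocoa, purely for illustration; the helper `code_units` and the choice of sample codepoint are assumptions, not anything from the thread). It shows one codepoint occupying a different number of components depending on the encoding:

```python
# One codepoint, several encodings: the codepoint itself does not change,
# only the number of code units ("components") used to encode it.

def code_units(text, encoding, unit_size):
    """Return how many fixed-size code units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // unit_size

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a codepoint outside the BMP

code_points = len(clef)                          # Python strings count codepoints
utf8_units = code_units(clef, "utf-8", 1)        # 8-bit units
utf16_units = code_units(clef, "utf-16-le", 2)   # 16-bit units (a surrogate pair)
utf32_units = code_units(clef, "utf-32-le", 4)   # 32-bit units

print(code_points, utf8_units, utf16_units, utf32_units)  # 1 4 2 1
```

The codepoint count stays at 1 throughout; only its encoded length varies.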
> Within Cocoa, the "native" string capabilities happen to be implemented in terms of 16-bit components whose type is 'unichar'. (Specifically, 'unichar' is *not* a Unicode character type, nor even a Unicode codepoint type. It's a raw component value. This is in spite of the fact that NSString methods that access these components refer to them, incorrectly, as "characters".)
>
> In class NSString, though, except when you specifically access individual components or use methods and options specifically relating to composition, strings are treated as *character* sequences, meaning that composition and normalization are generally handled transparently.
This last bit is not true. For the most part, NSString deals with sequences of UTF-16 units. It is the exception, not the norm, for NSString to transparently ignore differences in composition -- i.e. to treat their contents as characters.
> AFAIK the file system operates at level 2, which means that composition and normalization are *not* significant in file name comparisons, though file names *are* stored with a canonical composition and normalization.
>
> Ken, is that a correct statement of how it works?
This is basically correct. The file system APIs normalize all paths they receive to Apple's variant of NFD. So, the caller does not have to normalize it themselves in order to match a file path if, for example, they're trying to open an existing file.
Still, if you need to supply a file path to a C API, you should use one of the file-system-representation methods or functions. Likewise, if you receive a file path from a C API, you should use the appropriate file-system-representation-taking methods or functions to obtain an NSString or CFString object from it.
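To make the normalization point concrete, the following is a minimal Python sketch of why a caller's choice of composition does not matter once both names are normalized. It uses standard NFD from Python's `unicodedata` module, which is an assumption for illustration only: Apple's variant of NFD differs from the standard form for some codepoints, and this is not how the file system itself is implemented. The helper `same_file_name` and the sample names are hypothetical.

```python
import unicodedata

def same_file_name(a, b):
    """Compare two names after normalizing both to standard NFD.

    Assumption: standard NFD stands in for Apple's variant here.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "R\u00e9sum\u00e9.txt"    # each é is the single codepoint U+00E9
decomposed = "Re\u0301sume\u0301.txt"   # each é is "e" + U+0301 COMBINING ACUTE ACCENT

literal_equal = precomposed == decomposed               # codepoint by codepoint
normalized_equal = same_file_name(precomposed, decomposed)

print(literal_equal, normalized_equal)  # False True
```

The two spellings differ literally, but they name the same file once both are brought to the same normalization form.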
>> You just need to be aware of the semantics of the operations you're performing so you can pick the right one -- i.e. isEqual: and isEqualToString: perform literal comparison, while -compare: does not, and the -compare:options:... methods let you choose that as well as case-sensitivity, diacritic-sensitivity, and width-sensitivity.
>
> And "literal" means component by component. The NSString class reference describes 'NSLiteralSearch' like this:
>
>> Exact character-by-character equivalence.
>
> I've always understood this to mean unichar by unichar, i.e. component by component, since the NSString documentation generally refers to components as "characters".
Well, "components" as you define them are the result of a particular encoding (e.g. UTF-16). "Literal" means codepoint-by-codepoint equivalence. Since each encoding maps codepoint sequences to component sequences one-to-one, that implies component-by-component equivalence, too.
Remember that UTF-16 determines the _interface_ of NSString, not necessarily its storage. Internally, it may be storing some strings in UTF-8. So, when you consider a comparison between two NSStrings, the important thing isn't the components, but the codepoints that they encode.
> Here's what the NSString class reference says about 'isEqualToString:':
>
>> The comparison uses the canonical representation of strings, which for a particular string is the length of the string plus the Unicode characters that make up the string. When this method compares two strings, if the individual Unicodes are the same, then the strings are equal, regardless of the backing store. “Literal” when applied to string comparison means that various Unicode decomposition rules are not applied and Unicode characters are individually compared. So, for instance, “Ö” represented as the composed character sequence “O” and umlaut would not compare equal to “Ö” represented as one Unicode character.
>
> This makes absolutely no sense unless the word "character" is here understood to mean "component".
Well, I would say "codepoint" is more proper. "O", the combining umlaut (diaeresis), and "Ö" are all distinct codepoints. They are not components, although they can be represented by components.
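The documentation's "Ö" example can be reproduced outside Cocoa. The sketch below is Python, used only as an illustration (its `==` on strings happens to compare codepoint by codepoint, which is an analogy to a literal comparison, not a claim about NSString's implementation). It lists the codepoints involved and contrasts a literal comparison with one made after normalization:

```python
import unicodedata

composed = "\u00d6"      # one codepoint
decomposed = "O\u0308"   # base letter followed by a combining mark

def describe(text):
    """Return a 'U+XXXX NAME' label for every codepoint in `text`."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

composed_points = describe(composed)
decomposed_points = describe(decomposed)

# Literal: compare the codepoint sequences exactly as given.
literal = composed == decomposed
# Non-literal: apply Unicode composition (NFC) to both sides first.
canonical = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

print(composed_points)
print(decomposed_points)
print(literal, canonical)  # False True
```

Three distinct codepoints appear in total (U+00D6, U+004F, U+0308), which is why the literal comparison fails even though both strings render as the same character.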
> Under this interpretation, NSString has no real codepoint by codepoint comparison. However, I believe that each codepoint is represented by a *unique* UTF-16 component sequence, so a literal comparison amounts to the same thing as a codepoint by codepoint comparison.
>
> Am I still on track here?
Well, you're correct that a component-by-component comparison is equivalent to a codepoint-by-codepoint comparison. I disagree that NSString doesn't have the latter. Because of the equivalence, I suppose it may be a matter of perspective.
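The equivalence can be spot-checked with a small Python sketch. The helpers `utf16_units` and `agree` and the sample strings are hypothetical, and the check assumes well-formed strings with no lone surrogates; it is a demonstration on a handful of inputs, not a proof:

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def agree(a, b):
    """True when codepoint equality and UTF-16 unit equality give the same answer."""
    return (a == b) == (utf16_units(a) == utf16_units(b))

# Precomposed, decomposed, two neighbouring non-BMP codepoints, plain ASCII.
samples = ["\u00d6", "O\u0308", "\U0001D11E", "\U0001D11F", "abc"]

all_agree = all(agree(a, b) for a in samples for b in samples)
clef_units = [hex(u) for u in utf16_units("\U0001D11E")]

print(all_agree)   # True
print(clef_units)  # ['0xd834', '0xdd1e']
```

Every pair gives the same verdict under both comparisons, including the non-BMP codepoints that need a surrogate pair.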
Regards,
Ken