Re: Normalisation of filenames
- Subject: Re: Normalisation of filenames
- From: Aki Inoue <email@hidden>
- Date: Sun, 02 Apr 2017 15:58:42 -0700
> On Apr 2, 2017, at 1:50 AM, Gerriet M. Denkmann <email@hidden> wrote:
>
>>
>> On 2 Apr 2017, at 10:59, Aki Inoue <email@hidden> wrote:
>>
>>
>>> On Apr 1, 2017, at 4:57 PM, Gerriet M. Denkmann <email@hidden> wrote:
>>>
>>>
>>>> On 2 Apr 2017, at 06:33, Jens Alfke <email@hidden> wrote:
>>>>
>>>>
>>>>> On Apr 1, 2017, at 11:58 AM, Gerriet M. Denkmann <email@hidden> wrote:
>>>>>
>>>>> I think that the examples above show, that NSURL does indeed do something about normalising Unicode strings.
>>>>
>>>> That makes sense; I’d expect that one of the RFCs covering URLs describes normalization. Otherwise constructing URLs (for a REST API, say) could become quite ambiguous because you wouldn’t know which way to encode various Unicode characters.
>>>>
>>>>> But my point is that NSURL gets the normalisation wrong in this case; or at least that it is not very consistent in normalising strings.
>>>>
>>>> Yes, it does seem wrong that you can have two filenames that are treated as distinct by the filesystem, but whose URL.path properties produce identical NSStrings.
>>>
>>> Sorry, my explanation was not quite clear: these two filenames look absolutely identical, but as a sequence of Unicode code points, they are not (tone-mark and vowel are in different order).
>>>
>>> What puzzles me is that consonant + THAI CHARACTER MAI EK + THAI CHARACTER SARA UU gets normalised by NSURL to: consonant + THAI CHARACTER SARA UU + THAI CHARACTER MAI EK (note the different order), whereas consonant + THAI CHARACTER MAI EK + THAI CHARACTER SARA II is left unchanged.
>> Gerriet,
>>
>> This is standard Unicode normalization behavior. Each Unicode character is assigned a Canonical Combining Class, an integer value that defines the canonical ordering of combining marks.
>>
>> The combining class of THAI CHARACTER SARA UU is 103, and that of THAI CHARACTER MAI EK is 107. So MAI EK always comes after SARA UU in canonical order.
>>
>> THAI CHARACTER SARA II, on the other hand, has the value 0, which marks the start of a new reordering segment. That’s why it is not reordered with respect to other Thai combining characters.
>>
>> Aki
>
> Thanks a lot for this explanation.
>
> I just read about Canonical_Combining_Class in <http://unicode.org/reports/tr44/#Validation_of_CCC>.
>
> What I did not find was an explanation of why all Thai top vowels (plus THAI CHARACTER MAI HAN-AKAT) have Canonical_Combining_Class 0, Not_Reordered, whereas the bottom vowels have 103.
I’m not a linguistic expert, but my understanding of the Unicode combining class is that two characters can be in the same combining class when:
- the ordering of the two characters has semantic value (changing the order changes the meaning, for example)
or
- they can never be attached to a base character at the same time, linguistically and/or grammatically
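
The practical consequence is that canonical ordering never swaps two marks of the same class, so whatever order they were typed in survives normalization. A minimal sketch in Python, using the standard `unicodedata` module rather than NSURL; the base consonant KO KAI and the artificial two-tone-mark sequence are my own choices for illustration, not something from the filenames discussed in this thread:

```python
import unicodedata

KO_KAI = "\u0E01"   # THAI CHARACTER KO KAI, an arbitrary base consonant (assumption)
MAI_EK = "\u0E48"   # THAI CHARACTER MAI EK, tone mark
MAI_THO = "\u0E49"  # THAI CHARACTER MAI THO, tone mark

# Two tone marks on one consonant: not real Thai, just a test sequence.
ek_then_tho = KO_KAI + MAI_EK + MAI_THO
tho_then_ek = KO_KAI + MAI_THO + MAI_EK

# unicodedata.combining() returns the canonical combining class of a character.
same_class = unicodedata.combining(MAI_EK) == unicodedata.combining(MAI_THO)

# Marks of equal class keep their relative order under normalization.
nfd_a = unicodedata.normalize("NFD", ek_then_tho)
nfd_b = unicodedata.normalize("NFD", tho_then_ek)

print(same_class, nfd_a == ek_then_tho, nfd_b == tho_then_ek, nfd_a == nfd_b)
```

Both inputs come back unchanged and stay distinct from each other, which is harmless here because the two marks never occur together in practice.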
> Another strange thing: the tone marks have 107, but THAI CHARACTER THANTHAKHAT has 0. (This sometimes occurs together with ิ, e.g. เกียรติ์, or ุ, e.g. บงสุ์ )
As far as I know, the class 0 Thai vowels can appear multiple times for a single consonant, and their ordering has distinct meaning. So these characters must be in the same Unicode combining class, 0.
The Unicode specification is carefully crafted so that the general rules for the combining class work universally (except for the Hebrew accent characters).
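
To check the class values mentioned in this thread, and to see the reordering itself, something like the following should do. It is only a sketch using Python's standard `unicodedata` module, so it shows the Unicode algorithm, not what NSURL does internally; the base consonant KO KAI and the helper names are hypothetical choices of mine:

```python
import unicodedata

# Character names as given in the Unicode character database.
NAMES = [
    "THAI CHARACTER MAI HAN-AKAT",
    "THAI CHARACTER SARA II",
    "THAI CHARACTER SARA UU",
    "THAI CHARACTER MAI EK",
    "THAI CHARACTER THANTHAKHAT",
]

# Map each name to its canonical combining class.
classes = {name: unicodedata.combining(unicodedata.lookup(name)) for name in NAMES}

def seq(*names):
    """Build a string from Unicode character names (hypothetical helper)."""
    return "".join(unicodedata.lookup(n) for n in names)

BASE = "THAI CHARACTER KO KAI"  # arbitrary consonant, an assumption

# Tone mark typed before the bottom vowel: 107 before 103, so it gets reordered.
ek_uu = seq(BASE, "THAI CHARACTER MAI EK", "THAI CHARACTER SARA UU")
uu_ek = seq(BASE, "THAI CHARACTER SARA UU", "THAI CHARACTER MAI EK")

# Tone mark typed before the top vowel: SARA II has class 0, so nothing moves.
ek_ii = seq(BASE, "THAI CHARACTER MAI EK", "THAI CHARACTER SARA II")

reordered = unicodedata.normalize("NFD", ek_uu)
unchanged = unicodedata.normalize("NFD", ek_ii)

for name, value in classes.items():
    print(f"{value:3d}  {name}")
print(reordered == uu_ek, unchanged == ek_ii)
```

The class 0 character acts as a barrier: marks on either side of it are never moved across it, which is why the MAI EK + SARA II sequence is left alone.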
> If you have any links to an explanation for these (to me) rather strange decisions of the Unicode people, I would appreciate this very much.
These questions would probably be appropriate for the Unicode mailing list <http://www.unicode.org/consortium/distlist.html>.
There are many real linguistic experts there (some of them have been there since the beginning of Unicode) who should be able to answer your questions :)
Aki
>
>
> Kind regards,
>
> Gerriet.
_______________________________________________
Cocoa-dev mailing list (email@hidden)