Re: Normalisation of filenames
- Subject: Re: Normalisation of filenames
- From: "Gerriet M. Denkmann" <email@hidden>
- Date: Sun, 02 Apr 2017 15:50:19 +0700
> On 2 Apr 2017, at 10:59, Aki Inoue <email@hidden> wrote:
>
>
>> On Apr 1, 2017, at 4:57 PM, Gerriet M. Denkmann <email@hidden> wrote:
>>
>>
>>> On 2 Apr 2017, at 06:33, Jens Alfke <email@hidden> wrote:
>>>
>>>
>>>> On Apr 1, 2017, at 11:58 AM, Gerriet M. Denkmann <email@hidden> wrote:
>>>>
>>>> I think that the examples above show, that NSURL does indeed do something about normalising Unicode strings.
>>>
>>> That makes sense; I’d expect that one of the RFCs covering URLs describes normalization. Otherwise constructing URLs (for a REST API, say) could become quite ambiguous because you wouldn’t know which way to encode various Unicode characters.
>>>
>>>> But my point is that NSURL gets the normalisation wrong in this case; or at least that it is not very consistent in normalising strings.
>>>
>>> Yes, it does seem wrong that you can have two filenames that are treated as distinct by the filesystem, but whose URL.path properties produce identical NSStrings.
>>
>> Sorry, my explanation was not quite clear: these two filenames look absolutely identical, but as a sequence of Unicode code points, they are not (tone-mark and vowel are in different order).
>>
>> What puzzles me is that consonant + THAI CHARACTER MAI EK + THAI CHARACTER SARA UU gets normalised by NSURL to: consonant + THAI CHARACTER SARA UU + THAI CHARACTER MAI EK (note the different order), whereas consonant + THAI CHARACTER MAI EK + THAI CHARACTER SARA II is left unchanged.
> Garret,
>
> This is the standard Unicode Normalization behavior. Each Unicode character is assigned the Unicode Combining Property, an integer value defining the canonical ordering of combining marks.
>
> The Unicode Combining Property for THAI CHARACTER SARA UU is 103, and THAI CHARACTER MAI EK 107. So, MAI EK always comes after SARA UU in the canonical order.
>
> On the other hand, THAI CHARACTER SARA II has the property value 0 which indicates the start of the reordering segment. That’s why the character is not reordered in respect to other Thai combining characters.
>
> Aki
Thanks a lot for this explanation.
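The reordering described above can be reproduced outside of NSURL. The following is only a minimal sketch in Python: it assumes standard NFD normalisation (the decomposition NSURL applies to file paths may differ in detail), uses KO KAI as an arbitrary example consonant, and the names `nfd`, `KO_KAI` etc. are purely illustrative.

```python
import unicodedata

# Example characters; KO KAI is an arbitrary consonant chosen for illustration.
KO_KAI = "\u0e01"   # THAI CHARACTER KO KAI
SARA_II = "\u0e35"  # THAI CHARACTER SARA II (top vowel)
SARA_UU = "\u0e39"  # THAI CHARACTER SARA UU (bottom vowel)
MAI_EK = "\u0e48"   # THAI CHARACTER MAI EK (tone mark)


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def names(text):
    """List the Unicode character names in text, in order."""
    return [unicodedata.name(ch) for ch in text]


# Tone mark typed before the bottom vowel: the two marks get swapped.
with_uu = KO_KAI + MAI_EK + SARA_UU
print(names(with_uu), "->", names(nfd(with_uu)))

# Tone mark typed before the top vowel: the sequence is left unchanged.
with_ii = KO_KAI + MAI_EK + SARA_II
print(names(with_ii), "->", names(nfd(with_ii)))
```

With the Unicode data shipped in current Python versions, the first sequence comes back as consonant + SARA UU + MAI EK, while the second is returned as typed.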
I have just read about Canonical_Combining_Class in <http://unicode.org/reports/tr44/#Validation_of_CCC>.
What I did not find was an explanation of why all Thai top vowels (and THAI CHARACTER MAI HAN-AKAT) have Canonical_Combining_Class 0 (Not_Reordered), whereas the bottom vowels have 103.
Another strange thing: the tone marks have 107, but THAI CHARACTER THANTHAKHAT has 0. (This sometimes occurs together with ิ, e.g. เกียรติ์, or ุ, e.g. บงสุ์ )
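The class values mentioned in this thread can be checked with a short script. Again just a sketch in Python, relying on whatever Unicode data the installed Python version ships; `THAI_MARKS` and `ccc_table` are hypothetical names, and the list only covers the marks discussed here.

```python
import unicodedata

# Only the Thai marks discussed in this thread (not a complete list).
THAI_MARKS = [
    0x0E31,                          # MAI HAN-AKAT
    0x0E34, 0x0E35, 0x0E36, 0x0E37,  # top vowels
    0x0E38, 0x0E39,                  # bottom vowels
    0x0E48, 0x0E49, 0x0E4A, 0x0E4B,  # tone marks
    0x0E4C,                          # THANTHAKHAT
]


def ccc_table(code_points):
    """Return (code point, name, canonical combining class) triples."""
    return [
        (f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.combining(chr(cp)))
        for cp in code_points
    ]


for code, name, ccc in ccc_table(THAI_MARKS):
    print(f"{code}  {name:<30} ccc={ccc}")
```

One consequence of THANTHAKHAT having class 0: SARA U + THANTHAKHAT and THANTHAKHAT + SARA U are both left as they are by normalisation, so the two spellings stay distinct even though they look the same.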
If you have any links to an explanation of these (to me) rather strange decisions by the Unicode people, I would very much appreciate them.
Kind regards,
Gerriet.
_______________________________________________
Cocoa-dev mailing list (email@hidden)