On 14 Jan 2017, at 10:35 pm, has <email@hidden> wrote:
That's Unicode working as advertised: comparing for logical meaning, not just comparing raw codepoints:
"ꜵꜨꜲ" = "aoTzAA" --> true
But that's a different result to:
set theString to current application's NSString's stringWithString:"ꜵꜨꜲ"
theString's compare:"aoTzAA" options:(current application's NSDiacriticInsensitiveSearch)
The 10.6 AS Release Notes say:
The various types of ignoring behavior for text comparisons are now defined using Unicode General Categories, not ASCII characters:
ignoring punctuation ignores category P*: for example, left- and right-quotation marks are now ignored. However, the backtick character (`) used to be ignored but is now considered, because Unicode classifies it as a symbol, not punctuation.
ignoring hyphens ignores category Pd: for example, em- and en-dashes are now ignored.
ignoring whitespace ignores category Z*, plus tab (\t), return (\r), and linefeed (\n): for example, non-breaking spaces are now ignored.
No mention of diacriticals.
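For anyone wanting to check the categories those release notes refer to, here is a minimal sketch in Python (not AppleScript, and not a claim about how AS implements its ignoring clauses). It only looks up the Unicode General Category of a few of the characters mentioned; the sample names and the helper general_category are made up for illustration.

```python
import unicodedata

# Illustrative samples only: characters named in the release notes excerpt.
SAMPLES = {
    "backtick": "`",
    "left double quotation mark": "\u201c",
    "right double quotation mark": "\u201d",
    "em dash": "\u2014",
    "en dash": "\u2013",
    "no-break space": "\u00a0",
    "tab": "\t",
}


def general_category(ch):
    """Return the two-letter Unicode General Category of one character."""
    return unicodedata.category(ch)


if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(f"{name}: {general_category(ch)}")
    # backtick: Sk (a symbol, so not covered by P*)
    # left/right double quotation mark: Pi / Pf (both within P*)
    # em dash, en dash: Pd
    # no-break space: Zs (within Z*)
    # tab: Cc (a control character, hence the explicit "plus tab" in the notes)
```

Tab, return and linefeed are control characters (Cc) rather than Z*, which is consistent with the notes listing them separately.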
AS doesn't use what NSString uses, for example, because NSString is nasty old UCS2 that counts the number of raw codepoints, e.g. "é" may be reported as 1 or 2, depending on whether the underlying representation is composed (the Latin "é" glyph) or decomposed (ASCII "e" + "´" accent glyphs which are overlaid when displayed). Whereas AS always counts it as 1. Look into the old Carbon APIs that came over from OS 9, as AS's Unicode capabilities either come from there or else from a 3rd-party project like ICU that was around at the time AS originally added Unicode support (AS 1.3.7?).
It counts grapheme clusters, but that doesn't mean it doesn't use NSString (or, more probably, CFString). There's no reason they can't be using CFString and the equivalent of -enumerateSubstringsInRange:options:usingBlock: with NSStringEnumerationByComposedCharacterSequences. The last tests I did weren't exhaustive, but they gave the same results as AS in various scenarios of composition.
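To make the 1-versus-2 point concrete, here is a rough sketch in Python rather than AppleScript or Cocoa. It builds the composed and decomposed forms of "é" and counts them three ways: by code point, by UTF-16 code unit (which is what NSString's length reports), and by a deliberately simplified cluster count. The helpers utf16_units and count_clusters are made up for this example; count_clusters just treats each non-combining character as the start of a cluster and is not the full Unicode segmentation algorithm, nor necessarily what AS or NSStringEnumerationByComposedCharacterSequences does internally.

```python
import unicodedata

# The same visible "é" in two canonically equivalent forms.
composed = unicodedata.normalize("NFC", "e\u0301")    # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")   # "e" followed by U+0301


def utf16_units(s):
    """Number of UTF-16 code units in s."""
    return len(s.encode("utf-16-le")) // 2


def count_clusters(s):
    """Simplified cluster count: count characters that are not combining marks.

    Assumption: each combining mark (non-zero canonical combining class)
    attaches to the preceding base character. This ignores the rest of the
    Unicode text segmentation rules (emoji sequences, Hangul, and so on).
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    for label, s in (("composed", composed), ("decomposed", decomposed)):
        print(label, len(s), utf16_units(s), count_clusters(s))
    # composed 1 1 1
    # decomposed 2 2 1

    # A plain code-point comparison sees two different strings...
    print(composed == decomposed)                                # False
    # ...but they match once both are normalized to the same form.
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The cluster count is 1 for both forms, which is the behaviour described for AS above, while the raw counts differ.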