Re: Ignore accents when comparing strings
- Subject: Re: Ignore accents when comparing strings
- From: glenn andreas <email@hidden>
- Date: Thu, 6 Jan 2005 09:12:32 -0600
On Jan 6, 2005, at 4:26 AM, R.Claeson wrote:
On 6 Jan 2005, at 02:21, Brendan Younger wrote:
I think (mostly) everyone here is missing the point. The accented
"a", also known as U+00E0, does not compare before the word "arc" as
it should (especially since it compares equal to a non-accented "a").
The original poster had a legitimate problem which had nothing to do
with file encodings but that seems to be all everyone is talking
about. The question is, is it a bug that the accented "a" compares
after the word "arc" or is it a misunderstanding?
It might be true that an accented "a" should compare before the word
"arc" in some languages, but not in all. The behavior is (or should
be) locale dependent. In some languages the character "ä" is just an
accented "a" (a umlaut) and should be sorted together with "a" (German
comes to mind). In others it is a completely separate character with
its own location in the alphabet and must not be sorted together with
a plain "a" (Swedish comes to mind).
For all the gory details (and then some) on sorting and Unicode, check
out http://www.unicode.org/unicode/reports/tr10
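To make the locale dependence described above concrete, here is a minimal Python sketch. Both key functions are hypothetical toys written for this example, not real locale tables: one strips accents so that "ä" sorts with "a" (roughly the German dictionary convention), the other uses a hand-written alphabet in which "ä" comes after "z" (roughly the Swedish convention). It only handles lowercase letters.

```python
import unicodedata

def german_like_key(word):
    # Assumption: drop combining marks so "ä" sorts together with "a";
    # the original word is kept as a tie-breaker.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base, word)

# Assumption: a hand-written lowercase alphabet ending in å, ä, ö.
SWEDISH_LIKE_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_like_key(word):
    # Raises ValueError for characters outside the toy alphabet.
    return [SWEDISH_LIKE_ALPHABET.index(c) for c in word]

words = ["z", "\u00e4", "a", "b"]
german_order = sorted(words, key=german_like_key)
swedish_order = sorted(words, key=swedish_like_key)
print(german_order)   # "ä" lands next to "a"
print(swedish_order)  # "ä" lands after "z"
```

The same four strings come out in two different orders, which is the whole point: the order belongs to the locale doing the sorting, not to the characters.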
Basically, sorting takes each codepoint, decomposes it, converts that
to one or more tuples of weights (the collation elements), and then
combines the collation elements of every character in the string
(essentially shuffling together the first weight of each tuple, then
the next, etc., skipping any weight whose value is zero), and finally
compares the resulting sort keys. All of this is made more complicated
by regional variations and by things that cause "reverse" sort ordering
(like French accents).
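The decomposition step on its own is easy to try out. This is just a sketch of that first step, using Python's standard `unicodedata` module and assuming canonical decomposition (NFD) is the one meant here; it does not do any collation.

```python
import unicodedata

precomposed = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE
decomposed = unicodedata.normalize("NFD", precomposed)

# One precomposed character becomes a base letter plus a combining mark.
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0061 LATIN SMALL LETTER A
# U+0300 COMBINING GRAVE ACCENT
```

Each of the two resulting codepoints then gets its own collation element, which is why an accented letter ends up with two tuples in the table.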
It's fascinating reading, and includes the following:
1.8 Common Misperceptions
There are a number of common misperceptions about collation.
1. Collation is not aligned with character sets or repertoires of
characters. Swedish and German share most of the same characters, for
example, but have very different sorting orders.
2. Collation is not code point (binary) order. The simplest case of
this is capital Z vs. lowercase a. [snip]
3. Collation is not a property of strings. Consider a list of cities,
with each city correctly tagged with its language. Despite this, a
German user will expect to see the cities all sorted according to
German order, and not expect to see a word with ö appear after z,
simply because the city has a Swedish name. Of crucial importance is
that if a German businessman makes a database selection, such as to sum
up revenue in each of the cities from O... to P... for planning
purposes, then cities starting with Ö must not be excluded.
4. Collation order is not preserved under concatenation or substring
operations, in general. For example, the fact that x is less than y
does not mean that x + z is less than y + z. This is because characters
may form contractions across the substring or concatenation boundaries.
In summary, the following shows which implications not to expect.
[snip]
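Point 4 is the least intuitive of these, so here is a small Python sketch of it. It assumes a toy alphabet in which "ch" is a single letter sorting after "h" (as it does in Czech); `LETTERS`, `RANK` and `toy_key` are made-up names for this example and not part of any real collation library.

```python
# Toy alphabet: plain a-z order, except that the contraction "ch"
# counts as one letter placed between "h" and "i".
LETTERS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(LETTERS)}

def toy_key(word):
    key = []
    i = 0
    while i < len(word):
        # Prefer the two-character contraction when it matches.
        if word[i:i + 2] == "ch":
            key.append(RANK["ch"])
            i += 2
        else:
            key.append(RANK[word[i]])
            i += 1
    return key

x, y, z = "c", "d", "h"
print(toy_key(x) < toy_key(y))          # True: "c" sorts before "d"
print(toy_key(x + z) < toy_key(y + z))  # False: "ch" sorts after "dh"
```

Appending the same "h" to both strings flips their order, because "c" + "h" fuses into one letter across the concatenation boundary.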
Regardless, going by that report, when I apply the Unicode collation
algorithm by hand to compare 'à' (U+00E0) and 'arc', it looks like 'à'
should appear before 'arc'. The relevant table entries are:
00E0 ; [.0E33.0020.0002.0061][.0000.0035.0002.0300] # LATIN SMALL LETTER A WITH GRAVE; QQCM
0061 ; [.0E33.0020.0002.0061] # LATIN SMALL LETTER A
0072 ; [.0FC0.0020.0002.0072] # LATIN SMALL LETTER R
0063 ; [.0E60.0020.0002.0063] # LATIN SMALL LETTER C
The string 'a' has a sort key (assuming "regular" algorithmic collation
with no localized collation tables, etc.) of:
0E33 0000 0020 0000 0002 0000
'à' has a sort key of:
0E33 0000 0020 0035 0000 0002 0002 0000
'arc' has a sort key of:
0E33 0FC0 0E60 0000 0020 0020 0020 0000 0002 0002 0002 0000
As a result, the ordering should be 'a', 'à', 'arc'.
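The three keys above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the weights are copied from the four table entries quoted above, only the first three levels are used (the fourth weight is ignored), and `ELEMENTS`, `sort_key` and `show` are names invented for this example. It is nowhere near a full implementation of the algorithm.

```python
# Collation elements from the table entries quoted above, as
# (primary, secondary, tertiary) tuples; the fourth weight is dropped.
ELEMENTS = {
    "\u00e0": [(0x0E33, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)],  # a with grave
    "a": [(0x0E33, 0x0020, 0x0002)],
    "r": [(0x0FC0, 0x0020, 0x0002)],
    "c": [(0x0E60, 0x0020, 0x0002)],
}

def sort_key(text):
    # Look up the collation elements of every character, in order.
    elements = [ce for ch in text for ce in ELEMENTS[ch]]
    key = []
    for level in range(3):
        # All non-zero weights at this level, then a 0000 separator.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0)
    return key

def show(key):
    return " ".join(f"{weight:04X}" for weight in key)

words = ["arc", "\u00e0", "a"]
ordered = sorted(words, key=sort_key)
for word in ordered:
    print(word, show(sort_key(word)))
```

Python compares the lists element by element, so 'a' and U+00E0 first differ at the fourth weight (0000 vs 0035), while U+00E0 and 'arc' already differ at the second (0000 vs 0FC0).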
Glenn Andreas email@hidden
<http://www.gandreas.com/> oh my!
Mad, Bad, and Dangerous to Know