Re: Ignore accents when comparing strings

Subject: Re: Ignore accents when comparing strings
From: glenn andreas <email@hidden>
Date: Thu, 6 Jan 2005 09:12:32 -0600


On Jan 6, 2005, at 4:26 AM, R.Claeson wrote:

On 6 Jan 2005, at 02:21, Brendan Younger wrote:
I think (mostly) everyone here is missing the point. The accented "a", also known as U+00E0, does not compare before the word "arc" as it should (especially since it compares equal to a non-accented "a"). The original poster had a legitimate problem which had nothing to do with file encodings, but that seems to be all everyone is talking about. The question is: is it a bug that the accented "a" compares after the word "arc", or is it a misunderstanding?
It might be true that an accented "a" should compare before the word "arc" in some languages, but not in all. The behavior is (or should be) locale dependent. The character "ä" is just an accented "a" (a-umlaut) in some languages and should be sorted together with "a" there (German comes to mind), but in other languages it is a completely separate character with its own position in the alphabet and must not be sorted together with a plain "a" (Swedish comes to mind).

For all the gory details (and then some) on sorting and Unicode, check out http://www.unicode.org/unicode/reports/tr10

Basically, sorting takes each codepoint, decomposes it, and converts the result to tuples (the collation elements). It then combines the collation elements from every character of the string (essentially shuffling together the first element of each tuple, then the next, etc., skipping any element whose value is zero), and finally compares the resulting sort keys. All of this is made more complicated by regional variations and things that cause "reverse" sort ordering (like French accents).
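As a rough sketch of the steps just described (not the full algorithm from the report), the following Python builds a sort key from a tiny hand-made table. The table holds only the four entries quoted later in this message, with the fourth weight dropped; the names COLLATION_ELEMENTS and sort_key are made up for illustration, and there is no handling of contractions, tailoring, or characters missing from the table.

```python
import unicodedata

# Hypothetical mini table: (primary, secondary, tertiary) weights copied from
# the entries quoted in this message. A real implementation would load the
# complete collation table instead.
COLLATION_ELEMENTS = {
    "a": [(0x0E33, 0x0020, 0x0002)],
    "r": [(0x0FC0, 0x0020, 0x0002)],
    "c": [(0x0E60, 0x0020, 0x0002)],
    "\u0300": [(0x0000, 0x0035, 0x0002)],  # COMBINING GRAVE ACCENT
}


def sort_key(text):
    """Build a three-level sort key for text made of characters in the table."""
    # Step 1: decompose, so U+00E0 becomes "a" followed by U+0300.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: look up the collation elements for each code point.
    elements = [ce for ch in decomposed for ce in COLLATION_ELEMENTS[ch]]
    # Step 3: emit all primary weights, then all secondary weights, then all
    # tertiary weights, skipping zeros and ending each level with a 0000.
    key = []
    for level in range(3):
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0x0000)
    return key


# Print the key for LATIN SMALL LETTER A WITH GRAVE as four-digit hex values.
print(" ".join(f"{weight:04X}" for weight in sort_key("\u00e0")))
```

Keys built this way are plain lists of integers, so two strings can be ordered by comparing their keys element by element.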

It's fascinating reading, and includes the following:


1.8 Common Misperceptions

There are a number of common misperceptions about collation.

1. Collation is not aligned with character sets or repertoires of characters. Swedish and German share most of the same characters, for example, but have very different sorting orders.
2. Collation is not code point (binary) order. The simplest case of this is capital Z vs. lowercase a. [snip]
3. Collation is not a property of strings. Consider a list of cities, with each city correctly tagged with its language. Despite this, a German user will expect to see the cities all sorted according to German order, and not expect to see a word with ö appear after z, simply because the city has a Swedish name. Of crucial importance is that if a German businessman makes a database selection, such as to sum up revenue in each of the cities from O... to P... for planning purposes, then cities starting with Ö must not be excluded.
4. Collation order is not preserved under concatenation or substring operations, in general. For example, the fact that x is less than y does not mean that x + z is less than y + z. This is because characters may form contractions across the substring or concatenation boundaries.

In summary, the following shows which implications not to expect.

[snip]
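Point 2 is easy to see for yourself. As a small illustration (using Python only because its default string comparison happens to be plain code point order; the word list is made up), sorting without any collation puts capital Z first and the accented "a" last:

```python
# Python compares str values code point by code point, so sorted() here is
# binary order, not collation order.
words = ["arc", "\u00e0", "a", "Z"]  # "\u00e0" is LATIN SMALL LETTER A WITH GRAVE
by_code_point = sorted(words)

for word in by_code_point:
    # Show each word next to the code points that drove its position.
    print(word, [f"U+{ord(ch):04X}" for ch in word])
```

"Z" (U+005A) lands before "a" (U+0061), and the accented letter (U+00E0) lands after "arc", which is the kind of result the original poster was seeing.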

Regardless, when I work through the Unicode Collation Algorithm to compare 'à' and 'arc', it looks like 'à' should appear before 'arc'. The relevant collation table entries are:

00E0 ; [.0E33.0020.0002.0061][.0000.0035.0002.0300] # LATIN SMALL LETTER A WITH GRAVE; QQCM
0061 ; [.0E33.0020.0002.0061] # LATIN SMALL LETTER A
0072 ; [.0FC0.0020.0002.0072] # LATIN SMALL LETTER R
0063 ; [.0E60.0020.0002.0063] # LATIN SMALL LETTER C

The string 'a' has a sort key (assuming "regular" algorithmic collation with no localized collation tables, etc.) of:

0E33 0000	0020 0000	0002 0000

'à' has a sort key of:

0E33 0000	0020 0035 0000	0002 0002 0000

'arc' has a sort key of:

0E33 0FC0 0E60 0000	0020 0020 0020 0000	0002 0002 0002 0000

The keys are compared element by element. 'a' and 'à' first differ at the second level (0000 vs. 0035), and 'à' and 'arc' already differ at the first level (0000 vs. 0FC0). As a result, the ordering should be 'a', 'à', 'arc'.
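To double-check that ordering, here is a small sketch that takes the three sort keys exactly as written above and compares them element by element. Python's list comparison does that natively; the variable names are just for illustration.

```python
# The three sort keys from this message, one 16-bit weight per element.
keys = {
    "a": [0x0E33, 0x0000, 0x0020, 0x0000, 0x0002, 0x0000],
    "accented a": [0x0E33, 0x0000, 0x0020, 0x0035, 0x0000,
                   0x0002, 0x0002, 0x0000],
    "arc": [0x0E33, 0x0FC0, 0x0E60, 0x0000,
            0x0020, 0x0020, 0x0020, 0x0000,
            0x0002, 0x0002, 0x0002, 0x0000],
}

# Lists compare lexicographically: the first differing element decides.
ordered = sorted(keys, key=keys.get)
print(ordered)
```

The accented "a" and "arc" differ at the second element (0000 vs. 0FC0), so the accent never gets a chance to matter in that comparison.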

Glenn Andreas                      email@hidden 
 <http://www.gandreas.com/> oh my!
Mad, Bad, and Dangerous to Know
