Re: finding substring
- Subject: Re: finding substring
- From: Aki Inoue <email@hidden>
- Date: Fri, 31 Mar 2006 22:55:45 -0800
At 5:42 PM -0800 3/31/06, Aki Inoue wrote:
Chuck,
1. Is this a good assumption?
It is not universal. For the majority of scripts, accents are so
essential that stripping them changes the meaning.
This answer makes sense from a programmer's perspective, but from a
user's perspective it might be confusing. For example, if someone
searches for "San Jose", the results include San Jose, California
but not San José, Costa Rica.
My English atlas shows San Jose, California and San José, Costa
Rica. I suspect that most users think of the two city names as being
the same, but they're not.
Do you think that stripping diacritical marks makes sense when
comparing some geographical names/languages, but not all, such as
localized Japanese names?
My choice of the word "scripts" probably led to a misunderstanding; I
meant writing systems for natural languages, not programming
languages.
Typically, for languages using the Latin script, diacritics are
considered optional; however, they are essential for other
languages/scripts (e.g. Japanese). Even Vietnamese, a Latin-script
language, requires diacritics.
If so, is there a way to make a distinction?
Right now, there is no clear standard defining the algorithm. There
is an effort by Asmus Freytag at Unicode: a draft technical report,
http://www.unicode.org/reports/tr30/, defines diacritic removal.
Essentially, you want to remove diacritics if the base character is
Latin/Greek/Cyrillic.
2. What is the best way to find a sub-string and ignore
diacritical marks?
You could strip them by using -[NSString
decomposedStringWithCanonicalMapping] and +[NSCharacterSet
nonBaseCharacterSet].
I don't understand this suggestion. The following code returns "San
José" and I was expecting it to return "San Jose":
NSLog(@"noDiacrit %@", [@"San José" decomposedStringWithCanonicalMapping]);
Could you possibly show a code snippet with the searchStr and
placeName variables (used in my code sample)?
My suggestion is to pre-process the place name so that diacritics are
removed beforehand. By using -decomposedStringWithCanonicalMapping,
you get a fully decomposed string (diacritics are represented as
characters separate from the base). Then, using
+nonBaseCharacterSet, you can find the accent characters and remove
them manually.
A simple snippet:
    // "string" is the place name to pre-process.
    NSMutableString *mString = [[string decomposedStringWithCanonicalMapping] mutableCopy];
    NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
    NSRange range = NSMakeRange([mString length], 0);
    // Search backwards so that a deletion never shifts the part still to be scanned.
    while (range.location > 0) {
        range = [mString rangeOfCharacterFromSet:nonBaseSet
                                         options:NSBackwardsSearch
                                           range:NSMakeRange(0, range.location)];
        if (range.length == 0) break; // no more non-base characters
        [mString deleteCharactersInRange:range];
    }
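To tie this back to the searchStr and placeName variables asked about earlier, the loop above could be wrapped in a function and applied to both strings before searching. The following is a minimal sketch, not a definitive implementation: StripDiacritics and PlaceNameContains are hypothetical names, the case-insensitive option is an assumption about the desired matching, and the memory management assumes ARC (under manual reference counting the mutable copy would need an autorelease).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper wrapping the loop above: returns a decomposed copy of
// `string` with every non-base (combining) character removed. Assumes ARC.
static NSString *StripDiacritics(NSString *string) {
    NSMutableString *mString = [[string decomposedStringWithCanonicalMapping] mutableCopy];
    NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
    NSRange range = NSMakeRange([mString length], 0);

    while (range.location > 0) {
        range = [mString rangeOfCharacterFromSet:nonBaseSet
                                         options:NSBackwardsSearch
                                           range:NSMakeRange(0, range.location)];
        if (range.length == 0) break;  // no more non-base characters
        [mString deleteCharactersInRange:range];
    }
    return mString;
}

// Hypothetical helper: YES if searchStr occurs in placeName once diacritics
// are stripped from both. Case-insensitive matching is an assumption.
static BOOL PlaceNameContains(NSString *placeName, NSString *searchStr) {
    NSRange found = [StripDiacritics(placeName) rangeOfString:StripDiacritics(searchStr)
                                                      options:NSCaseInsensitiveSearch];
    return found.location != NSNotFound;
}
```

With this, a search for "San Jose" would match both "San Jose, California" and "San José, Costa Rica". On Mac OS X 10.5 and later, passing NSDiacriticInsensitiveSearch to -rangeOfString:options: may give a similar result without the pre-processing step.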
Aki