Re: finding substring
- Subject: Re: finding substring
- From: Chuck Soper <email@hidden>
- Date: Fri, 31 Mar 2006 23:36:06 -0800
At 10:55 PM -0800 3/31/06, Aki Inoue wrote:
At 5:42 PM -0800 3/31/06, Aki Inoue wrote:
Chuck,
1. Is this a good assumption?
It is not universal. For the majority of scripts,
accents are so essential that stripping them
changes the meaning.
This answer makes sense from a programmer's
perspective, but from a user's perspective it
might be confusing. For example, if someone
searches for "San Jose", the results include
San Jose, California but not San José, Costa
Rica.
My English atlas shows San Jose, California and
San José, Costa Rica. I suspect that most users
think of the two city names as being the same,
but they're not.
Do you think that stripping diacritical marks
makes sense when comparing some geographical
names/languages, but not all, such as localized
Japanese names?
My choice of the word "scripts" probably led to a
misunderstanding. I meant writing systems for
natural languages, not programming languages.
Typically, for languages using the Latin
script, the diacritics are considered to be
optional; however, they are essential for other
languages/scripts (e.g. Japanese). Even
Vietnamese, a Latin-script language, requires
diacritics.
This makes sense. Thanks for the explanation.
If so, is there a way to make a distinction?
Right now, there is no clear standard defining
the algorithm. There is an effort led by Asmus
Freytag at Unicode: a draft technical report,
http://www.unicode.org/reports/tr30/, defines
diacritic removal. Essentially, you
want to remove diacritics if the base character
is Latin/Greek/Cyrillic.
Very interesting.
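The rule of thumb above (strip diacritics only when the base character is Latin, Greek, or Cyrillic) could be sketched roughly as follows. This is only an illustration, not the algorithm from the draft report: the helper names IsLatinGreekCyrillic and StripSelectedDiacritics are hypothetical, the code point ranges are a coarse approximation chosen for the example, and the code assumes a current compiler with ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if `c` lies in a few common Latin, Greek, or
// Cyrillic blocks. The ranges are an illustrative approximation only.
static BOOL IsLatinGreekCyrillic(unichar c) {
    return (c <= 0x024F)                  // Basic Latin through Latin Extended-B
        || (c >= 0x0370 && c <= 0x03FF)   // Greek and Coptic
        || (c >= 0x0400 && c <= 0x04FF);  // Cyrillic
}

// Hypothetical helper: removes non-base characters, but only those that
// follow a Latin/Greek/Cyrillic base character.
static NSString *StripSelectedDiacritics(NSString *string) {
    NSString *decomposed = [string decomposedStringWithCanonicalMapping];
    NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
    NSMutableString *result =
        [NSMutableString stringWithCapacity:[decomposed length]];
    BOOL stripping = NO;  // YES while the current base character qualifies

    for (NSUInteger i = 0; i < [decomposed length]; i++) {
        unichar c = [decomposed characterAtIndex:i];
        if ([nonBaseSet characterIsMember:c]) {
            if (stripping) continue;      // drop the mark
        } else {
            stripping = IsLatinGreekCyrillic(c);
        }
        [result appendFormat:@"%C", c];
    }
    // Recompose whatever marks were kept (e.g. Japanese voiced kana).
    return [result precomposedStringWithCanonicalMapping];
}

int main(void) {
    @autoreleasepool {
        NSLog(@"%@", StripSelectedDiacritics(@"San José"));
        NSLog(@"%@", StripSelectedDiacritics(@"がぎぐ"));
    }
    return 0;
}
```

Note that a script-based test like this would still strip Vietnamese diacritics, which, as mentioned above, are not optional; a real implementation would probably need to know the language as well as the script.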
2. What is the best way to find a sub-string and ignore diacritical marks?
You could strip them by using -[NSString
decomposedStringWithCanonicalMapping] and
+[NSCharacterSet nonBaseCharacterSet].
I don't understand this suggestion. The
following code returns "San José" and I was
expecting it to return "San Jose":
NSLog(@"noDiacrit %@", [@"San José" decomposedStringWithCanonicalMapping]);
Could you possibly show a code snippet with the
searchStr and placeName variables (used in my
code sample)?
My suggestion is to pre-process the place name
so that diacritics are removed beforehand. By
using -decomposedStringWithCanonicalMapping, you
get a fully decomposed string (diacritics are
represented as independent characters, separate
from the base). Then, using +nonBaseCharacterSet,
you can find the accent characters and remove
them manually.
A simple snippet:

    // Decompose so that each diacritic becomes a separate character.
    NSMutableString *mString = [[string decomposedStringWithCanonicalMapping] mutableCopy];
    NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
    NSRange range = NSMakeRange([mString length], 0);

    // Search backwards from the end, deleting each non-base character found.
    while (range.location > 0) {
        range = [mString rangeOfCharacterFromSet:nonBaseSet
                                         options:NSBackwardsSearch
                                           range:NSMakeRange(0, range.location)];
        if (range.length == 0) break;  // no more non-base characters
        [mString deleteCharactersInRange:range];
    }
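
To connect the snippet with the searchStr and placeName variables asked about above, here is one way it might be wrapped and used. This is a sketch under assumptions: the function name StringByStrippingDiacritics is hypothetical, the sample values are made up for illustration, and the code assumes a current compiler with ARC rather than the manual memory management of the original snippet.

```objc
#import <Foundation/Foundation.h>

// Hypothetical wrapper around the snippet above: returns a copy of
// `string` with all non-base (combining) characters removed.
static NSString *StringByStrippingDiacritics(NSString *string) {
    NSMutableString *mString =
        [[string decomposedStringWithCanonicalMapping] mutableCopy];
    NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
    NSRange range = NSMakeRange([mString length], 0);

    while (range.location > 0) {
        range = [mString rangeOfCharacterFromSet:nonBaseSet
                                         options:NSBackwardsSearch
                                           range:NSMakeRange(0, range.location)];
        if (range.length == 0) break;
        [mString deleteCharactersInRange:range];
    }
    return mString;
}

int main(void) {
    @autoreleasepool {
        // Illustrative values; in the real program these come from the
        // user's query and the place-name data.
        NSString *searchStr = @"san jose";
        NSString *placeName = @"San José, Costa Rica";

        // Strip both sides so that either may contain diacritics.
        NSString *plainSearch = StringByStrippingDiacritics(searchStr);
        NSString *plainName = StringByStrippingDiacritics(placeName);

        NSRange found = [plainName rangeOfString:plainSearch
                                         options:NSCaseInsensitiveSearch];
        NSLog(@"%@", found.location != NSNotFound ? @"match" : @"no match");
    }
    return 0;
}
```

Stripping the place names once, ahead of time, and storing the result alongside the original avoids repeating the work on every search. On Mac OS X 10.5 and later, passing NSDiacriticInsensitiveSearch to -rangeOfString:options: may be a simpler alternative to stripping by hand.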
Aki
I now understand. Thanks for your detailed explanation and source.
Chuck
Cocoa-dev mailing list (email@hidden)