Re: finding substring


  • Subject: Re: finding substring
  • From: Aki Inoue <email@hidden>
  • Date: Fri, 31 Mar 2006 22:55:45 -0800

At 5:42 PM -0800 3/31/06, Aki Inoue wrote:
Chuck,

1. Is this a good assumption?
It is not universal. For the majority of scripts, accents are so essential that stripping them changes the meaning.

This answer makes sense from a programmer's perspective, but from a user's perspective it might be confusing. For example, if someone searches for "San Jose", the results include San Jose, California but not San José, Costa Rica.


My English atlas shows San Jose, California and San José, Costa Rica. I suspect that most users think of the two city names as being the same, but they're not.

Do you think that stripping diacritical marks makes sense when comparing geographical names in some languages but not in others, such as localized Japanese names?
My choice of the word "scripts" probably led to a misunderstanding: I meant it as referring to natural languages, not programming languages.

Typically, for languages using the Latin script, diacritics are considered optional; however, they are essential for other languages/scripts (e.g. Japanese). Even Vietnamese, a Latin-script language, requires diacritics.

If so, is there a way to make a distinction?
Right now, there is no clear standard defining the algorithm. Asmus Freytag at Unicode is leading an effort in this area, and the draft technical report at http://www.unicode.org/reports/tr30/ defines diacritic removal. Essentially, you want to remove diacritics if the base character is Latin/Greek/Cyrillic.
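A minimal sketch of that rule of thumb (strip marks only when the base character is Latin, Greek, or Cyrillic) might look like the following. The helper names `IsLatinGreekCyrillicBase` and `StripDiacriticsForLGC` are hypothetical, and the hard-coded block ranges are a rough approximation chosen for this example; they are not taken from the TR30 draft. The sketch assumes ARC and only looks at BMP characters.

```objc
#import <Foundation/Foundation.h>

// Rough approximation (an assumption of this sketch, not TR30's definition):
// treat a UTF-16 unit as a Latin/Greek/Cyrillic base if it falls in these blocks.
static BOOL IsLatinGreekCyrillicBase(unichar c) {
    return (c >= 0x0041 && c <= 0x005A) ||  // A-Z
           (c >= 0x0061 && c <= 0x007A) ||  // a-z
           (c >= 0x00C0 && c <= 0x024F) ||  // Latin-1 letters, Latin Extended-A/B
           (c >= 0x0370 && c <= 0x03FF) ||  // Greek and Coptic
           (c >= 0x0400 && c <= 0x052F) ||  // Cyrillic, Cyrillic Supplement
           (c >= 0x1E00 && c <= 0x1FFF);    // Latin Extended Additional, Greek Extended
}

// Removes combining marks only when they follow a Latin/Greek/Cyrillic base.
// Marks on other bases (for example the Japanese voicing mark) are kept.
static NSString *StripDiacriticsForLGC(NSString *input) {
    NSString *decomposed = [input decomposedStringWithCanonicalMapping];
    NSCharacterSet *marks = [NSCharacterSet nonBaseCharacterSet];
    NSMutableString *result = [NSMutableString stringWithCapacity:[decomposed length]];
    BOOL stripping = NO;  // YES while the most recent base is Latin/Greek/Cyrillic

    for (NSUInteger i = 0; i < [decomposed length]; i++) {
        unichar c = [decomposed characterAtIndex:i];
        if ([marks characterIsMember:c]) {
            if (stripping) continue;              // drop the mark
        } else {
            stripping = IsLatinGreekCyrillicBase(c);  // new base character
        }
        [result appendString:[NSString stringWithCharacters:&c length:1]];
    }
    // Recompose so that marks which were kept rejoin their base characters.
    return [result precomposedStringWithCanonicalMapping];
}
```

With this sketch, "San José" becomes "San Jose" while a Japanese string such as "が" is left alone. Vietnamese text would still lose its marks, because it is written in the Latin script; exempting it would need language information that the characters alone do not carry.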

2. What is the best way to find a sub-string and ignore diacritical marks?
You could strip them by using -[NSString decomposedStringWithCanonicalMapping] and +[NSCharacterSet nonBaseCharacterSet].

I don't understand this suggestion. The following code returns "San José" and I was expecting it to return "San Jose":
NSLog(@"noDiacrit %@", [@"San José" decomposedStringWithCanonicalMapping]);


Could you possibly show a code snippet with the searchStr and placeName variables (used in my code sample)?
My suggestion is to pre-process the place name so that diacritics are removed beforehand. By using -decomposedStringWithCanonicalMapping, you get a fully decomposed string (diacritics are represented as independent characters, separate from the base). Then, using +nonBaseCharacterSet, you can find the accent characters and remove them manually.
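To see why the NSLog output above still reads "San José", it can help to inspect the decomposed string rather than print it: the accent is still present, only stored as a separate combining character that renders on top of the "e". The following is a small sketch under that assumption; `CountCombiningMarks` and `LogDecomposition` are hypothetical helper names, not Foundation API.

```objc
#import <Foundation/Foundation.h>

// Counts the UTF-16 units of a string that belong to nonBaseCharacterSet,
// i.e. the combining marks stored separately from their base letter.
static NSUInteger CountCombiningMarks(NSString *s) {
    NSCharacterSet *marks = [NSCharacterSet nonBaseCharacterSet];
    NSUInteger count = 0;
    for (NSUInteger i = 0; i < [s length]; i++) {
        if ([marks characterIsMember:[s characterAtIndex:i]]) count++;
    }
    return count;
}

// Logs the length and combining-mark count before and after decomposition.
static void LogDecomposition(NSString *s) {
    NSString *d = [s decomposedStringWithCanonicalMapping];
    NSLog(@"composed:   length %lu, marks %lu",
          (unsigned long)[s length], (unsigned long)CountCombiningMarks(s));
    NSLog(@"decomposed: length %lu, marks %lu",
          (unsigned long)[d length], (unsigned long)CountCombiningMarks(d));
}
```

For a "San José" that uses the precomposed é (U+00E9), the length goes from 8 to 9 after decomposition, and the extra unit is the combining acute accent U+0301. Both strings look identical when logged.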

A simple snippet.

// Work on a fully decomposed mutable copy so that each accent is its own character.
NSMutableString *mString = [[string decomposedStringWithCanonicalMapping] mutableCopy];
NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
NSRange range = NSMakeRange([mString length], 0);

// Scan backwards from the end, deleting each non-base (combining) character found.
while (range.location > 0) {
    range = [mString rangeOfCharacterFromSet:nonBaseSet options:NSBackwardsSearch range:NSMakeRange(0, range.location)];
    if (range.length == 0) break; // no combining characters left
    [mString deleteCharactersInRange:range];
}
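Applied to the searchStr and placeName variables asked about earlier, the approach could be sketched as follows. The function names `StripDiacritics`, `PlaceNameContains`, and `PlaceNameContainsUsingOption` are hypothetical, and the code assumes ARC (under manual reference counting the mutable copy would need to be autoreleased). The second search function relies on `NSDiacriticInsensitiveSearch`, a comparison option that only exists on Mac OS X 10.5 and later, so it is an alternative only where that system version can be assumed.

```objc
#import <Foundation/Foundation.h>

// Decomposes the string and deletes every combining character from it.
static NSString *StripDiacritics(NSString *string) {
    NSMutableString *mString = [[string decomposedStringWithCanonicalMapping] mutableCopy];
    NSCharacterSet *nonBaseSet = [NSCharacterSet nonBaseCharacterSet];
    NSRange range = NSMakeRange([mString length], 0);

    while (range.location > 0) {
        range = [mString rangeOfCharacterFromSet:nonBaseSet
                                         options:NSBackwardsSearch
                                           range:NSMakeRange(0, range.location)];
        if (range.length == 0) break;
        [mString deleteCharactersInRange:range];
    }
    return mString;
}

// Strips diacritics from both strings, then does a case-insensitive search.
static BOOL PlaceNameContains(NSString *placeName, NSString *searchStr) {
    NSString *haystack = StripDiacritics(placeName);
    NSString *needle = StripDiacritics(searchStr);
    NSRange found = [haystack rangeOfString:needle options:NSCaseInsensitiveSearch];
    return found.location != NSNotFound;
}

// Alternative for 10.5 and later: let the search itself ignore diacritics.
static BOOL PlaceNameContainsUsingOption(NSString *placeName, NSString *searchStr) {
    NSStringCompareOptions options = NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    return [placeName rangeOfString:searchStr options:options].location != NSNotFound;
}
```

Stripping the place names once, when they are loaded, avoids repeating the work on every search; only searchStr then needs to be processed per query. Note that both variants ignore marks in every script, so they carry the same caveat about languages where diacritics are essential.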


Aki

