Re: Unicode case conversion
  • Subject: Re: Unicode case conversion
  • From: Glenn Andreas <email@hidden>
  • Date: Thu, 25 Nov 2004 10:44:06 -0600

At 9:37 PM -0700 11/24/04, Robbie Haertel wrote:
> Levenshtein edit distance of a Mayan language.  Have to compare each
> character one-by-one.  The old Spanish priest often writes b, u, and w
> as 'V', but this is one of the few cases (there are a few others) I
> want to change the case.  I'm already necessarily comparing
> character-by-character due to the algorithm, so it isn't a problem.  I
> can already guarantee that there will be no fancy characters other
> than "option-3" (the English pound symbol).  It may seem like just
> checking for 'V' is an option, but it is more complicated than that.
>
> There are some carbon functions, I believe, but I don't know anything
> about carbon.  Also, I think there are some functions for wide
> characters, but I don't think it is the same thing.
>
> Thanks,
> robbie


Interesting.

If the only fancy (i.e., non-ASCII) character is the English pound symbol, you can just use the regular old C-style tolower(), since the only letters you'll have are ASCII. If you have to worry about things like accented characters (and I'm assuming even old Spanish would have them), you'll have to go further.
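As a rough sketch of the ASCII-only route: the argument to tolower() from <ctype.h> has to be representable as an unsigned char (or be EOF), so a 16-bit character such as the pound sign should not be handed to it directly. The version below does an explicit range check instead. It assumes char16_t as a stand-in for unichar, and asciiLower and lowerAscii are made-up names for illustration, not anything from Cocoa.

```cpp
#include <string>

// Lowercase only the ASCII letters A-Z; every other code unit
// (for example U+00A3, the pound sign) is passed through untouched.
// asciiLower and lowerAscii are hypothetical names used for illustration.
inline char16_t asciiLower(char16_t c) {
    return (c >= u'A' && c <= u'Z')
        ? static_cast<char16_t>(c + (u'a' - u'A'))
        : c;
}

// Apply asciiLower to every code unit of a UTF-16 string.
inline std::u16string lowerAscii(const std::u16string &s) {
    std::u16string out(s);
    for (char16_t &c : out) {
        c = asciiLower(c);
    }
    return out;
}
```

Since the comparison is character-by-character anyway, calling asciiLower on each character as it is read should be enough; lowerAscii is only there for the whole-string case.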

What I'd do is take advantage of Obj-C++ and create a map to cache the results of converting via NSString's "lowercaseString" method where the result is a single character, something like:

    std::map<unichar, unichar> lowerMap;
    NSCharacterSet *upperSet = [NSCharacterSet uppercaseLetterCharacterSet];
    for (unsigned i = 0; i < [str length]; i++) {
        ...
        unichar c = [str characterAtIndex: i];
        std::map<unichar, unichar>::iterator m = lowerMap.find(c);
        if (m != lowerMap.end()) {
            c = m->second; // get what we mapped into
        } else {
            if ([upperSet characterIsMember: c]) {
                // lowercase just this one character via NSString
                NSString *lowerStr = [[str substringWithRange: NSMakeRange(i,1)] lowercaseString];
                if ([lowerStr length] == 1) { // converted to a single character
                    unichar lowerC = [lowerStr characterAtIndex: 0];
                    lowerMap[c] = lowerC;
                    c = lowerC;
                } else {
                    // lower case version isn't a single character, keep as uppercase
                    lowerMap[c] = c;
                }
            } else {
                // not an uppercase, leave as is, or do something else
                lowerMap[c] = c; // but enter into map for the next time we see it
            }
        }
        // c is now in your "canonical form"
        ...
    }
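For anyone who wants to try the caching idea outside of Cocoa, here is a minimal plain C++ sketch of the same lookup-then-compute pattern. It is not the loop above, just an illustration under some assumptions: char16_t stands in for unichar, the CanonicalCache class and its seed/canonical members are hypothetical names, and the expensive NSString conversion is replaced by whatever function you pass in as slowLower.

```cpp
#include <functional>
#include <map>
#include <utility>

// Hypothetical wrapper around the caching idea: slowLower stands in for the
// expensive per-character conversion (in Cocoa, going through NSString).
class CanonicalCache {
public:
    explicit CanonicalCache(std::function<char16_t(char16_t)> slowLower)
        : slowLower_(std::move(slowLower)) {}

    // Pre-seed a special case; seeded entries win over the slow conversion
    // because canonical() always checks the map first.
    void seed(char16_t from, char16_t to) { map_[from] = to; }

    // Return the canonical form of c, computing and caching it on first sight.
    char16_t canonical(char16_t c) {
        auto it = map_.find(c);
        if (it != map_.end()) {
            return it->second;           // seen (or seeded) before
        }
        char16_t result = slowLower_(c); // first time: do the slow conversion
        map_[c] = result;                // remember it for next time
        return result;
    }

private:
    std::map<char16_t, char16_t> map_;
    std::function<char16_t(char16_t)> slowLower_;
};
```

The point of the design is that the slow conversion runs at most once per distinct character, which matters when the edit-distance inner loop looks at the same characters over and over.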


The other advantage of this is that you can perform other canonicalization - say this was Latin and you wanted to canonicalize I/J and U/V, you could just add before the for loop:
lowerMap['V'] = 'u';
lowerMap['v'] = 'u';
lowerMap['I'] = 'j';
lowerMap['i'] = 'j';



You can also fill that map with as many special cases as you want. For example, you could put "for (char i='A'; i<='Z'; i++) lowerMap[i] = i;" before the special-case handling for "V", and then everything else will stay uppercase.
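To make the ordering concrete, here is a small standalone C++ sketch of that seeding: identity entries for A-Z go in first, then the special cases overwrite them. It assumes char16_t in place of unichar; makeSeedMap and applyMap are hypothetical helper names, and mapping both 'V' and 'v' to 'u' is just the illustrative choice from the Latin example, not necessarily what the Mayan data needs.

```cpp
#include <map>
#include <string>

// Build a seed map in the order described above: identity for A-Z first,
// then the special cases, so the later assignments override the earlier ones.
// makeSeedMap and applyMap are hypothetical helper names.
inline std::map<char16_t, char16_t> makeSeedMap() {
    std::map<char16_t, char16_t> m;
    for (char16_t c = u'A'; c <= u'Z'; ++c) {
        m[c] = c;      // uppercase letters stay uppercase...
    }
    m[u'V'] = u'u';    // ...except the special cases, which overwrite
    m[u'v'] = u'u';
    return m;
}

// Map each code unit through m; code units with no entry are left as is.
inline std::u16string applyMap(const std::map<char16_t, char16_t> &m,
                               const std::u16string &s) {
    std::u16string out;
    out.reserve(s.size());
    for (char16_t c : s) {
        auto it = m.find(c);
        out.push_back(it != m.end() ? it->second : c);
    }
    return out;
}
```

With this seed, applyMap(makeSeedMap(), u"VIVA vez") gives u"uIuA uez": only the V's change, and the other capitals are left alone.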



--
Glenn Andreas email@hidden <http://www.gandreas.com/> oh my!
Mad, Bad, and Dangerous to Know
References: 
 >Unicode case conversion (From: Robbie Haertel <email@hidden>)
 >Re: Unicode case conversion (From: Robbie Haertel <email@hidden>)
