Re: Unicode case conversion
- Subject: Re: Unicode case conversion
- From: Glenn Andreas <email@hidden>
- Date: Thu, 25 Nov 2004 10:44:06 -0600
At 9:37 PM -0700 11/24/04, Robbie Haertel wrote:
Levenshtein edit distance of a Mayan language. Have to compare each
character one-by-one. The old Spanish priest often writes b, u, and w
as 'V', but this is one of the few cases (there are a few others) I
want to change the case. I'm already necessarily comparing
character-by-character due to the algorithm, so it isn't a problem. I
can already guarantee that there will be no fancy characters other
than "option-3" (the English pound symbol). It may seem like just
checking for 'V' is an option, but it is more complicated than that.
There are some carbon functions, I believe, but I don't know anything
about carbon. Also, I think there are some functions for wide
characters, but I don't think it is the same thing.
Thanks,
robbie
Interesting.
If the only fancy (i.e., non-ASCII) character is the English pound
symbol, you can just use the regular old C tolower(), since the only
letters you'll have are ASCII. If you have to worry about things like
accented characters (and I'm assuming even old Spanish would have
them), you'll have to go further.
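For the ASCII-only case, here is a minimal sketch in plain C++. The names asciiLower and asciiLowerString are hypothetical, and char16_t stands in for unichar so the snippet compiles without Foundation. It range-checks by hand instead of calling tolower(), because passing tolower() a value that does not fit in an unsigned char is undefined behaviour, and a unichar can hold such values.

```cpp
#include <string>

// Hypothetical helper: lowercase only ASCII 'A'..'Z'. Every other code unit,
// including the pound sign (U+00A3), is returned unchanged.
// char16_t stands in for Cocoa's unichar in this sketch.
inline char16_t asciiLower(char16_t c) {
    return (c >= u'A' && c <= u'Z')
        ? static_cast<char16_t>(c + (u'a' - u'A'))
        : c;
}

// Apply asciiLower to every code unit of a UTF-16 string.
inline std::u16string asciiLowerString(const std::u16string &s) {
    std::u16string out;
    out.reserve(s.size());
    for (char16_t c : s) {
        out.push_back(asciiLower(c));
    }
    return out;
}
```

Since the edit-distance loop already walks the text one character at a time, calling something like asciiLower on each character as it is read should be enough when no accented letters can occur.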
What I'd do is take advantage of Obj-C++ and create a map to cache
the results of converting via NSString's "lowercaseString" method
whenever the result is a single character, something like:
    std::map<unichar, unichar> lowerMap;
    NSCharacterSet *upperSet = [NSCharacterSet uppercaseLetterCharacterSet];
    for (unsigned i = 0; i < [str length]; i++) {
        ...
        unichar c = [str characterAtIndex: i];
        std::map<unichar, unichar>::iterator m = lowerMap.find(c);
        if (m != lowerMap.end()) {
            c = m->second; // get what we mapped into
        } else {
            if ([upperSet characterIsMember: c]) {
                // lowercase just this one character
                NSString *lowerStr = [[str substringWithRange:
                    NSMakeRange(i, 1)] lowercaseString];
                if ([lowerStr length] == 1) {
                    // converted to a single character
                    unichar lowerC = [lowerStr characterAtIndex: 0];
                    lowerMap[c] = lowerC;
                    c = lowerC;
                } else {
                    // lower case version isn't a single character,
                    // keep as uppercase
                    lowerMap[c] = c;
                }
            } else {
                // not an uppercase, leave as is, or do something else
                lowerMap[c] = c; // but enter into map for the next
                                 // time we see it
            }
        }
        // c is now in your "canonical form"
        ...
    }
The other advantage of this is that you can perform other
canonicalization - say this was Latin and you wanted to canonicalize
I/J and U/V, you could just add before the for loop:
lowerMap['V'] = 'u';
lowerMap['v'] = 'u';
lowerMap['I'] = 'j';
lowerMap['i'] = 'j';
You can fill in that map with as many special cases as you want as
well (so you could do "for (char i='A';i<='Z';i++) lowerMap[i] = i;"
before putting in the special case handling for "V", and then
everything else will stay as uppercase).
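The pre-seeding idea can be sketched in portable C++ as follows. This is only an illustration under stated assumptions: Canonicalizer and makeLatinCanonicalizer are hypothetical names, char16_t stands in for unichar, and the fallback for characters not yet in the map is plain ASCII lowercasing, not the NSString conversion used in the loop above.

```cpp
#include <map>
#include <string>

// Hypothetical sketch of a map-backed canonicalizer. Seeded entries always
// win. Anything else falls back to ASCII lowercasing (an assumption made so
// the sketch runs without Foundation) and is cached for the next lookup.
class Canonicalizer {
public:
    // Register a special case, e.g. seed(u'V', u'u').
    void seed(char16_t from, char16_t to) { map_[from] = to; }

    // Return the canonical form of one code unit, caching the answer.
    char16_t canonicalChar(char16_t c) {
        std::map<char16_t, char16_t>::iterator it = map_.find(c);
        if (it != map_.end()) {
            return it->second; // seeded or previously cached
        }
        char16_t result = (c >= u'A' && c <= u'Z')
            ? static_cast<char16_t>(c - u'A' + u'a')
            : c;
        map_[c] = result; // remember it for the next time we see it
        return result;
    }

    // Canonicalize a whole UTF-16 string, one code unit at a time.
    std::u16string canonicalString(const std::u16string &s) {
        std::u16string out;
        out.reserve(s.size());
        for (char16_t c : s) {
            out.push_back(canonicalChar(c));
        }
        return out;
    }

private:
    std::map<char16_t, char16_t> map_;
};

// The Latin example from the text: fold I/J to 'j' and U/V to 'u'.
inline Canonicalizer makeLatinCanonicalizer() {
    Canonicalizer c;
    c.seed(u'V', u'u');
    c.seed(u'v', u'u');
    c.seed(u'I', u'j');
    c.seed(u'i', u'j');
    return c;
}
```

Because seeded entries are found before the fallback runs, the special cases take priority without any extra branching in the per-character path.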
--
Glenn Andreas email@hidden
<http://www.gandreas.com/> oh my!
Mad, Bad, and Dangerous to Know
Cocoa-dev mailing list (email@hidden)