Re: Searching for "whole word" in NSString
- Subject: Re: Searching for "whole word" in NSString
- From: Deborah Goldsmith <email@hidden>
- Date: Wed, 06 Feb 2008 19:03:53 -0800
Comments below.
Deborah Goldsmith
Apple Inc.
email@hidden
On Feb 5, 2008, at 9:27 PM, Mike Wright wrote:
On Feb. 5, 2008, at 22:30, Deborah Goldsmith wrote:
This doesn't work for all languages. What constitutes a "word" is
rather more complex than this. In Thai, for a particularly
egregious example, you can't find word boundaries without looking
up the words in a dictionary.
Whole-word searching seems pretty unnecessary (and virtually
impossible) for Japanese and Chinese. Does it really make sense for
Thai? There are lots of skeptics regarding "word" even being a valid
concept in reference to Chinese languages. San Duanmu (The Phonology of
Standard Chinese) makes a good case for the concept in Mandarin, at
least, but I can't see any way that it could be used as a basis for
whole-word text searches. And long strings of hiragana in Japanese
seem to require human intuition to find word breaks. (And better
intuition than mine.)
It works fine for Thai. CFStringTokenizer also contains dictionary-
based tokenizers for Japanese and Chinese.
On Tiger, you can use either the double-click API in Cocoa, or
UCFindTextBreak, to find word boundaries. On Leopard or later, use
CFStringTokenizer. All of them will do the right thing for word
boundaries in every language we support.
Is there a list somewhere of the supported languages? (I assume you
mean supported by those APIs, and that writing systems like Japanese
and Chinese that don't include some set of word delimiters are
excluded. And Thai and other Brahmi-derived scripts?)
Only CFStringTokenizer does dictionary-based analysis of Japanese and
Chinese. Thai works with any of the APIs. Other Southeast Asian
scripts (e.g. Khmer) are not currently supported.
Did you happen to see my response to Douglas Davidson the next day
(Jan 30, 2008 Message-ID: <email@hidden>)? Here's a restatement of it:
From my perspective, the problem is that the "whole words" to be
searched for are not always words in any linguistic sense. Judging
by the TextEdit Find panel, the double-click API doesn't seem to be
capable of treating "a:" as a word, but it's just the kind of thing
that I might want to perform a whole-word search for, trying to find
something like " a: " or "\na:\n" in a mishmash of strings like
"5a:--w7".
And, as John Stiles pointed out somewhere, the TextEdit Find panel
can't find something like "way home" with a Full Word search in
text containing: "I don't know the way home. Go away homewrecker.
Look at the way Homer ran."
My method can find the desired target text in both of those cases--
at least in English, and presumably in other Roman-based scripts.
And, it's pretty easy to change the set of word delimiters.
So, maybe "whole phrase" is a more accurate term than "whole word",
but it's the kind of behavior that I expect--and that I think my
customers expect. (My customers aren't real big on providing
feedback as long as they're happy, but I figure no news is good news.)
Will UCFindTextBreak do any better in this kind of case? Or
CFStringTokenizer? (Although, given my customer base, I don't expect
to use any Leopard-only APIs for a long time.)
The definition of "word" comes from the Unicode Standard, specifically
Unicode Standard Annex #29, "Text Boundaries".
Mike Wright
http://www.idata3.com/
http://www.raccoonbend.com/
Deborah Goldsmith
Apple Inc.
email@hidden
On Jan 29, 2008, at 12:28 PM, Mike Wright wrote:
On Jan 29, 2008, at 10:12:21 -0800, John Stiles <email@hidden> wrote:
I'm trying to find a substring in an NSString. But I want to find whole
words (e.g. like in the Find panel when you choose "Full word" from the
popup, rather than "Contains" or "Starts With").
Unless I'm missing something, it looks like NSString's
-rangeOfString:options:range:locale: doesn't have an option for finding
whole words.
How does the Find panel do it, then? Am I going to have to "roll my own"
code for string searching? That sounds error-prone to me; I'd much
rather have the OS do it.
Here's a Tiger approach that's worked pretty well for me (or, at
least, no non-English-using customers have complained--so far).
NSString *fieldContent;   // the string I'm searching in
NSString *targetString;   // the string to be found
NSRange hitRange;         // the range of targetString found within fieldContent
NSRange testRange;        // in the beginning, this covers all of fieldContent
BOOL caseSensitive;       // specified by the user
BOOL isWholeWord = NO;    // this is used in two sequential tests
// set up the search mask
unsigned searchMask = NSLiteralSearch;
if (! caseSensitive)
    searchMask |= NSCaseInsensitiveSearch;
// set up the character set for words
NSCharacterSet *wordCharacterSet = [NSCharacterSet alphanumericCharacterSet];
// look for targetString in fieldContent
hitRange = [fieldContent rangeOfString:targetString
                               options:searchMask
                                 range:testRange];
// if we found something, do the whole-word test
if (hitRange.location != NSNotFound)
{
    // test the beginning of targetString
    isWholeWord = ((hitRange.location == 0) ||
        (! [wordCharacterSet characterIsMember:
            [fieldContent characterAtIndex:(hitRange.location - 1)]]));
    // if the beginning is okay, test the end of targetString
    if (isWholeWord)
    {
        unsigned nextCharPosition = hitRange.location + hitRange.length;
        isWholeWord = ((nextCharPosition == [fieldContent length]) ||
            (! [wordCharacterSet characterIsMember:
                [fieldContent characterAtIndex:nextCharPosition]]));
    }
}
Finally:
if (isWholeWord)
{
    // show it to the user
}
Hope this helps. (And, since it's not just copied from my own
code, I hope it doesn't contain any serious errors.)
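A note on the snippet above: it tests only the first occurrence of targetString, so if that first hit fails the boundary test (for example, the "way home" inside "away homewrecker"), the search ends without looking any further. One way to handle that is to keep searching from just past each rejected hit. What follows is a minimal sketch of that loop in plain C (which also compiles as Objective-C); it is not the original poster's code. The names find_whole_word and is_word_char are made up for illustration, only ASCII letters and digits count as word characters, and the comparison is case-sensitive.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if c counts as a "word" character.
   ASCII letters and digits only; replace this to change the delimiters. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) != 0;
}

/* Hypothetical helper: returns the offset of the first whole-word
   occurrence of target in text at or after start, or -1 if there is none. */
static long find_whole_word(const char *text, const char *target, size_t start)
{
    size_t text_len = strlen(text);
    size_t target_len = strlen(target);

    if (target_len == 0 || start > text_len)
        return -1;

    const char *hit = strstr(text + start, target);
    while (hit != NULL) {
        size_t pos = (size_t)(hit - text);
        size_t end = pos + target_len;

        /* Same two tests as the snippet above: the character before the
           hit and the character after it must not be word characters. */
        int starts_clean = (pos == 0) || !is_word_char(text[pos - 1]);
        int ends_clean = (end == text_len) || !is_word_char(text[end]);
        if (starts_clean && ends_clean)
            return (long)pos;

        /* Rejected hit: resume one character later so that overlapping
           candidates are not skipped. */
        hit = strstr(hit + 1, target);
    }
    return -1;
}
```

With these definitions, find_whole_word("away homewrecker, way home", "way home", 0) skips the hit inside "away homewrecker" and returns 18, the offset of the final "way home". In Cocoa, the same effect could presumably be had by moving the start of testRange past each rejected hit and calling -rangeOfString:options:range: again.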
Regards,
Mike Wright
http://www.idata3.com/
http://www.raccoonbend.com/
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden