Re: Searching for "whole word" in NSString
- Subject: Re: Searching for "whole word" in NSString
- From: John Stiles <email@hidden>
- Date: Wed, 06 Feb 2008 09:25:17 -0800
Well, in practical terms, it looks like AppKit doesn't try very hard for
Japanese:
- When double-clicking, any kanji or hiragana is treated as its own
"word," e.g. if I double-click on "東京大学院生" I get a single kanji
selected, or if I double-click on "こんにちは" I get a single kana
selected. Any consecutive katakana do count as one word, so
"マウスクリック" or "オフィスレーディー" are fully selected on double-click,
even though the IME actually considers them to be two separate words.
- Search results with "Full word" enabled seem to follow this same
pattern, as nonsensical as it may seem. I can't find the full word
"東京" in "東京大学院生", but I can find "東" just fine. I can't find
"マウス", but I can find "マウスクリック".
So while your code may not work for all languages, AppKit does not seem
to do a better job. If anything I'd call AppKit's behavior pretty broken.
This is all using Leopard 10.5.1, by the way.
I might look into UCFindTextBreak since it seems to exist all the way
back to 10.0 and will probably work just fine for me. In actuality I
don't think I need to support Asian languages that well anyway—in my
case, this is for editing code/scripts, not freeform text. While Asian
input is probably going to be supported, it should only be used in
comments or string literals, and I don't expect that users will have
high expectations for searching within it. (And it looks like the bar is
set appropriately low for me anyway ;)  )
Mike Wright wrote:
On Feb. 5, 2008, at 22:30 , Deborah Goldsmith wrote:
This doesn't work for all languages. What constitutes a "word" is
rather more complex than this. In Thai, for a particularly egregious
example, you can't find word boundaries without looking up the words
in a dictionary.
Whole-word searching seems pretty unnecessary (and virtually
impossible) for Japanese and Chinese. Does it really make sense for
Thai? There are lots of skeptics regarding "word" even being a valid
concept in reference to Chinese languages. San Duanmu (The Phonology of
Standard Chinese) makes a good case for the concept in Mandarin, at
least, but I can't see any way that it could be used as a basis for
whole-word text searches. And long strings of hiragana in Japanese
seem to require human intuition to find word breaks. (And better
intuition than mine.)
On Tiger, you can use either the double-click API in Cocoa, or
UCFindTextBreak, to find word boundaries. On Leopard or later, use
CFStringTokenizer. All of them will do the right thing for word
boundaries in every language we support.
Is there a list somewhere of the supported languages? (I assume you
mean supported by those APIs, and that writing systems like Japanese
and Chinese that don't include some set of word delimiters are
excluded. And Thai and other Brahmi-derived scripts?)
Did you happen to see my response to Douglas Davidson the next day
(Jan 30, 2008 Message-ID:
<email@hidden>)? Here's a
restatement of it:
From my perspective, the problem is that the "whole words" to be
searched for are not always words in any linguistic sense. Judging by
the TextEdit Find panel, the double-click API doesn't seem to be
capable of treating "a:" as a word, but it's just the kind of thing
that I might want to perform a whole-word search for, trying to find
something like " a: " or "\na:\n" in a mishmash of strings like
"5a:--w7".
And, as John Stiles pointed out somewhere, the Text Edit Find panel
can't find something like "way home" doing a Full Word search in text
containing: I don't know the way home. Go away homewrecker. Look at
the way Homer ran.
My method can find the desired target text in both of those cases--at
least in English, and presumably in other Roman-based scripts. And,
it's pretty easy to change the set of word delimiters.
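A minimal sketch of the delimiter test described above, written in plain C rather than Cocoa so that it stands alone. It is an illustration under stated assumptions, not the code from the original post: find_whole_phrase and is_word_char are hypothetical names, isalnum() is an ASCII-only stand-in for +[NSCharacterSet alphanumericCharacterSet], the comparison is case-sensitive, and unlike the single-hit test in the post it keeps scanning after a rejected hit.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if c counts as part of a word. ASCII-only assumption; swap this
   predicate to change the set of word delimiters. */
static int is_word_char(char c) {
    return isalnum((unsigned char)c) != 0;
}

/* Hypothetical helper: returns the byte offset of the first occurrence of
   target in text that is not touching a word character on either side,
   or -1 if there is no such occurrence. */
static long find_whole_phrase(const char *text, const char *target) {
    assert(text != NULL && target != NULL);
    size_t text_len = strlen(text);
    size_t target_len = strlen(target);
    if (target_len == 0)
        return -1;

    const char *p = text;
    while ((p = strstr(p, target)) != NULL) {
        size_t start = (size_t)(p - text);
        size_t end = start + target_len;
        /* the hit is acceptable at the start of text or after a delimiter */
        int ok_before = (start == 0) || !is_word_char(text[start - 1]);
        /* and at the end of text or before a delimiter */
        int ok_after = (end == text_len) || !is_word_char(text[end]);
        if (ok_before && ok_after)
            return (long)start;
        p++; /* rejected: resume one character past this hit */
    }
    return -1;
}
```

With this predicate, "way home" is rejected inside "away homewrecker" because the hit is preceded by a letter, and "a:" is rejected inside "5a:--w7" because it is preceded by a digit, while "\na:\n" is accepted.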
So, maybe "whole phrase" is a more accurate term than "whole word",
but it's the kind of behavior that I expect--and that I think my
customers expect. (My customers aren't real big on providing feedback
as long as they're happy, but I figure no news is good news.)
Will UCFindTextBreak do any better in this kind of case? Or
CFStringTokenizer? (Although, given my customer base, I don't expect
to use any Leopard-only APIs for a long time.)
Mike Wright
http://www.idata3.com/
http://www.raccoonbend.com/
Deborah Goldsmith
Apple Inc.
email@hidden
On Jan 29, 2008, at 12:28 PM, Mike Wright wrote:
On Jan 29, 2008, at 10:12:21 -0800, John Stiles
<email@hidden> wrote:
I'm trying to find a substring in an NSString. But I want to find whole
words (e.g. like in the Find panel when you choose "Full word" from the
popup, rather than "Contains" or "Starts With").
Unless I'm missing something, it looks like NSString's
-rangeOfString:options:range:locale: doesn't have an option for finding
whole words.
How does the Find panel do it, then? Am I going to have to "roll my own"
code for string searching? That sounds error-prone to me; I'd much
rather have the OS do it.
Here's a Tiger approach that's worked pretty well for me (or, at
least, no non-English-using customers have complained--so far).
NSString *fieldContent; // the string I'm searching in
NSString *targetString; // the string to be found
NSRange hitRange; // the range of targetString found within fieldContent
NSRange testRange; // in the beginning, this covers all of fieldContent
BOOL caseSensitive; // specified by the user
BOOL isWholeWord = NO; // this is used in two sequential tests
// set up the search mask
unsigned searchMask = NSLiteralSearch;
if (! caseSensitive)
    searchMask |= NSCaseInsensitiveSearch;
// set up the character set for words
NSCharacterSet *wordCharacterSet = [NSCharacterSet alphanumericCharacterSet];
// look for targetString in fieldContent
hitRange = [fieldContent rangeOfString:targetString
                               options:searchMask
                                 range:testRange];
// if we found something, do the whole-word test
if (hitRange.location != NSNotFound)
{
    // test the beginning of targetString
    isWholeWord = ((hitRange.location == 0) ||
        (! [wordCharacterSet characterIsMember:
            [fieldContent characterAtIndex:(hitRange.location - 1)]]));
    // if the beginning is okay, test the end of targetString
    if (isWholeWord)
    {
        unsigned nextCharPosition = hitRange.location + hitRange.length;
        isWholeWord = ((nextCharPosition == [fieldContent length]) ||
            (! [wordCharacterSet characterIsMember:
                [fieldContent characterAtIndex:nextCharPosition]]));
    }
}
Finally:
if (isWholeWord)
{
    // show it to the user
}
Hope this helps. (And, since it's not just copied from my own code,
I hope it doesn't contain any serious errors.)
Regards,
Mike Wright
http://www.idata3.com/
http://www.raccoonbend.com/
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden