Re: Searching for "whole word" in NSString
- Subject: Re: Searching for "whole word" in NSString
- From: Aki Inoue <email@hidden>
- Date: Wed, 6 Feb 2008 14:19:01 -0800
John,
Right now AppKit is using UCFindTextBreak under the covers, and that
Unicode Utilities API function is, in turn, implemented on top of ICU.
- When double-clicking, any Kanji or hiragana is treated as its own
"word." e.g. if I double click on "東京大学院生" I get a single
kanji selected, or if I double click on "こんにちは" I get a
single kana selected. Any consecutive katakana do count as one word,
so "マウスクリック" or "オフィスレーディー" are
fully selected on double-click, even though the IME actually
considers them to be two separate words.
This is the default word breaking behavior for non-Japanese locales.
With the Japanese locale (which you can select in the Word Break
popup of the International preferences), UCFindTextBreak uses a
slightly more Japanese-friendly algorithm.
We have plans to enhance the AppKit user experience by integrating
CFStringTokenizer in the future.
Thanks for your input,
Aki
On 2008/02/06, at 9:25, John Stiles wrote:
Well, in practical terms, it looks like AppKit doesn't try very hard
for Japanese:
- When double-clicking, any Kanji or hiragana is treated as its own
"word." e.g. if I double click on "東京大学院生" I get a single
kanji selected, or if I double click on "こんにちは" I get a
single kana selected. Any consecutive katakana do count as one word,
so "マウスクリック" or "オフィスレーディー" are
fully selected on double-click, even though the IME actually
considers them to be two separate words.
- Search results with "Full word" enabled seem to follow this same
pattern, as nonsensical as it may seem. I can't find the full word
"東京" in "東京大学院生", but I can find
"東" just fine. I can't find "マウス", but I can
find "マウスクリック".
So while your code may not work for all languages, AppKit does not
seem to do a better job. If anything I'd call AppKit's behavior
pretty broken.
This is all using Leopard 10.5.1, by the way.
I might look into UCFindTextBreak since it seems to exist all the
way back to 10.0 and will probably work just fine for me. In
actuality I don't think I need to support Asian languages that well
anyway; in my case, this is for editing code/scripts, not
freeform text. While Asian input is probably going to be supported,
it should only be used in comments or string literals, and I don't
expect that users will have high expectations for searching within
it. (And it looks like the bar is set appropriately low for me
anyway ;) )
Mike Wright wrote:
On Feb. 5, 2008, at 22:30 , Deborah Goldsmith wrote:
This doesn't work for all languages. What constitutes a "word" is
rather more complex than this. In Thai, for a particularly
egregious example, you can't find word boundaries without looking
up the words in a dictionary.
Whole-word searching seems pretty unnecessary (and virtually
impossible) for Japanese and Chinese. Does it really make sense for
Thai? There are lots of skeptics regarding "word" even being a
valid concept in reference to Chinese languages. San Duanmu (The
Phonology of Standard Chinese) makes a good case for the concept in
Mandarin, at least, but I can't see any way that it could be used
as a basis for whole-word text searches. And long strings of
hiragana in Japanese seem to require human intuition to find word
breaks. (And better intuition than mine.)
On Tiger, you can use either the double-click API in Cocoa, or
UCFindTextBreak, to find word boundaries. On Leopard or later, use
CFStringTokenizer. All of them will do the right thing for word
boundaries in every language we support.
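
As a rough illustration of the Leopard route, a whole-word check built on CFStringTokenizer might look something like the sketch below. RangeFallsOnWordBoundaries is a hypothetical helper name, not part of any Apple API and not code from this thread; the sketch assumes the current locale is acceptable for tokenizing and that a hit counts as "whole" when it starts where one word token starts and ends where one word token ends.

```objc
#import <Foundation/Foundation.h>
#include <CoreFoundation/CoreFoundation.h>

// Hypothetical helper (not an Apple API): returns YES when `range` begins at
// the start of a word token and ends at the end of a word token, as reported
// by CFStringTokenizer for the current locale.
static BOOL RangeFallsOnWordBoundaries(NSString *text, NSRange range)
{
    if (range.location == NSNotFound || range.length == 0 ||
        NSMaxRange(range) > [text length]) {
        return NO;
    }

    CFLocaleRef locale = CFLocaleCopyCurrent();
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                (__bridge CFStringRef)text,
                                CFRangeMake(0, (CFIndex)[text length]),
                                kCFStringTokenizerUnitWord,
                                locale);

    BOOL startsOnBoundary = NO;
    BOOL endsOnBoundary = NO;

    // Walk the word tokens and compare their edges with the edges of `range`.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange token = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        if ((NSUInteger)token.location == range.location) {
            startsOnBoundary = YES;
        }
        if ((NSUInteger)(token.location + token.length) == NSMaxRange(range)) {
            endsOnBoundary = YES;
        }
        if ((NSUInteger)token.location >= NSMaxRange(range)) {
            break;  // tokens past the end of the hit cannot matter
        }
    }

    CFRelease(tokenizer);
    CFRelease(locale);
    return startsOnBoundary && endsOnBoundary;
}
```

A caller would first locate a candidate with -rangeOfString:options:range: and then pass the resulting range to this check. Because the test only looks at the two outer edges, a multi-word phrase can pass as long as it begins and ends on token edges.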
Is there a list somewhere of the supported languages? (I assume you
mean supported by those APIs, and that writing systems like
Japanese and Chinese that don't include some set of word delimiters
are excluded. And Thai and other Brahmi-derived scripts?)
Did you happen to see my response to Douglas Davidson the next day
(Jan 30, 2008 Message-ID: <email@hidden>)? Here's a restatement of it:
From my perspective, the problem is that the "whole words" to be
searched for are not always words in any linguistic sense. Judging
by the TextEdit Find panel, the double-click API doesn't seem to be
capable of treating "a:" as a word, but it's just the kind of thing
that I might want to perform a whole-word search for, trying to
find something like " a: " or "\na:\n" in a mishmash of strings
like "5a:--w7".
And, as John Stiles pointed out somewhere, the Text Edit Find panel
can't find something like "way home" doing a Full Word search in
text containing: I don't know the way home. Go away homewrecker.
Look at the way Homer ran.
My method can find the desired target text in both of those cases--
at least in English, and presumably in other Roman-based scripts.
And, it's pretty easy to change the set of word delimiters.
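
To make the "change the set of word delimiters" point concrete, here is a minimal sketch. WordCharacterSetWithExtras is a hypothetical helper name, and the choice of extra characters is only an example; the sketch assumes ARC, so the mutable copy is not released by hand.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: the alphanumeric set used for boundary tests, widened
// with whatever extra characters should also count as part of a word.
static NSCharacterSet *WordCharacterSetWithExtras(NSString *extraCharacters)
{
    NSMutableCharacterSet *set =
        [[NSCharacterSet alphanumericCharacterSet] mutableCopy];
    [set addCharactersInString:extraCharacters];
    return set;
}
```

For example, passing @"_" makes the underscore a word character, so a hit on "foo" inside "foo_bar" would no longer pass a boundary test that consults this set. Everything not in the set keeps acting as a delimiter.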
So, maybe "whole phrase" is a more accurate term than "whole word",
but it's the kind of behavior that I expect--and that I think my
customers expect. (My customers aren't real big on providing
feedback as long as they're happy, but I figure no news is good
news.)
Will UCFindTextBreak do any better in this kind of case? Or
CFStringTokenizer? (Although, given my customer base, I don't
expect to use any Leopard-only APIs for a long time.)
Mike Wright
http://www.idata3.com/
http://www.raccoonbend.com/
Deborah Goldsmith
Apple Inc.
email@hidden
On Jan 29, 2008, at 12:28 PM, Mike Wright wrote:
On Jan 29, 2008, at 10:12:21 -0800, John Stiles <email@hidden> wrote:
I'm trying to find a substring in an NSString. But I want to find
whole words (e.g. like in the Find panel when you choose "Full word"
from the popup, rather than "Contains" or "Starts With").
Unless I'm missing something, it looks like NSString's
-rangeOfString:options:range:locale: doesn't have an option for
finding whole words.
How does the Find panel do it, then? Am I going to have to "roll my
own" code for string searching? That sounds error-prone to me; I'd
much rather have the OS do it.
Here's a Tiger approach that's worked pretty well for me (or, at
least, no non-English-using customers have complained--so far).
NSString *fieldContent;  // the string I'm searching in
NSString *targetString;  // the string to be found
NSRange hitRange;        // the range of targetString found within fieldContent
NSRange testRange;       // in the beginning, this covers all of fieldContent
BOOL caseSensitive;      // specified by the user
BOOL isWholeWord = NO;   // this is used in two sequential tests

// set up the search mask
unsigned searchMask = NSLiteralSearch;
if (!caseSensitive)
    searchMask |= NSCaseInsensitiveSearch;

// set up the character set for words
NSCharacterSet *wordCharacterSet = [NSCharacterSet alphanumericCharacterSet];

// look for targetString in fieldContent
hitRange = [fieldContent rangeOfString:targetString
                               options:searchMask
                                 range:testRange];

// if we found something, do the whole-word test
if (hitRange.location != NSNotFound)
{
    // test the beginning of targetString
    isWholeWord = ((hitRange.location == 0) ||
        (![wordCharacterSet characterIsMember:
            [fieldContent characterAtIndex:(hitRange.location - 1)]]));

    // if the beginning is okay, test the end of targetString
    if (isWholeWord)
    {
        unsigned nextCharPosition = hitRange.location + hitRange.length;
        isWholeWord = ((nextCharPosition == [fieldContent length]) ||
            (![wordCharacterSet characterIsMember:
                [fieldContent characterAtIndex:nextCharPosition]]));
    }
}
Finally:
if (isWholeWord)
{
// show it to the user
}
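
The snippet above tests only the first hit in testRange. One way to keep going after a hit that fails the whole-word test is to shrink the search range and try again, roughly as sketched below. RangeOfWholeWord is a hypothetical helper name, and resuming one character past the start of a rejected hit is an assumption about the desired behavior, not something stated in this thread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the range of the first occurrence of `target`
// in `content` that has no word character immediately before or after it,
// or {NSNotFound, 0} if there is no such occurrence.
static NSRange RangeOfWholeWord(NSString *content,
                                NSString *target,
                                NSStringCompareOptions mask,
                                NSCharacterSet *wordChars)
{
    NSUInteger length = [content length];
    NSRange searchRange = NSMakeRange(0, length);

    while (searchRange.length > 0) {
        NSRange hit = [content rangeOfString:target
                                     options:mask
                                       range:searchRange];
        if (hit.location == NSNotFound || hit.length == 0) {
            break;  // nothing (more) to test
        }

        NSUInteger end = NSMaxRange(hit);
        BOOL startOK = (hit.location == 0) ||
            ![wordChars characterIsMember:[content characterAtIndex:hit.location - 1]];
        BOOL endOK = (end == length) ||
            ![wordChars characterIsMember:[content characterAtIndex:end]];
        if (startOK && endOK) {
            return hit;
        }

        // Rejected: resume the search one character past the start of this hit.
        NSUInteger next = hit.location + 1;
        searchRange = NSMakeRange(next, length - next);
    }
    return NSMakeRange(NSNotFound, 0);
}
```

Called with NSLiteralSearch | NSCaseInsensitiveSearch and the alphanumeric character set, this applies the same two boundary tests as the snippet above to each successive hit instead of only the first one.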
Hope this helps. (And, since it's not just copied from my own
code, I hope it doesn't contain any serious errors.)
Regards,
Mike Wright
http://www.idata3.com/
http://www.raccoonbend.com/
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the
list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden