Re: parsing a string into words
- Subject: Re: parsing a string into words
- From: Deborah Goldsmith <email@hidden>
- Date: Mon, 04 May 2009 16:39:41 -0700
As much as I enjoy languages (I've taken a few
in college, and 10 years ago I was on a couple
Unicode e-mailing lists mainly to read the
interesting discussion about the differences),
for now I'll stick with the US+Euro+Japanese+
Latin+Hebrew solution that I have, that uses
"Separated by whitespace" will not give you words in Japanese, as
Japanese doesn't use whitespace to separate words, either (neither
does Chinese). You need to do morphological analysis in Japanese to
determine what the words are.
Deborah Goldsmith
Apple Inc.
email@hidden
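
To make the point above concrete, here is a minimal sketch in plain C of what whitespace splitting does to a UTF-8 string. The function name count_whitespace_tokens is made up for illustration, and it assumes only ASCII whitespace counts as a separator; it is not how NSString or CFStringTokenizer works internally.

```c
#include <stddef.h>

/* Hypothetical helper: treat only ASCII whitespace as a separator. */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}

/* Count the runs of non-whitespace bytes in a NUL-terminated UTF-8 string.
   This is the number of "words" a whitespace split would report. */
size_t count_whitespace_tokens(const char *s)
{
    size_t count = 0;
    int in_token = 0;

    for (; *s != '\0'; ++s) {
        if (is_ascii_space((unsigned char)*s)) {
            in_token = 0;           /* separator ends the current token */
        } else if (!in_token) {
            in_token = 1;           /* first byte of a new token */
            ++count;
        }
    }
    return count;
}
```

Called on "parsing a string" this reports 3 tokens, but on a Japanese sentence such as "私は学生です" (several words, no spaces) it reports 1, because there is no whitespace to split on. Finding the real word boundaries there needs a dictionary-based or morphological tokenizer.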
On Apr 26, 2009, at 10:50 AM, Jeffrey Oleander wrote:
At Sun, 2009-04-26, 09:01, Alastair Houghton <email@hidden> wrote:
At 2009 Apr 26, 04:33, Jeffrey Oleander wrote:
NSArray *tokens = [string
    componentsSeparatedByCharactersInSet:
        [NSCharacterSet whitespaceCharacterSet]];
No, no, no. If you read Gerriet's original post,
you would have noticed that he even explained
that what you just said won't work, because not
all languages use whitespace to separate words
like English does.
You probably want to be using CFStringTokenizer,
at least on OS X. For cross-platform code,
ICU is probably your best bet.
Thanks for that info and the pointer to the ICU.
http://userguide.icu-project.org/boundaryanalysis
BreakIterator
Character, Word, Line or Sentence
"you provide an appropriate CharacterIterator"
UChar *
I was half expecting that response because I was
aware that "not all languages use white space to
separate words", but hoping for some magic in
NSString.
Unfortunately, CFStringTokenizer is not available
in 10.3.9, and no, I do not have a chest of silver
or gold behind my pillow to run around buying
newer hardware and software, let alone doing so
every 2 years; we're in re-boot-strapping mode
in the land of the globalized Bush-Clinton-Bush-Obama
depression.
This makes ICU suspect:
"Copyright (c) 2000 - 2008 IBM and Others"
Is Apple one of the "Others"? The "using ICU"
list is a mixed bag of reputable firms and
unethical rogues, and I don't see any
additional info on who is behind "ICU".
As much as I enjoy languages (I've taken a few
in college, and 10 years ago I was on a couple
Unicode e-mailing lists mainly to read the
interesting discussion about the differences),
for now I'll stick with the US+Euro+Japanese+
Latin+Hebrew solution that I have, that uses
Objective-C and doesn't drag me into the
complications of Objective-C++ and transferring
data around to different stores based on
different programming language and framework
conventions, that I can immediately use,
can trust, and seems amenable to reasonable
later modification to handle the remote outliers.
Onward.
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden