Re: parsing a string into words


  • Subject: Re: parsing a string into words
  • From: Deborah Goldsmith <email@hidden>
  • Date: Mon, 04 May 2009 16:39:41 -0700

As much as I enjoy languages (I've taken a few
in college, and 10 years ago I was on a couple
Unicode e-mailing lists mainly to read the
interesting discussion about the differences),
for now I'll stick with the US+Euro+Japanese+
Latin+Hebrew solution that I have, that uses


"Separated by whitespace" will not give you words in Japanese, because Japanese doesn't use whitespace to separate words (and neither does Chinese). You need to do morphological analysis in Japanese to determine what the words are.
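
To make the point concrete, here is a minimal C sketch (not from the original thread) that counts whitespace-separated tokens in a UTF-8 string. The function name count_whitespace_tokens and the sample sentences are illustrative assumptions, and the sketch only recognizes ASCII whitespace, which is narrower than what NSCharacterSet's whitespaceCharacterSet covers.

```c
#include <stddef.h>

/* ASCII whitespace only (an assumption of this sketch).
   Bytes >= 0x80, i.e. parts of UTF-8 multibyte sequences, never match. */
static int is_ascii_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\f' || c == '\v';
}

/* Count maximal runs of non-whitespace bytes in a NUL-terminated
   UTF-8 string. This is what "separated by whitespace" amounts to. */
size_t count_whitespace_tokens(const char *utf8) {
    size_t count = 0;
    int in_token = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; ++p) {
        if (is_ascii_space(*p)) {
            in_token = 0;          /* a separator ends the current run */
        } else if (!in_token) {
            in_token = 1;          /* first byte of a new run */
            ++count;
        }
    }
    return count;
}
```

Under these assumptions, "I am a student" counts as 4 tokens, while the Japanese sentence 私は学生です ("I am a student") counts as 1, even though it contains several words. Finding the boundaries inside that single run is the part that needs morphological analysis.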

Deborah Goldsmith
Apple Inc.
email@hidden

On Apr 26, 2009, at 10:50 AM, Jeffrey Oleander wrote:


At Sun, 2009-04-26, 09:01, Alastair Houghton <email@hidden> wrote:
At 2009 Apr 26, 04:33, Jeffrey Oleander wrote:
NSArray *tokens = [string
    componentsSeparatedByCharactersInSet:
        [NSCharacterSet whitespaceCharacterSet]];

No, no, no.  If you read Gerriet's original post,
you would have noticed that he even explained
that what you just said won't work, because not
all languages use whitespace to separate words
like English does.

You probably want to be using CFStringTokenizer,
at least on OS X.  For cross-platform code,
ICU is probably your best bet.

Thanks for that info and the pointer to ICU:
http://userguide.icu-project.org/boundaryanalysis
BreakIterator: Character, Word, Line or Sentence boundaries;
"you provide an appropriate CharacterIterator" (UChar *).

I was half expecting that response because I was
aware that "not all languages use white space to
separate words", but hoping for some magic in
NSString.
Unfortunately, CFStringTokenizer is not available
in 10.3.9, and no, I do not have a chest of silver
or gold behind my pillow to run around buying
newer hardware and software, let alone doing so
every 2 years; we're in re-boot-strapping mode
in the land of the globalized Bush-Clinton-Bush-Obama
depression.

This makes ICU suspect:
"Copyright (c) 2000 - 2008 IBM and Others"
Is Apple one of the "Others"?  The "using ICU"
list is a mixed bag of reputable firms and
unethical rogues, and I don't see any
additional info on who is behind "ICU".

As much as I enjoy languages (I've taken a few
in college, and 10 years ago I was on a couple
Unicode e-mailing lists mainly to read the
interesting discussion about the differences),
for now I'll stick with the US+Euro+Japanese+
Latin+Hebrew solution that I have, that uses
Objective-C and doesn't drag me into the
complications of Objective-C++ and transferring
data around to different stores based on
different programming language and framework
conventions, that I can immediately use,
can trust, and seems amenable to reasonable
later modification to handle the remote outliers.

Onward.



_______________________________________________

Cocoa-dev mailing list (email@hidden)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


