Re: regexkit [Using NSPredicate to parse strings]

  • Subject: Re: regexkit [Using NSPredicate to parse strings]
  • From: Jens Alfke <email@hidden>
  • Date: Tue, 4 Mar 2008 13:08:29 -0800


On 4 Mar '08, at 10:19 AM, Jonathan Dann wrote:

I'm most likely going to have to support many text encodings. Say I'm writing a document in Japanese (Mac OS): will I have to convert that to UTF-8 before the methods of something like RegexKit would work? Any caveats you know of that I need to be aware of? I'm learning by doing.

It's not the encoding that's an issue, at least not at the point you're running a regex. Presumably you had to deal with encodings just to get the data into an NSString in the first place.


The limitation of PCRE is in its handling of character classes. IIRC, PCRE doesn't consider any character above 0x7F to be alphanumeric, so regex character types like "\w" won't match non-ASCII letters. Worse, it detects word boundaries ("\b") by looking for a transition between word and non-word characters. Here the problem isn't just that it doesn't know about non-ASCII word characters; it's that some languages have more complex rules for detecting word breaks. In Japanese and Thai, for example, words are often written without spaces between them, and you have to use linguistic rules to determine where the breaks go. ICU knows how to do this.
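
To make that concrete, here is a rough sketch in Python rather than PCRE or ICU. It assumes Python's standard "re" module, whose re.ASCII flag restricts "\w" to [a-zA-Z0-9_]; that is only a stand-in for the ASCII-only behaviour described above, not a reproduction of what PCRE actually does. The sample strings are made up for illustration.

```python
import re

latin = "na\u00efve caf\u00e9"   # "naïve café"
japanese = "私は学生です"          # "I am a student", written without spaces

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", latin, flags=re.ASCII)

# Without the flag, str patterns use Unicode properties for \w.
unicode_words = re.findall(r"\w+", latin)

# Every kana and kanji is a word character, so no word/non-word transition
# occurs inside the sentence and it comes back as a single "word".
japanese_words = re.findall(r"\w+", japanese)

print(ascii_words)     # ['na', 've', 'caf']
print(unicode_words)   # ['naïve', 'café']
print(japanese_words)  # ['私は学生です']
```

The last line is the point: even a Unicode-aware "\w" does not find the breaks inside the Japanese sentence, because that takes dictionary-based segmentation of the kind ICU provides.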

The problem I ran into with PCRE is that I was implementing a typical filter field (the one in Safari RSS) that needed to match word prefixes. So the search regex began with "\b" to match the word break. But it didn't work correctly on most kanji text.
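
A minimal sketch of that kind of prefix filter, again in Python's "re" module as a stand-in and not the code that shipped in Safari. The function name matches_word_prefix and its ascii_only parameter are hypothetical, invented for this illustration; ascii_only=True approximates an engine that treats only ASCII letters as word characters.

```python
import re

def matches_word_prefix(query: str, text: str, ascii_only: bool = False) -> bool:
    """Return True if `query` occurs at the start of a word in `text`.

    Hypothetical helper: "start of a word" here means whatever the regex
    engine's \\b considers a word boundary, which is exactly the weak spot.
    """
    flags = re.IGNORECASE | (re.ASCII if ascii_only else 0)
    # \b anchors the (escaped) query at a word/non-word transition.
    return re.search(r"\b" + re.escape(query), text, flags) is not None

# Works as expected on space-separated text.
print(matches_word_prefix("saf", "Apple Safari RSS"))   # True
print(matches_word_prefix("ari", "Apple Safari RSS"))   # False

# "学生" (student) is a word a reader would expect to find, but there is
# no word/non-word transition in front of it, so \b never matches.
print(matches_word_prefix("学生", "私は学生です"))        # False

# With ASCII-only word characters, even the first character of the
# sentence fails, because kanji are not word characters at all.
print(matches_word_prefix("私", "私は学生です", ascii_only=True))  # False
```

Fixing this properly means replacing "\b" with a real word-break iterator, which is the part ICU supplies.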

(Now, this was a few years ago. It's possible that PCRE's Unicode support has been improved since. If this is important to you, go check the docs.)

—Jens
