Re: regexkit [Using NSPredicate to parse strings]
- Subject: Re: regexkit [Using NSPredicate to parse strings]
- From: Jens Alfke <email@hidden>
- Date: Tue, 4 Mar 2008 13:08:29 -0800
On 4 Mar '08, at 10:19 AM, Jonathan Dann wrote:
> I'm most likely going to have to support many text encodings. Say
> if I'm writing a document in Japanese (Mac OS), will I have to
> convert that to UTF-8 before the methods of something like RegexKit
> would work? Any caveats you know of that I need to be aware of? I'm
> learning by doing.
It's not the encoding that's an issue, at least not at the point
you're running a regex. Presumably you had to deal with encodings just
to get the data into an NSString in the first place.
The limitation of PCRE is in its handling of character classes. IIRC,
PCRE doesn't consider any character above 0x7F to be alphanumeric, so
regex character types like "\w" won't match non-ASCII letters. Worse,
it detects word boundaries ("\b") by looking for a transition between
word and non-word characters. Here the problem isn't just that it
doesn't know about non-ASCII word characters; it's that some languages
have more complex rules for detecting word breaks. In Japanese and
Thai, for example, words are often written without spaces in between
them, and you have to use linguistic rules to determine where the
breaks go. ICU knows how to do this.
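To make the "\w" point concrete, here is a rough sketch in Python. This is not PCRE or ICU: it assumes Python's built-in `re` module, whose `re.ASCII` flag restricts "\w" to ASCII letters, digits, and underscore, which resembles the ASCII-only behavior described above. The sample string is made up for illustration.

```python
import re

# Sample text (hypothetical): "naive" with a diaeresis, "cafe" with an
# acute accent, the kanji for Tokyo, and a plain ASCII word.
text = "na\u00efve caf\u00e9 東京 abc"

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented letters split
# words apart and the kanji are skipped entirely.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Without the flag, Python's \w is Unicode-aware and keeps the words whole.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf', 'abc']
print(unicode_words)  # ['naïve', 'café', '東京', 'abc']
```

Note that the Unicode-aware run still treats 東京 as a single blob of word characters. It knows the characters are letters, but it applies no linguistic rules to find word breaks inside a run of kanji or kana.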
The problem I ran into with PCRE is that I was implementing a typical
filter field (the one in Safari RSS) that needed to match word
prefixes. So the search regex began with "\b" to match the word break.
But it didn't work correctly on most kanji text.
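A minimal sketch of that kind of prefix filter, again in Python's `re` module rather than PCRE, so treat it as an illustration of the failure mode and not a reproduction of the original code. The function name `matches_word_prefix` and the sample sentences are hypothetical.

```python
import re

def matches_word_prefix(query, text):
    """Return True if `query` appears in `text` at a \\b word boundary."""
    pattern = r"\b" + re.escape(query)
    return re.search(pattern, text) is not None

english = "I live in Tokyo"
japanese = "私は東京に住んでいます"  # "I live in Tokyo", written without spaces

# Works for space-separated text: there is a boundary before "Tokyo".
print(matches_word_prefix("Tok", english))    # True

# Fails for Japanese: the character before 東京 is は, which is also a
# word character, so \b sees no word/non-word transition there.
print(matches_word_prefix("東京", japanese))  # False

# The only boundary \b finds is at the very start of the string.
print(matches_word_prefix("私", japanese))    # True
```

An engine limited to ASCII word characters fails here too, for the opposite reason: none of the characters count as word characters, so there is still no transition for "\b" to detect. Finding the break before 東京 takes dictionary-based segmentation of the kind ICU provides.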
(Now, this was a few years ago. It's possible that PCRE's Unicode
support has been improved since. If this is important to you, go
check the docs.)
—Jens
_______________________________________________
Cocoa-dev mailing list (email@hidden)