Re: regexkit [Using NSPredicate to parse strings]
- Subject: Re: regexkit [Using NSPredicate to parse strings]
- From: Jens Alfke <email@hidden>
- Date: Tue, 4 Mar 2008 22:03:09 -0800
On 4 Mar '08, at 8:55 PM, John Engelhart wrote:
It's sort of ambiguous if the /usr/lib/libicucore library is
'supported' or not. I believe the general consensus is that it's
not really there for public use, hence the missing headers, but it's
also not verboten.
Yeah, this is annoying. I don't know the reason for omitting the
headers; Deborah Goldsmith would know (she's the ICU expert at Apple)
but I don't know whether she reads this list.
The ICU Regex C API (the one I need to use for RegexKit, not the C++
one, which I haven't really looked at) is very multi-threading
unfriendly. Basically, the 'compiled' regex, the string being
matched, and the current match state are all wrapped up in the same
opaque compiled regex pointer.
Well, I'm pretty multi-threading unfriendly myself, so that hasn't
been a concern for me ;-)
But seriously, IIRC there is a way to cheaply clone an ICU regex
object, so you can compile it once and peel off a new copy for every
string you need to match. (I wrote, but never finished, a Cocoa ICU
wrapper before I left Apple, and I think that was my solution to the
state problem.)
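
A rough sketch of that compile-once, clone-per-string idea, written in Python only because it is easy to run. The `StatefulRegex` class and its method names are hypothetical stand-ins, not the ICU API; in ICU's C API the corresponding call is, as far as I know, `uregex_clone`. The point is only that the compiled pattern is shared while the text and match position live in each copy.

```python
import re


class StatefulRegex:
    """Hypothetical wrapper: a shared compiled pattern plus per-instance match state."""

    def __init__(self, pattern):
        # re.compile accepts a string or an already compiled pattern,
        # so clones reuse the compiled object instead of recompiling.
        self._compiled = re.compile(pattern)
        self._text = ""
        self._pos = 0

    def clone(self):
        # Cheap copy: same compiled pattern, fresh text and position.
        return StatefulRegex(self._compiled)

    def set_text(self, text):
        self._text = text
        self._pos = 0

    def find_next(self):
        m = self._compiled.search(self._text, self._pos)
        if m is None:
            return None
        # Step past the match; move one extra character after an empty match.
        self._pos = m.end() if m.end() > m.start() else m.end() + 1
        return m.group(0)


master = StatefulRegex(r"\d+")   # compile once
a = master.clone()               # one copy per string (or per thread)
b = master.clone()
a.set_text("10 20")
b.set_text("7")
```

Because `a` and `b` never share position state, interleaving calls to `find_next` on them does not disturb either search.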
RegexKit spends considerable effort in trying to get access to the
raw NSString buffer, to avoid unnecessary creation and destruction
of temporary buffers to perform a match.
This is definitely a concern. I suspect this is the major reason there
isn't an NSRegularExpression API yet; there's been talk of enhancing
the ICU regex API to make it more flexible in how it accepts strings;
but IMHO waiting for this is a case of "the best being the enemy of
the good".
PCRE only works with UTF-8 encoded strings, while ICU only works in
UTF-16. [...] most NSString buffers tend to be in a UTF-8
compatible form, allowing fast access by PCRE. Using ICU would
require creating a temporary UTF-16 buffer and converting into it for
most strings (again, usage dependent), only to free it right after use.
I looked into this once. CFStrings (and NSStrings) are stored in one
of two formats: (1) UTF-16, or (2) the "default C encoding". The
latter varies by what your current locale is, but it defaults to ...
MacRoman. [Yay for OS 9 compatibility! :P] This means that strings are
*never* stored in UTF-8 form, at least not in English-speaking
locales. (On the other hand, CFString is fairly smart about encodings,
so if the string is all-ASCII, it realizes that's compatible with
UTF-8 and can return the raw buffer if you ask for UTF-8.)
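
The all-ASCII case can be checked with a small Python sketch. It says nothing about CFString's actual internals; it only shows that a MacRoman buffer containing nothing but 7-bit bytes is byte-for-byte valid UTF-8, while a single accented character breaks that. The helper name `is_utf8_compatible` is made up for the illustration.

```python
def is_utf8_compatible(raw: bytes) -> bool:
    """True if every byte is 7-bit ASCII, so the buffer is also valid UTF-8."""
    return all(b < 0x80 for b in raw)


plain = "hello"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e

# Encode both strings the way a MacRoman-backed buffer would hold them.
plain_mac = plain.encode("mac_roman")
accented_mac = accented.encode("mac_roman")

# The ASCII-only buffer can be handed out as UTF-8 unchanged;
# the accented one needs a real conversion.
print(is_utf8_compatible(plain_mac), plain_mac == plain.encode("utf-8"))
print(is_utf8_compatible(accented_mac), accented_mac == accented.encode("utf-8"))
```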
In my limited experiments, most strings I looked at were being stored
in UTF-16. But it's heavily dependent on how the strings were created
and what characters they contain, so YMMV.
For example, Safari AdBlock (http://safariadblock.sourceforge.net/)
uses RegexKit as its regex matching engine. This involves a list of
about 500 regexes (depending on which adblock lists you've
subscribed to) that need to be executed for every URL.
Um, can't you merge those together into a single regex by joining them
together with "or" operators? (That's a fairly typical trick that
lexers use.)
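
A minimal Python sketch of that merging trick, using made-up filter patterns rather than anything from a real adblock list. Each pattern is wrapped in a non-capturing group so the alternation does not change its meaning. This assumes the individual patterns use no numbered backreferences or inline flags, which would need more care when combined.

```python
import re


def merge_patterns(patterns):
    """Join many patterns into one alternation, each in its own non-capturing group."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))


# Hypothetical filters, for illustration only.
filters = [r"/banner/", r"ads\.", r"doubleclick"]
merged = merge_patterns(filters)


def is_blocked(url):
    # One search over the URL instead of one search per filter.
    return merged.search(url) is not None
```

Whether one large alternation is actually faster than many small regexes depends on the engine, so it is worth measuring before committing to it.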
My zero-order read of ICU vs. PCRE on this issue is that they are
essentially equal. However, PCRE and ICU define 'word' and 'non-word'
(the regex escape sequences \w and \W), and consequently the
'(non-)word break' (escape sequences \b and \B), very differently.
Specifically, PCRE defines word and non-word in terms of ASCII
encoding ONLY, whereas ICU does not.
What you're saying is that they're essentially equal, except for
non-ASCII characters :)
ICU takes Unicode very, very seriously; that's its raison d'être. It's
the International Components for Unicode. Regexes are just one of the
things it does.
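
The practical difference is easy to see in a small Python sketch. Neither PCRE nor ICU is involved here: Python's `re.ASCII` flag stands in for an ASCII-only definition of \w, and Python's default Unicode matching stands in for a Unicode-aware one. The sample text is invented.

```python
import re

text = "na\u00efve caf\u00e9"  # "naive cafe" with a diaeresis and an acute accent

# ASCII-only \w: accented letters count as non-word characters,
# so each word is cut off or split at the accent.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Unicode-aware \w: accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf']
print(unicode_words)  # the two whole words, accents included
```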
Translated to: a positive look-behind (the character just before
this point in the regex must be a Unicode character) and a positive
look-ahead (the next character, without 'consuming' the input, must
not be a Unicode character). Definitely not as elegant, but I
suspect passable.
Nope. As I said, several languages (including Japanese) have word-
break rules that are more complex than this. Multiple words run
together without any non-word characters in between. You have to use
per-language heuristics to find the breaks. (My understanding is that
Thai is especially nasty, practically requiring the use of a
dictionary to tweeze apart the individual words.)
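
A small Python sketch of why a lookaround-based boundary falls short. The pattern below is my own formulation of the "word character behind, no word character ahead" idea, not the regex from the quoted message, and it uses Python's Unicode-aware \w. It finds both word ends in space-separated English but only one in a Japanese sentence, where several words run together with no separator.

```python
import re

# Zero-width "end of word": a word character behind, no word character ahead.
WORD_END = re.compile(r"(?<=\w)(?!\w)")


def word_ends(text):
    """Return the offsets where the lookaround-based boundary matches."""
    return [m.start() for m in WORD_END.finditer(text)]


english = "two words"
# "I am a student": several words, written without spaces.
japanese = "\u79c1\u306f\u5b66\u751f\u3067\u3059"

print(word_ends(english))   # [3, 9]
print(word_ends(japanese))  # only the end of the string is reported
```

Every character in the Japanese string is a letter, so no character-class test can see the internal breaks; that takes per-language rules or a dictionary.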
And as I said, this isn't just hypothetical. It became a Priority 1,
stop-the-presses bug for my project in 2005 as soon as the Japanese
testers started trying out the functionality that used PCRE and
discovered that it didn't work.
—Jens
Cocoa-dev mailing list (email@hidden)