Re: regexkit [Using NSPredicate to parse strings]
- Subject: Re: regexkit [Using NSPredicate to parse strings]
- From: John Engelhart <email@hidden>
- Date: Wed, 5 Mar 2008 14:48:21 -0500
On Mar 5, 2008, at 1:03 AM, Jens Alfke wrote:
> On 4 Mar '08, at 8:55 PM, John Engelhart wrote:
>> The ICU Regex C API (the one I need to use for RegexKit, not the
>> C++ one, which I haven't really looked at) is very multi-threading
>> unfriendly. Basically, the 'compiled' regex, the string being
>> matched, and the current match state are all wrapped up in the same
>> opaque compiled regex pointer.
> Well, I'm pretty multi-threading unfriendly myself, so that hasn't
> been a concern for me ;-)
> But seriously, IIRC there is a way to cheaply clone an ICU regex
> object, so you can compile it once and peel off a new copy for every
> string you need to match. (I wrote, but never finished, a Cocoa ICU
> wrapper before I left Apple, and I think that was my solution to the
> state problem.)
Yes, the uregex_clone() function. The way I'm probably going to
tackle it is to create the 'master' compiled regex and then have each
thread demand-populate its thread-local cache by cloning the master
one. Not terribly pretty.
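
For the archives, a rough sketch of that approach against the ICU C
API (this is not RegexKit's actual code; the pattern, the helper
names, and the lack of a real thread-local cache are all just for
illustration):

/* Compile the 'master' regex once, then have each thread peel off its
   own stateful copy with uregex_clone().  A real implementation would
   stash the clone in thread-local storage instead of cloning per call. */
#include <unicode/uregex.h>
#include <unicode/ustring.h>

static URegularExpression *masterRegex = NULL;  /* compiled once, shared read-only */

void compileMaster(void) {
    UErrorCode status = U_ZERO_ERROR;
    UChar pattern[64];
    u_uastrcpy(pattern, "\\w+");                /* illustrative pattern */
    masterRegex = uregex_open(pattern, -1, 0, NULL, &status);
}

int threadMatches(const UChar *text, int32_t length) {
    UErrorCode status = U_ZERO_ERROR;
    URegularExpression *local = uregex_clone(masterRegex, &status);
    uregex_setText(local, text, length, &status);
    UBool found = uregex_find(local, 0, &status);
    uregex_close(local);
    return (U_SUCCESS(status) && found) ? 1 : 0;
}
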
>> For example, Safari AdBlock (http://safariadblock.sourceforge.net/)
>> uses RegexKit as its regex matching engine. This involves a list
>> of about 500 regexes (depending on which adblock lists you've
>> subscribed to) that need to be executed for every URL.
> Um, can't you merge those together into a single regex by joining
> them together with "or" operators? (That's a fairly typical trick
> that lexers use.)
In theory, yes. In practice, no. It all comes down to the details of
how the regex engine performs its matching. For regex engines that
use an NFA matching system, your suggestion is deathly slow. If the
regex engine uses a DFA matching system, then your suggestion is the
right one, allowing essentially constant execution time (dependent
only on the length of the string to match) regardless of the number
of individual regexes you've joined together. The older AT&T `lex`
and the newer `flex` use a DFA matching system, but as a general rule
of thumb, general-purpose regex matching engines use an NFA matching
system. It's a classic time/space/efficiency trade-off: the 'regex
programs' for an NFA system are easier to construct and much smaller
than their DFA counterparts. Most DFA matchers build an NFA first,
then perform the expensive NFA -> DFA conversion step.
I did try what you suggested just to see how PCRE would respond
(something like a componentsJoinedByString:@"|"). After about 10-15
seconds of waiting, it was obvious that solution was not going to
work. :)
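
In C against the classic PCRE API, the experiment amounted to
something like this (a sketch only; the patterns and buffer sizes are
made up, and the real test went through RegexKit's Objective-C
interface):

/* Join the individual patterns into one big alternation and compile it.
   With ~500 real adblock patterns an NFA engine has to backtrack through
   the alternatives for every URL -- the 'deathly slow' case above. */
#include <pcre.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *patterns[] = { "ads\\.example\\.com", "doubleclick", "banner[0-9]+" };
    const int count = sizeof(patterns) / sizeof(patterns[0]);

    char joined[4096] = "";
    for (int i = 0; i < count; i++) {
        if (i > 0) strcat(joined, "|");
        strcat(joined, "(?:");
        strcat(joined, patterns[i]);
        strcat(joined, ")");
    }

    const char *error;
    int erroffset;
    pcre *re = pcre_compile(joined, 0, &error, &erroffset, NULL);
    if (re == NULL) {
        fprintf(stderr, "compile failed at %d: %s\n", erroffset, error);
        return 1;
    }

    const char *url = "http://ads.example.com/banner42.gif";
    int ovector[30];
    int rc = pcre_exec(re, NULL, url, (int)strlen(url), 0, 0, ovector, 30);
    printf("match: %s\n", rc >= 0 ? "yes" : "no");
    pcre_free(re);
    return 0;
}
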
>> My zero-order approximation read on the ICU vs. PCRE on this issue
>> leads me to think that they are essentially equal. However, PCRE
>> and ICU define 'word' and 'non-word' (the regex escape sequences \w
>> and \W), and consequently the '(non-)word break' (escape sequences
>> \b and \B), very differently. Specifically, PCRE defines word and
>> non-word in terms of ASCII encoding ONLY, whereas ICU does not.
> What you're saying is that they're essentially equal, except for
> non-ASCII characters :)
> ICU takes Unicode very, very seriously; that's its raison d'ĂȘtre.
> It's the International Components for Unicode. Regexes are just one
> of the things it does.
>> Translated to: A positive look-behind (the character just before
>> this point in the regex) must be a Unicode character, and a positive
>> look-ahead (the next character, without 'consuming' the input) must
>> not be a Unicode character. Definitely not as elegant, but I
>> suspect passable.
> Nope. As I said, several languages (including Japanese) have
> word-break rules that are more complex than this. Multiple words run
> together without any non-word characters in between. You have to use
> per-language heuristics to find the breaks. (My understanding is
> that Thai is especially nasty, practically requiring the use of a
> dictionary to tweeze apart the individual words.)
Ah, I did some digging, and I think you're referring to the ICU
UREGEX_UWORD regex compile-time option for "enhanced Unicode word
boundaries". From the ICU header:
UREGEX_UWORD = 256   // Unicode word boundaries. If set, \b uses the
Unicode TR 29 definition of word boundaries. Warning: Unicode word
boundaries are quite different from traditional regular expression
word boundaries. See http://unicode.org/reports/tr29/#Word_Boundaries
Even this enhanced functionality seems to be a "better, but still
crude" approximation for true word breaking. The ICU Regex
documentation seems to recommend that if you want proper word
breaking, you use the actual ICU word-breaking system, and not
regexes, precisely for the reasons you outlined above. PCRE does not
have this optional enhanced \b behavior, though I think the 'simpler,
traditional' \b behavior can still be approximated with \p{L}-like
matching; that would obviously fall far short of what's listed in
the Unicode TR29 recommendation. Looking at the recommendation,
though, I can't help but think that all that extra Unicode property
matching would make the enhanced \b functionality carry a huge speed
penalty. Though I can easily see how it's in the "it doesn't matter
how slow it is if you need it" variety of trade-offs.
Obviously not a 'simple' problem. :)
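
If anyone wants to experiment with it, turning on the TR29 behavior
is just a compile-time flag in the ICU C API; a minimal sketch
(pattern and text invented for illustration):

/* Compile with UREGEX_UWORD so \b follows the UAX #29 / TR29 word
   boundary definition rather than the traditional \w / \W based rule. */
#include <unicode/uregex.h>
#include <unicode/ustring.h>
#include <stdio.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    UChar pattern[32], text[64];
    u_uastrcpy(pattern, "\\bfoo\\b");
    u_uastrcpy(text, "foo bar");

    URegularExpression *re = uregex_open(pattern, -1, UREGEX_UWORD, NULL, &status);
    uregex_setText(re, text, -1, &status);
    UBool found = uregex_find(re, 0, &status);
    printf("found: %d (%s)\n", (int)found, u_errorName(status));
    uregex_close(re);
    return 0;
}
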
> And as I said, this isn't just hypothetical. It became a Priority 1,
> stop-the-presses bug for my project in 2005 as soon as the Japanese
> testers started trying out the functionality that used PCRE and
> discovered that it didn't work.
Like I said, "it doesn't matter how slow it is if you need it". :) I
suspect you could probably hand hack a regex to emulate the Unicode
TR29 behavior in PCRE, but that would be one ugly regex.