• Open Menu Close Menu
  • Apple
  • Shopping Bag
  • Apple
  • Mac
  • iPad
  • iPhone
  • Watch
  • TV
  • Music
  • Support
  • Search apple.com
  • Shopping Bag

Lists

Open Menu Close Menu
  • Terms and Conditions
  • Lists hosted on this site
  • Email the Postmaster
  • Tips for posting to public mailing lists
Re: regexkit [Using NSPredicate to parse strings]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: regexkit [Using NSPredicate to parse strings]


  • Subject: Re: regexkit [Using NSPredicate to parse strings]
  • From: John Engelhart <email@hidden>
  • Date: Wed, 5 Mar 2008 14:48:21 -0500


On Mar 5, 2008, at 1:03 AM, Jens Alfke wrote:

On 4 Mar '08, at 8:55 PM, John Engelhart wrote:

The ICU Regex C API (the one I need to use for RegexKit, not the C+ + one, which I haven't really looked at) is very multi-threading unfriendly. Basically, the 'compiled' regex, the string being matched, and the current match state are all wrapped up in the same opaque compiled regex pointer.

Well, I'm pretty multi-threading unfriendly myself, so that hasn't been a concern for me ;-)
But seriously, IIRC there is a way to cheaply clone an ICU regex object, so you can compile it once and peel off a new copy for every string you need to match. (I wrote, but never finished, a Cocoa ICU wrapper before I left Apple, and I think that was my solution to the state problem.)

Yes, the uregex_clone() function. The way I'm probably going to tackle it is to create the 'master' compiled regex and then each thread will demand populate its thread local cache by cloning the master one. Not terribly pretty.


For example, Safari AdBlock (http://safariadblock.sourceforge.net/) uses RegexKit as its regex matching engine. This involves a list of about 500 regexes (depending on which adblock lists you've subscribed to) that need to be executed for every URL.

Um, can't you merge those together into a single regex by joining them together with "or" operators? (That's a fairly typical trick that lexers use.)

In theory, yes. In practice, no. It all comes down to the details of how the regex engine performs its matching. For regex engines that use a NFA matching system, your suggestion is deathly slow. If the regex engine uses a DFA matching system, then your suggestion is the right one, allowing essentially constant execution time (dependent on the length of the string to match) irregardless of the number of individual regexes you've joined together. The older AT&T `lex` and the newer `flex` use a DFA matching system, but as a general rule of thumb general purpose regex matching engines use a NFA matching system. It's a classic time/space/efficiency trade-off, the 'regex program' for a NFA system are easier to construct and much smaller than their DFA counterparts. Most DFA matchers usually build an NFA system first, then perform the expensive NFA -> DFA conversion step.


I did try what you suggested just to see how PCRE would respond (something like a componentsJoinedByString:@"|"). After about 10-15 seconds of waiting, it was obvious that it was a solution that was not going to work. :)

My zero-order approximation read on the ICU vs. PCRE on this issue leads me to think that they are essentially equal. However, PCRE and ICU define 'word' and 'non-word' (the regex escape sequence \w and \W), and consequently the '(non-)word break' (escape sequence \b and \B) very differently. Specifically, PCRE defines word and non-word in terms of ASCII encoding ONLY, whereas ICU does not

What you're saying is that they're essentially equal, except for non- ascii characters :)


ICU takes Unicode very, very seriously; that's its raison d'ĂȘtre. It's the International Components for Unicode. Regexes are just one of the things it does.

Translated to: A positive look-behind (the character just before this point in the regex) must be a Unicode Character and a positive look-ahead (the next character, without 'consuming' the input, must not be a unicode character). Definitely not as elegant, but I suspect passable.

Nope. As I said, several languages (including Japanese) have word- break rules that are more complex than this. Multiple words run together without any non-word characters in between. You have to use per-language heuristics to find the breaks. (My understanding is that Thai is especially nasty, practically requiring the use of a dictionary to tweeze apart the individual words.)

Ah, I did some digging, and I think you're referring to the the ICU UREGEX_UWORD regex compile time option for "enhanced Unicode word boundaries". From the ICU header:


UREGEX_UWORD = 256 // Unicode word boundaries. If set, \b uses the Unicode TR 29 definition of word boundaries. Warning: Unicode word boundaries are quite different from traditional regular expression word boundaries. See http://unicode.org/reports/tr29/#Word_Boundaries

Even this enhanced functionality seems to be a "better, but still crude" approximation for true word breaking. The ICU Regex documentation seems to recommend that if you want proper word breaking, you use the actual ICU word breaking system, and not regexes, precisely for the reasons you outlined above. PCRE does not have this optional enhanced \b behavior, though I think the 'simpler, traditional' \b behavior can still be approximated with \p{L} like matching, but this would obviously fall far short of what's listed in the Unicode TR29 recommendation. Looking at the recommendation, though, I can't help but think that all that extra Unicode property matching would make using the enhanced \b functionality a huge speed penalty. Though I can easily see how it's in the "it doesn't matter how slow it is if you need it" variety of trade-offs.

Obviously not a 'simple' problem.  :)


And as I said, this isn't just hypothetical. It became a Priority 1, stop-the-presses bug for my project in 2005 as soon as the Japanese testers started trying out the functionality that used PCRE and discovered that it didn't work.

Like I said, "it doesn't matter how slow it is if you need it". :) I suspect you could probably hand hack a regex to emulate the Unicode TR29 behavior in PCRE, but that would be one ugly regex._______________________________________________


Cocoa-dev mailing list (email@hidden)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


References: 
 >Using NSPredicate to parse strings (From: Jonathan Dann <email@hidden>)
 >Re: Using NSPredicate to parse strings (From: Mike Abdullah <email@hidden>)
 >Re: Using NSPredicate to parse strings (From: Jonathan Dann <email@hidden>)
 >Re: Using NSPredicate to parse strings (From: Dave Camp <email@hidden>)
 >Re: Using NSPredicate to parse strings (From: Jonathan Dann <email@hidden>)
 >Re: regexkit [Using NSPredicate to parse strings] (From: Jens Alfke <email@hidden>)
 >Re: regexkit [Using NSPredicate to parse strings] (From: John Engelhart <email@hidden>)
 >Re: regexkit [Using NSPredicate to parse strings] (From: Jens Alfke <email@hidden>)

  • Prev by Date: Re: KVO autonotifying complaining about custom setter return value
  • Next by Date: Re: Core Data, SQLite, and Housekeeping
  • Previous by thread: Re: regexkit [Using NSPredicate to parse strings]
  • Next by thread: Coredata - populating initial data on app load - best practices
  • Index(es):
    • Date
    • Thread