Lists

Open Menu Close Menu

Terms and Conditions
Lists hosted on this site
Email the Postmaster
Tips for posting to public mailing lists

Re: regexkit [Using NSPredicate to parse strings]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: regexkit [Using NSPredicate to parse strings]

Subject: Re: regexkit [Using NSPredicate to parse strings]
From: John Engelhart <email@hidden>
Date: Tue, 4 Mar 2008 23:55:33 -0500

On Mar 4, 2008, at 12:50 PM, Jens Alfke wrote:

On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:
That is a seriously good framework, and the documentation is great too.
My only issue with regexkit is that it uses PCRE instead of ICU.


[disclosure: I am the author of RegexKit]

Hard to make everyone happy.  :)

PCRE has to be compiled into the library, making it larger (whereas ICU is already built into the OS.)

True, and not only compiled it, but compiled in four times for the four 10.5 architectures (ppc, ppc-64, i386, x86-64). The framework with all four architectures ends up clocking in at ~1.4MB, whereas just the 32 bit architectures comes in around 680KB. That's roughly 371KB per architecture, of which only about 35% (~130KB) is the PCRE library itself believe it or not. Everything is compiled '-Oz' (sic) and dead code stripped to squeeze it down as much as possible.

The latest 0.6.0 release of RegexKit put in motion a few API changes, one in particular that's relevant to this discussion is the "library:" argument to the method selectors. This is a forward looking change to support more than just the PCRE library, with the obvious candidate being the ICU library shipped with the system. I actually have ICU support hacked in to the in-development version of RegexKit, but I'm not particularly happy with it, so I'm not sure if I'm going to squeeze it in to the next release. Some of my concerns are:

o It's sort of ambiguous if the /usr/lib/libicucore library is 'supported' or not. I believe the general consensus is that it's not really there for public use, hence the missing headers, but it's also not verboten. Even light weight versions of the ICU library are several orders of magnitude larger than PCRE, libicucore is 6.5 megabytes for the 4 archs, or about 1.65MB per arch- more than the size of RegexKit for all archs.

o The ICU Regex C API (the one I need to use for RegexKit, not the C++ one, which I haven't really looked at) is very multi-threading unfriendly. Basically, the 'compiled' regex, the string being matched, and the current match state are all wrapped up in the same opaque compiled regex pointer. Since RegexKit is designed to be extremely multi-threading friendly, this presents a bit of a problem. Actually, quite a bit of a non-trivial problem, at least if you want it done in something you'd consider 'fast'. PCRE, in contrast, compiles a regex in to a stateless form, which can be used by any thread without any special caveats. RegexKit keeps this compiled form in a special multi-threading aware/safe cache, so a regex is only compiled once- the first time it's used, after that all threads use the cached form.

o RegexKit spends considerable effort in trying to get access to the raw NSString buffer, to avoid unnecessary creation and destruction of temporary buffers to perform a match. PCRE only works with UTF-8 encoded strings, while ICU only works in UTF-16. Though very heavily usage dependent, in practice (this coming from a north american, ASCII and English native speaker mind you) most NSStrings buffers tend to be in a UTF-8 compatible form, allowing fast access by PCRE. Using ICU would require the creation of, and conversion to UTF-16 for most strings (again, usage dependent), only to be released/freed right after use.

I tend to find that people typically don't use a regex as a simple, one shot operation. They tend to be used repeatedly, often, and a lot. Examples are text processing and performing matches on every line in a file, performing several regex operations per line. For example, Safari AdBlock (http://safariadblock.sourceforge.net/) uses RegexKit as its regex matching engine. This involves a list of about 500 regexes (depending on which adblock lists you've subscribed to) that need to be executed for every URL. A typical web page can contain 50 - 100 URL's, and the worst case is 'Not matched by any of the regexes', requiring all of them to be evaluated. That's 25,000 to 50,000 regex matches per web page for the worst case. This is why AdBlock in Firefox can result in a performance hit, but in my testing Safari AdBlock actually results in pages loading faster even with all the extra work.

I put quite a bit of effort in to making this kind of matching 'go fast', which includes sorting the regexes in the list (NSArray or NSSet) by the number of times each one has successfully matched, trying the ones that have matched the most first. This 'auto- optimizes' the order in which the regexes are tried. It also keeps a small LRU cache of 'misses', so if a URL appears more than once but wasn't blocked, it can bypass the whole regex evaluation process. On top of all of this, in keeping with the multi-threading friendly nature, it automagically parallelizes the matching process- one match attempt per CPU since each match is independent of all the others. It's also simple to tap in to this high speed matching: [@"string to check" isMatchedByAnyRegexInArray:arrayOfRegexes], along with a few other methods.

PCRE is also, last I checked, less I18N-savvy than ICU. This has given me grief in the past; I used PRCE-based regex code in a project 3 years ago, and as soon as the Japanese and Korean testers started working with it, they found that the app's text searching didn't work correctly for them. (In a nutshell, PCRE's notion of "alphabetic characters" and "word breaks" only works for Roman writing systems.)

Again, I'm a native English only speaker, so I will happily defer to just about anyone else on these points. My zero-order approximation read on the ICU vs. PCRE on this issue leads me to think that they are essentially equal. However, PCRE and ICU define 'word' and 'non- word' (the regex escape sequence \w and \W), and consequently the '(non-)word break' (escape sequence \b and \B) very differently. Specifically, PCRE defines word and non-word in terms of ASCII encoding ONLY, whereas ICU does not (though I can't find a handy link to the exact definition used by ICU, it's obviously more than just ASCII). I suspect the PCRE definition is rooted in compatibility concerns.

However, I suspect that the functional equivalent could be constructed using the \p{} and \P{} (\P being the "with out" version the \p "with" form). I suspect an ICU \w analog in PCRE would be \p{L} (match a Unicode Letter), and \b could be simulated with something like:

(?<=\p{L})(?=\P{L})

Translated to: A positive look-behind (the character just before this point in the regex) must be a Unicode Character and a positive look- ahead (the next character, without 'consuming' the input, must not be a unicode character). Definitely not as elegant, but I suspect passable.

As an aside, RegexKit includes the PCRE build time options for UTF-8 Unicode and Unicode properties. Since Foundation uses UTF-16 as its abstract representation for all strings, and thus their NSRange values, and PCRE uses UTF-8 as its representation, RegexKit tries to do 'the right thing' and automatically hide the conversion details for things such as NSRange where appropriate. As a rule of thumb, if it's Foundation, RegexKit takes as input and provides UTF-16 based indexes and NSRanges, but uses UTF-8 if matching against a raw byte buffer (typically only the RKRegex class does this). So, all the NSString category additions use Foundations native UTF-16 representation, allowing seamless interoperability with other NSString methods that use NSRange values.

Also along the i18n train of thought, I started a big push for localization for RegexKit 0.6.0. This centered around adding "error:" arguments to methods so they could return standard NSError objects, and localizing all the strings RegexKit uses. In addition to this, I re-wrote the text of all the PCRE error strings so they were up to "Aqua UI" standards while at the same time putting them in to .strings localized form. I obviously can't localize the strings, but it should make it easier for anyone who does have to use RegexKit and requires localized regex error strings. The goal is to make it a simple matter of handing off the returned NSError object to whatever alert display mechanism you want and get a high quality error dialog. This is a work in progress right now, however, as it required updating a lot of the API to return NSError objects (which, in hindsight, I should have done in the first place. Live and learn.)

Unfortunately I don't know of a comparable Cocoa regex library that uses ICU. (NSPredicate does, but its support for regexes is very limited, as already discussed in this thread.)

RegexKit will likely include this in the future, though I won't promise the next release. Right now I'm not happy with the hackery required to wedge it in to RegexKit. For those interested, Oniguruma is probably right out. It does not seem to be terribly multi- threading friendly, with the easiest solution to "giant lock" all calls to the library. Not exactly compatible with the goals of RegexKit. _______________________________________________

Cocoa-dev mailing list (email@hidden)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden


Follow-Ups:

Re: regexkit [Using NSPredicate to parse strings]
From: Jens Alfke <email@hidden>


References:  
  >Using NSPredicate to parse strings (From: Jonathan Dann <email@hidden>)
  >Re: Using NSPredicate to parse strings (From: Mike Abdullah <email@hidden>)
  >Re: Using NSPredicate to parse strings (From: Jonathan Dann <email@hidden>)
  >Re: Using NSPredicate to parse strings (From: Dave Camp <email@hidden>)
  >Re: Using NSPredicate to parse strings (From: Jonathan Dann <email@hidden>)
  >Re: regexkit [Using NSPredicate to parse strings] (From: Jens Alfke <email@hidden>)




Prev by Date:
Re: Undo and Core Data issue

Next by Date:
CATiledLayer too slow to update

Previous by thread:
Re: regexkit [Using NSPredicate to parse strings]

Next by thread:
Re: regexkit [Using NSPredicate to parse strings]

Index(es):

Date
Thread