Re: regexkit [Using NSPredicate to parse strings]
Re: regexkit [Using NSPredicate to parse strings]
- Subject: Re: regexkit [Using NSPredicate to parse strings]
- From: John Engelhart <email@hidden>
- Date: Tue, 4 Mar 2008 23:55:33 -0500
On Mar 4, 2008, at 12:50 PM, Jens Alfke wrote:
On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:
That is a seriously good framework, and the documentation is great
too.
My only issue with regexkit is that it uses PCRE instead of ICU.
[disclosure: I am the author of RegexKit]
Hard to make everyone happy. :)
PCRE has to be compiled into the library, making it larger (whereas
ICU is already built into the OS.)
True, and not only compiled it, but compiled in four times for the
four 10.5 architectures (ppc, ppc-64, i386, x86-64). The framework
with all four architectures ends up clocking in at ~1.4MB, whereas
just the 32 bit architectures comes in around 680KB. That's roughly
371KB per architecture, of which only about 35% (~130KB) is the PCRE
library itself believe it or not. Everything is compiled '-Oz' (sic)
and dead code stripped to squeeze it down as much as possible.
The latest 0.6.0 release of RegexKit put in motion a few API changes,
one in particular that's relevant to this discussion is the "library:"
argument to the method selectors. This is a forward looking change to
support more than just the PCRE library, with the obvious candidate
being the ICU library shipped with the system. I actually have ICU
support hacked in to the in-development version of RegexKit, but I'm
not particularly happy with it, so I'm not sure if I'm going to
squeeze it in to the next release. Some of my concerns are:
o It's sort of ambiguous if the /usr/lib/libicucore library is
'supported' or not. I believe the general consensus is that it's not
really there for public use, hence the missing headers, but it's also
not verboten. Even light weight versions of the ICU library are
several orders of magnitude larger than PCRE, libicucore is 6.5
megabytes for the 4 archs, or about 1.65MB per arch- more than the
size of RegexKit for all archs.
o The ICU Regex C API (the one I need to use for RegexKit, not the C++
one, which I haven't really looked at) is very multi-threading
unfriendly. Basically, the 'compiled' regex, the string being
matched, and the current match state are all wrapped up in the same
opaque compiled regex pointer. Since RegexKit is designed to be
extremely multi-threading friendly, this presents a bit of a problem.
Actually, quite a bit of a non-trivial problem, at least if you want
it done in something you'd consider 'fast'. PCRE, in contrast,
compiles a regex in to a stateless form, which can be used by any
thread without any special caveats. RegexKit keeps this compiled form
in a special multi-threading aware/safe cache, so a regex is only
compiled once- the first time it's used, after that all threads use
the cached form.
o RegexKit spends considerable effort in trying to get access to the
raw NSString buffer, to avoid unnecessary creation and destruction of
temporary buffers to perform a match. PCRE only works with UTF-8
encoded strings, while ICU only works in UTF-16. Though very heavily
usage dependent, in practice (this coming from a north american, ASCII
and English native speaker mind you) most NSStrings buffers tend to be
in a UTF-8 compatible form, allowing fast access by PCRE. Using ICU
would require the creation of, and conversion to UTF-16 for most
strings (again, usage dependent), only to be released/freed right
after use.
I tend to find that people typically don't use a regex as a simple,
one shot operation. They tend to be used repeatedly, often, and a
lot. Examples are text processing and performing matches on every
line in a file, performing several regex operations per line. For
example, Safari AdBlock (http://safariadblock.sourceforge.net/) uses
RegexKit as its regex matching engine. This involves a list of about
500 regexes (depending on which adblock lists you've subscribed to)
that need to be executed for every URL. A typical web page can
contain 50 - 100 URL's, and the worst case is 'Not matched by any of
the regexes', requiring all of them to be evaluated. That's 25,000 to
50,000 regex matches per web page for the worst case. This is why
AdBlock in Firefox can result in a performance hit, but in my testing
Safari AdBlock actually results in pages loading faster even with all
the extra work.
I put quite a bit of effort in to making this kind of matching 'go
fast', which includes sorting the regexes in the list (NSArray or
NSSet) by the number of times each one has successfully matched,
trying the ones that have matched the most first. This 'auto-
optimizes' the order in which the regexes are tried. It also keeps a
small LRU cache of 'misses', so if a URL appears more than once but
wasn't blocked, it can bypass the whole regex evaluation process. On
top of all of this, in keeping with the multi-threading friendly
nature, it automagically parallelizes the matching process- one match
attempt per CPU since each match is independent of all the others.
It's also simple to tap in to this high speed matching: [@"string to
check" isMatchedByAnyRegexInArray:arrayOfRegexes], along with a few
other methods.
PCRE is also, last I checked, less I18N-savvy than ICU. This has
given me grief in the past; I used PRCE-based regex code in a
project 3 years ago, and as soon as the Japanese and Korean testers
started working with it, they found that the app's text searching
didn't work correctly for them. (In a nutshell, PCRE's notion of
"alphabetic characters" and "word breaks" only works for Roman
writing systems.)
Again, I'm a native English only speaker, so I will happily defer to
just about anyone else on these points. My zero-order approximation
read on the ICU vs. PCRE on this issue leads me to think that they are
essentially equal. However, PCRE and ICU define 'word' and 'non-
word' (the regex escape sequence \w and \W), and consequently the
'(non-)word break' (escape sequence \b and \B) very differently.
Specifically, PCRE defines word and non-word in terms of ASCII
encoding ONLY, whereas ICU does not (though I can't find a handy link
to the exact definition used by ICU, it's obviously more than just
ASCII). I suspect the PCRE definition is rooted in compatibility
concerns.
However, I suspect that the functional equivalent could be constructed
using the \p{} and \P{} (\P being the "with out" version the \p "with"
form). I suspect an ICU \w analog in PCRE would be \p{L} (match a
Unicode Letter), and \b could be simulated with something like:
(?<=\p{L})(?=\P{L})
Translated to: A positive look-behind (the character just before this
point in the regex) must be a Unicode Character and a positive look-
ahead (the next character, without 'consuming' the input, must not be
a unicode character). Definitely not as elegant, but I suspect
passable.
As an aside, RegexKit includes the PCRE build time options for UTF-8
Unicode and Unicode properties. Since Foundation uses UTF-16 as its
abstract representation for all strings, and thus their NSRange
values, and PCRE uses UTF-8 as its representation, RegexKit tries to
do 'the right thing' and automatically hide the conversion details for
things such as NSRange where appropriate. As a rule of thumb, if it's
Foundation, RegexKit takes as input and provides UTF-16 based indexes
and NSRanges, but uses UTF-8 if matching against a raw byte buffer
(typically only the RKRegex class does this). So, all the NSString
category additions use Foundations native UTF-16 representation,
allowing seamless interoperability with other NSString methods that
use NSRange values.
Also along the i18n train of thought, I started a big push for
localization for RegexKit 0.6.0. This centered around adding "error:"
arguments to methods so they could return standard NSError objects,
and localizing all the strings RegexKit uses. In addition to this, I
re-wrote the text of all the PCRE error strings so they were up to
"Aqua UI" standards while at the same time putting them in to .strings
localized form. I obviously can't localize the strings, but it should
make it easier for anyone who does have to use RegexKit and requires
localized regex error strings. The goal is to make it a simple matter
of handing off the returned NSError object to whatever alert display
mechanism you want and get a high quality error dialog. This is a
work in progress right now, however, as it required updating a lot of
the API to return NSError objects (which, in hindsight, I should have
done in the first place. Live and learn.)
Unfortunately I don't know of a comparable Cocoa regex library that
uses ICU. (NSPredicate does, but its support for regexes is very
limited, as already discussed in this thread.)
RegexKit will likely include this in the future, though I won't
promise the next release. Right now I'm not happy with the hackery
required to wedge it in to RegexKit. For those interested, Oniguruma
is probably right out. It does not seem to be terribly multi-
threading friendly, with the easiest solution to "giant lock" all
calls to the library. Not exactly compatible with the goals of
RegexKit.
_______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden