Re: NSString and regular expressions
- Subject: Re: NSString and regular expressions
- From: John Engelhart <email@hidden>
- Date: Fri, 31 Jul 2009 00:42:38 -0400
On Thu, Jul 30, 2009 at 8:04 PM, BareFeet <email@hidden> wrote:
> Hi John and all,
>
> You might want to look at AGRegex which is very compact (one class) and
>>> which uses PCRE:
>>>
>>> http://colloquy.info/project/browser/trunk/Frameworks/AGRegex
>>>
>>>
>> Of note, Colloquy appears to have switched to RegexKitLite itself:
>>
>> http://svn.colloquy.info/project/changeset/4301
>
>
Just to be clear, I'm the author of RegexKitLite (and RegexKit.framework).
I just like to be up front about that so you can apply whatever amount of
bias filtering you want to any claims or statements I make.
> <http://svn.colloquy.info/project/changeset/4301>
>
> I did notice that log entry, but thought it was never acted upon (ie they
> are still using AGRegex).
I can't say I did any kind of exhaustive check, but I was under the
impression that they had definitely switched over. I even got a bug report
from them.
>
> RegexKitLite looks promising. It claims to only require you to add the .h
> and .m file to your project and link to the libicucore.dylib library.
>
> The documentation notes: "Warning: Apple does not officially support
> linking to the libicucore.dylib library." In reality, how worried should I
> be about this? I am amazed that Cocoa doesn't provide regex itself. Surely
> Apple must provide or recommend something to do the job.
As for linking to libicucore.dylib, that's kind of a grey area. I try to
be as up front as possible about that fact in the documentation. What
follows is my opinion and carries no official weight. So far as I know it's
an accurate representation of the facts, and I've tried to keep it
objective:
The shared library that causes the controversy is /usr/lib/libicucore.dylib.
I've searched the documentation and I could find nothing that explicitly
forbids linking against it, or anything else in /usr/lib. If one subscribes
to the common unix traditions, the /usr/lib directory is generally
considered "fair game" for linking against: it is one of the common
locations for a system's publicly available shared libraries. By placing a
library in /usr/lib, one implicitly declares it "publicly available".
The next stumbling block is the need for headers. A default install of Mac
OS X does not include the ICU headers one would normally need to make use of
the ICU library. However, the ICU project is an open source project, so one
can (easily?) assemble a suitable set of headers if one is so inclined. Not
only that, but Apple provides a tarball of their branch of ICU that is used
to build the binaries that are present on every Mac OS X system.
Furthermore, that tarball's makefile includes a target to install the ICU
headers on your system. Although they are a bit convoluted to actually get,
Apple does publicly provide the headers for the ICU library. See
http://www.opensource.apple.com/tarballs/ICU/ for the tarballs.
After that, the next criterion is whether or not the API is documented. It's
safe to say that the ICU API is documented, although not by Apple. Apple
actually refers to the ICU documentation in certain parts of its official
documentation (NSPredicate, regarding regular expressions and the MATCHES
operator).
So, it comes down to a matter of opinion and a judgement call. Considering
how easy it is to create a location in the file system that makes it clear
that the shared libraries within are private, I'm of the opinion that the
/usr/lib/libicucore.dylib file is definitely in the public category. Even
private frameworks have their own slice in
/System/Library/PrivateFrameworks, which makes it pretty clear that the
contents within are off-limits. Even within public frameworks there is the
PrivateHeaders folder for non-public API information.
Up next is whether or not the lack of headers makes the library "private".
If this were a proprietary library, I'd probably lean towards "makes it
private". However, it's a publicly available open-source project, so it
becomes a little more grey. The fact that Apple publicly provides
everything needed to build an exact copy of the version of ICU that's
shipped with the system, along with the ability to install the headers,
makes it really grey. Personally, I'm inclined to say that it's in the
"not private"
category. I think it's fair to say that the "undocumented API" clause
doesn't apply.
Finally, I'm not aware of any official decrees that explicitly make
/usr/lib/libicucore.dylib a "private API". What advice has come from
Apple has been extremely ambiguous, usually with a caveat along the lines of
"this may not be officially supported".
From a purely pragmatic perspective, it makes a lot of sense for Apple to
provide the headers and make it an "Official, Public API". First and
foremost is consistency for applications: it removes the need for every
developer to duplicate work that's already been done and to fill their
.app/ distribution with yet another copy of a (rather large) shared library.
Another big plus comes from a security point of view: if a problem is
found in the ICU library, Apple can provide an updated shared library with
the fix and every single application that links against it is automatically
'patched'. That's a fairly compelling reason all by itself.
Moving on to why Apple doesn't provide this functionality: well, I don't
work for Apple, so this is nothing but raw speculation based on snippets of
public postings. It's my understanding that one of the big stumbling blocks
has been the fact that the ICU regex engine can only match text that is
encoded as UTF-16. NSString (or, more correctly, CFString) keeps its
internal (normal warnings about the internal, private details of an object's
implementation apply) buffer of a string's contents in either an 8-bit format
or UTF-16. The 8-bit format is normally MacOSRoman, which is a superset of
ASCII. An awful lot of strings can be encoded as MacOSRoman, and such a
string takes up half the space of its UTF-16 equivalent (1 byte per character
vs nominally 2 bytes per character). So, there's a bit of an impedance
mismatch, since the ideal situation would be for the ICU regex engine to be
encoding agnostic with respect to the text it's searching. RegexKitLite
dodges this bullet by keeping a cache of the most recent UTF-16 conversion
for a string, if one was even needed (if a string's backing buffer is already
UTF-16 encoded, it just uses that directly). This works out well for the
majority of use cases.
The usual caveats regarding caching apply: Caching works by exploiting
temporal locality of typical usage patterns- usage patterns that exceed the
"working set" capacity of a cache can cause a dramatic drop in performance.
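To make the caching caveat concrete, here is a minimal sketch in Python. It is
not RegexKitLite's actual code: the class name Utf16Cache, its capacity
parameter, and the least-recently-used eviction policy are all assumptions
chosen for illustration. It shows the 1-byte vs 2-byte storage difference and
what happens when the set of strings in use is larger than the cache.

```python
from collections import OrderedDict


class Utf16Cache:
    """Hypothetical LRU cache of UTF-16 conversions (illustration only)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # text -> UTF-16 bytes, oldest first
        self.hits = 0
        self.misses = 0

    def utf16(self, text):
        """Return the UTF-16 (little-endian) bytes for text, converting on a miss."""
        if text in self.entries:
            self.entries.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self.entries[text]
        self.misses += 1
        encoded = text.encode("utf-16-le")
        self.entries[text] = encoded
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return encoded


# Storage cost: one byte per character in MacRoman, two in UTF-16 here.
sample = "café"
macroman_size = len(sample.encode("mac_roman"))
utf16_size = len(sample.encode("utf-16-le"))

# Working set fits in the cache: only the first use of each string converts.
fits = Utf16Cache(capacity=2)
for s in ["a", "b"] * 3:
    fits.utf16(s)

# Working set is one larger than the cache: every lookup converts again.
thrash = Utf16Cache(capacity=2)
for s in ["a", "b", "c"] * 3:
    thrash.utf16(s)
```

With this eviction policy, cycling through one more string than the cache can
hold means every lookup is a miss, which is the "dramatic drop" case described
above.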
>
> As quoted earlier:
>
> Unfortunately, RegexKit Lite (the stripped-down version) uses the built-in
>>> ICU library which uses a syntax quite different to the PCRE that most people
>>> are used to.
>>>
>>
> At first glance through the "ICU Syntax" documentation included with
> RegexKitLite, it appears the same as what I'm used to. At least it supports
> \s for whitespace, \w for words, (?=...) for look ahead. I did, however,
> discover:
>
> Single Quote
>> Two single quotes represent a single quote, either inside or outside
>> single quotes. Text within single quotes is not interpreted in any way,
>> except for two adjacent single quotes. It is taken as literal text— special
>> characters become non-special. These quoting conventions for ICU character
>> classes differ from those of Perl or Java. In those environments, single
>> quotes have no special meaning, and are treated like any other literal
>> character.
>>
>
> I guess I can deal with that.
>
> Has anyone discovered any other issues (or had successes) dealing with ICU
> syntax in RegexKitLite and RegexKitLite in general?
In general, I've found the ICU and PCRE regex syntax to be essentially
identical, at least for the most commonly used regex features. About the
only thing I really miss from PCRE is "named capture patterns". Off the top
of my head, these are features in PCRE that aren't in ICU:
- Named captures (and, by extension, named back-references).
- Recursive and conditional patterns, along with pattern subroutines.
- PCRE's more elaborate backtracking control.
- A handful of not commonly used meta-characters (e.g., \R would be nice to
  have, and I can't think of anything else off the top of my head).
So, essentially, some of the very advanced and not commonly used features.
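For anyone unfamiliar with the first item, here is a small sketch of what
named captures buy you and how to get by without them. It uses Python's re
module purely as a stand-in (it is neither PCRE nor ICU); the (?P<name>...)
spelling is one PCRE also accepts. The GROUPS table and the field() helper are
hypothetical names made up for this example.

```python
import re

DATE = "Fri, 31 Jul 2009"

# Named groups: each capture is retrieved by a descriptive name.
named = re.search(
    r"(?P<day>\d{1,2}) (?P<month>[A-Za-z]{3}) (?P<year>\d{4})", DATE
)

# The same pattern with plain numbered groups, which is what you fall back to
# in an engine without named captures.
numbered = re.search(r"(\d{1,2}) ([A-Za-z]{3}) (\d{4})", DATE)

# Hypothetical lookup table that restores the readable names by hand.
GROUPS = {"day": 1, "month": 2, "year": 3}


def field(match, name):
    """Fetch a numbered capture through its hand-maintained name."""
    return match.group(GROUPS[name])
```

The cost of the workaround is that the table has to be kept in sync with the
pattern whenever a group is added or removed.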
Some features present in ICU and not in PCRE:
- Vastly more sophisticated word breaking (a regex pattern option; it can do
  word breaking on Thai, for example). This can be a make or break feature in
  its own right: if you need it, you /NEED/ it.
- More elaborate character sets (can perform basic (math) set operations on []
  character sets: union, intersection, minus).
- The \p / \P {} meta-characters can accept "more" stuff, pretty much the
  whole gamut of Unicode properties.
Not meant to be an exhaustive list, but I think it covers the majors.
There are also some minor idiosyncrasies, such as when some characters need
to be escaped relative to the other syntax, but they're fairly rare (I can't
even come up with an example on the fly, I just remember that it's popped up
from time to time).
One final note: if the strings you're going to be matching are predominately
"unicode heavy" (ie, not simple ASCII or MacOSRoman), definitely go for ICU
/ RegexKitLite. While the ICU regex engine is limited to only working on
UTF-16 encoded strings, PCRE can only work on UTF-8 strings. NSString
'ranges' are always in UTF-16 code units, which for most languages is a 1:1
mapping of offset to character. UTF-8 uses a variable length encoding
format, and converting UTF-8 byte offsets to UTF-16 character offsets
is brutal.
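To show why that conversion hurts, here is a rough Python sketch of the
simplest correct approach. The function name utf8_offset_to_utf16 is made up
for this example, and this is not how any particular library does it. The
point is that there is no arithmetic shortcut: everything before the offset
has to be decoded and re-counted, so each conversion costs time proportional
to the offset.

```python
def utf8_offset_to_utf16(text, byte_offset):
    """Convert a byte offset into text's UTF-8 form to a UTF-16 code unit offset.

    Hypothetical helper for illustration. Raises UnicodeDecodeError if
    byte_offset falls in the middle of a multi-byte UTF-8 sequence.
    """
    data = text.encode("utf-8")
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte_offset is outside the encoded string")
    # Decode everything before the offset, then count its UTF-16 code units.
    prefix = data[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


text = "naïve 😀 ok"

# Where a UTF-8 based engine would report the match for "ok" (a byte offset).
byte_pos = text.encode("utf-8").find(b"ok")

# Where an NSString-style range needs it (a UTF-16 code unit offset).
unit_pos = utf8_offset_to_utf16(text, byte_pos)
```

In this string the accented letter takes 2 bytes in UTF-8 but 1 UTF-16 code
unit, and the emoji takes 4 bytes but 2 code units, so the two offsets drift
apart as soon as the text stops being plain ASCII.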
_______________________________________________
Cocoa-dev mailing list (email@hidden)