Re: Xcode Editor's Regex now uses PCRE instead of ICU?
- Subject: Re: Xcode Editor's Regex now uses PCRE instead of ICU?
- From: John Engelhart <email@hidden>
- Date: Wed, 12 Mar 2008 19:01:47 -0400
On Mar 11, 2008, at 9:25 AM, Alastair Houghton wrote:
On 11 Mar 2008, at 04:19, Stuart Malin wrote:
This question of what Xcode uses aside, I am curious though: I seem
to be sensing the reason for the use of ICU is its support of
Unicode. But doesn't PCRE support Unicode?
Not to the same extent that ICU does. ICU is the canonical
implementation of the Unicode spec.
Also, ICU's regex engine is implemented natively for UTF-16. PCRE's
interface is UTF-8. If, as is commonly the case for Cocoa apps,
strings are stored internally in UTF-16, you would have to convert
to use PCRE whereas ICU's engine can handle the native representation.
I've actually found this not to be true in practice. My observed
behavior of CFString / NSString is that it tries to avoid converting
the buffer the string was initialized with if possible.
From an API perspective, however, strings "appear" to be UTF-16
encoded. Naturally, these are implementation internal details, so the
usual caveats apply here.
Since CFString / NSString don't keep their internal buffers in a fixed
format (i.e., UTF-16), the question becomes one of "What is the
most common internal format?" The answer to this is extremely usage
sensitive. Unicode strings encoded in UTF-8 take a variable number of
bytes per character, anywhere from 1 to 4 bytes (the original UTF-8
design allowed sequences of up to 6), UTF-16 requires either 2
or 4 bytes to encode each character, and finally UTF-32 always
requires exactly 4 bytes to encode a character. Encoding ASCII
strings in UTF-16 or UTF-32 rather than UTF-8 automatically doubles or
quadruples the size required to store the string. Without getting into
compression (even the Unicode standard compression), the optimal
encoding for the least amount of bytes used is heavily string dependent.
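To put rough numbers on the size argument, here is a minimal Python sketch that simply encodes a string three ways and counts bytes. The helper name encoded_sizes and the example URL are made up for illustration, and this says nothing about what CFString / NSString actually do internally:

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# A plain ASCII URL (hypothetical) and a ten-character Thai string.
ASCII_URL = "http://example.com/index.html"
THAI = "\u0e09\u0e31\u0e19\u0e01\u0e34\u0e19\u0e02\u0e49\u0e32\u0e27"

for label, sample in (("ascii", ASCII_URL), ("thai", THAI)):
    print(label, len(sample), "characters:", encoded_sizes(sample))
```

For the ASCII URL, UTF-8 is the smallest of the three. For the Thai string, UTF-16 wins, because each of those characters needs 3 bytes in UTF-8 but only 2 in UTF-16.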
My observations on how CFString / NSString keep their internal buffers
(which, it goes without saying, is an internal detail not to be
depended on) are roughly:
If the string is ASCII, or otherwise "8-bit simple / optimal", down
convert the buffer to UTF-8 (which ASCII is a subset of) if the
initialization buffer is not already in a UTF-8 compatible format.
Otherwise, convert the string to native endian UTF-16 if the
initialization buffer is not already.
Owing to both its English / ASCII development and Unix roots, an
awful lot of strings fall into the first category, which is also a
win in terms of the space required for the buffer.
RegexKit has various DTrace probes embedded in it, one of them is the
"PerformanceNote" probe that will fire when some non-optimal condition
is detected. Since RegexKit uses PCRE, and thus requires UTF-8
encoded buffers, the reason I put this probe in RegexKit in the first
place was so that I could tell when just such "encoding mismatch"
issues happen. When matching a string, RegexKit checks the string's
"Fastest Encoding" to see if it's UTF-8 compatible, and uses
CFStringGetCStringPtr() to try to get direct access to the string
buffer if at all possible. If it is unable to get direct access to a
UTF-8 compatible buffer, it obviously has to go through the expensive
process of converting that string into UTF-8, and will fire off a
DTrace PerformanceNote probe.
Since DTrace allows you to trivially snoop in on any process in the
system at any time, and Safari AdBlock happens to use RegexKit, we can
get an idea of how often the URLs that Safari AdBlock is checking,
which (presumably) it's getting straight from Safari, need to be
converted:
shell% sudo dtrace -Z -q -n 'RegexKit*:::PerformanceNote { this->description = arg6 == 0 ? "" : copyinstr(arg6); printf("Note: %s\n", this->description);}'
... [time, and web surfing passes] ...
^C
shell%
Nada, not a single URL Safari AdBlock handed to RegexKit had to be
converted to UTF-8, which would mean that in this particular usage of
regular expression engines, using ICU would require the constant up
conversion to UTF-16 for every single URL that passed through Safari
(often hundreds per page). Just to make sure things were 'working', I
kicked off the unit tests:
shell% sudo dtrace -Z -q -n 'RegexKit*:::PerformanceNote { this->description = arg6 == 0 ? "" : copyinstr(arg6); printf("Note: %s\n", this->description);}'
...
Note: UTF16 to UTF8 requires slow conversion.
Note: UTF8 to UTF16 requires slow conversion.
Note: NSString encoding requires expensive UTF8 conversion.
Note: NSString encoding requires expensive UTF8 conversion.
Note: pcre_study() was able to optimize the regular expression.
Note: Slow conversion via sscanf.
...
Note the "NSString encoding requires expensive UTF8 conversion.",
indicating that the buffer for the source string was not in a UTF-8
friendly encoding, and thus required a full conversion from the source
encoding into UTF-8. The UTF8 to UTF16 / UTF16 to UTF8 messages are
a related issue: All the offsets returned by PCRE are byte offsets into
the UTF-8 encoded buffer, and must be converted to their equivalent
offsets in UTF-16 code units. Happily,
ASCII byte offsets are exactly equal to their UTF-16 character
offsets, and thus don't require a conversion between encodings.
Conversion is required due to the "from a user of the API's
perspective, everything looks like a UTF-16 encoded string"
requirement so that things like NSRange values from RegexKit are
usable with other Foundation / NSString methods.
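As a rough illustration of the offset conversion described above, here is a small Python sketch. The function name utf8_offset_to_utf16 is hypothetical and this is not RegexKit's actual code; it just shows why the mapping is the identity for ASCII and needs a pass over the data otherwise:

```python
def utf8_offset_to_utf16(utf8_bytes, byte_offset):
    """Map a byte offset in UTF-8 data to an offset in UTF-16 code units.

    byte_offset must fall on a character boundary, otherwise the decode
    below raises UnicodeDecodeError.
    """
    # Decode everything before the offset, then count the UTF-16 code
    # units (2 bytes each) needed to represent that prefix.
    prefix = utf8_bytes[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2


ascii_data = "hello world".encode("utf-8")
thai_data = "\u0e09\u0e31\u0e19\u0e01\u0e34\u0e19".encode("utf-8")

print(utf8_offset_to_utf16(ascii_data, 6))  # ASCII: offset is unchanged
print(utf8_offset_to_utf16(thai_data, 9))   # 3 bytes per Thai character
```

This naive version rescans the prefix for every offset, so it is linear in the length of the string. That is the kind of work the "slow conversion" notes are flagging, and the kind of work pure ASCII strings avoid entirely.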
Again, because these results are peeking into the internal workings
of objects, the results could be completely different for someone
else. For example, CoreFoundation / Foundation would be free to
"convert the internal buffer representation to the user's locale",
which might just always force the conversion of ASCII strings to
UTF-16, so take the above with as large a grain of salt as you feel is
appropriate. :)
PS- CFStringGetCStringPtr() is a public API function. It's a
pragmatic tradeoff between internal details and reality: strings are
used a lot and it saves a substantial amount of overhead if you don't
have to constantly create temporary buffers for every little string
operation. The docs are pretty clear that if it works, great, but
have a (slower) backup plan for when it returns NULL:
This function either returns the requested pointer immediately, with
no memory allocations and no copying, in constant time, or returns
NULL. If the latter is the result, call an alternative function such
as the CFStringGetCString function to extract the characters.
Another difference IIRC is that the set support is very much more
sophisticated in ICU. In Perl and PCRE, there is basic support for
character sets (the square-bracket syntax), but AFAIK there are no
set operations (besides inversion), and I think the set of Unicode
properties you can query is somewhat smaller than for ICU. ICU also
supports string values as members of character sets, presumably so
that you can use combining marks and the like in a set.
Emacs! vi!
PCRE doesn't have ICU character class set operations, but the one
thing that PCRE lacks that ICU shines in is "enhanced \b break
detection". There was a recent discussion on cocoa-dev regarding this
topic, and almost simultaneously I got a request from a user for help
using RegexKit to perform word breaking with \b on Thai strings. In
PCRE, \w \d \s and \b (and friends) are only "ASCII aware", but in ICU
are "Unicode aware" (surprise).
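The "ASCII aware" versus "Unicode aware" distinction is easy to see by analogy in Python's re module, which can be switched between the two behaviors with the re.ASCII flag. To be clear, this is Python's engine standing in for illustration, not PCRE or ICU, and the sample text is made up:

```python
import re

# "naive cafe" with a precomposed i-diaeresis and e-acute.
text = "na\u00efve caf\u00e9"

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented letters
# split the words apart.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Without the flag, \w matches Unicode letters as well.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

Neither mode does anything like dictionary-based word breaking, which is a separate matter from which characters count as \w.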
There is an option when compiling a regex in ICU, UREGEX_UWORD, which
turns on "enhanced" \b behavior. As I've come to learn (speaking only
English myself), this is one of those things that if you need it, you
NEED it, and there is no simple workaround. The regular ICU \b
behavior is "like" PCRE's, or can be reasonably simulated in PCRE with
\p{} and assertions, but the enhanced \b brings the specialized,
dictionary driven ICU word breaker to bear on finding word breaks:
The regex "(\w+?)\b" with UREGEX_UWORD turned on:
[johne@LAPTOP_10_5] icu% ./icu_matcher
2008-03-07 20:10:34.774 icu_matcher[43421:807] subject: 'ฉันกินข้าว'
2008-03-07 20:10:34.865 icu_matcher[43421:807] matched: 'ฉัน'
2008-03-07 20:10:34.869 icu_matcher[43421:807] range : '{0, 3}'
2008-03-07 20:10:34.873 icu_matcher[43421:807] matched: 'กิน'
2008-03-07 20:10:34.877 icu_matcher[43421:807] range : '{3, 3}'
2008-03-07 20:10:34.881 icu_matcher[43421:807] matched: 'ข้าว'
2008-03-07 20:10:34.884 icu_matcher[43421:807] range : '{6, 4}'
And turned off:
[johne@LAPTOP_10_5] icu% ./icu_matcher
2008-03-12 18:49:50.788 icu_matcher[88171:807] subject: 'ฉันกินข้าว'
2008-03-12 18:49:50.882 icu_matcher[88171:807] matched: 'ฉันกินข้าว'
2008-03-12 18:49:50.888 icu_matcher[88171:807] range : '{0, 10}'
UREGEX_UWORD turned on (sic) under 10.4:
[johne@LAPTOP_X86] /tmp% ./icu_matcher
2008-03-08 16:18:03.932 icu_matcher[11810] subject: 'ฉันกินข้าว'
2008-03-08 16:18:03.954 icu_matcher[11810] matched: 'ฉันกินข้าว'
2008-03-08 16:18:03.954 icu_matcher[11810] range : '{0, 10}'
Well, one can dream, I suppose. The user ended up "fixing" this by
building the latest version of ICU under 10.4 and linking against
that. I don't know why 10.4 behaves like this. The latest ICU
version is 3.8, 10.5 uses 3.6, and 10.4 uses 3.2. I suspect, however,
it's because 10.4 omits the Thai word breaking dictionary and thus
falls back to ordinary \b behavior. A (very) rough estimate is that the
breaking dictionaries are about 450K for everything, with the Thai word
breaking dictionary weighing in at ~240K alone!
Kind regards,
Alastair.
--
http://alastairs-place.net
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Xcode-users mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden