Re: RegEx question
- Subject: Re: RegEx question
- From: Christopher Nebel <email@hidden>
- Date: Tue, 20 Apr 2004 10:51:16 -0700
On Apr 20, 2004, at 2:22 AM, Wim Melis wrote:
This question applies to a 'do shell script' that calls Perl for some
RegEx work. All works fine, but there's one thing I couldn't find.
Is it possible to set RegEx so that when you search for a string, it
also finds all the variations with diacriticals?
For instance: from a search string '/xanax/', I would like Perl to
also return spellings where the letters in it have accents, tildes,
umlauts, etc.
(I guess you can tell what it's for... we're now receiving over a
thousand spams a day)
Is this possible?
First off, you can do this test directly in AppleScript much more
easily than you can do it in Perl:
ignoring case and diacriticals -- could include punctuation for good measure...
    return x is "xanax"
end ignoring
Done. That said, this was an interesting problem, so here's the Perl
answer -- yes. (Though only because modern Perl is fairly
Unicode-savvy.) Perl has a nifty extra that lets you match characters
that have (or don't have) a particular Unicode property. Combine this
with the Unicode::Normalize module, and we get something like this:
set n to {"xanax", "xañaX", "xánax"}
-- probably won't survive the list server, but that's a normal "xanax",
-- one with n-tilde, and one with a-acute.
repeat with i in n
    -- all one line...
    do shell script "perl -e 'use utf8; use Unicode::Normalize; print \"yes\" if NFD(\"" & i & "\") =~ /x\\pM*a\\pM*n\\pM*a\\pM*x\\pM*/i'"
    -- returns "yes" for all the above strings.
end repeat
Here's what this does: first, we assume that the string will be,
essentially, "xanax" with maybe a bunch of diacritical marks. Second,
we use NFD (see the Unicode::Normalize man page) to turn the original
string into the "decomposed" form (technically, Normalization Form D,
hence the function name), so the marks are separate from the base
letters. Finally, we match (case-insensitively) the pattern "xanax",
but with "\pM*" after each letter -- that is, zero or more of any
character that has the "mark" property, i.e., it's a combining mark,
which all the accents are. Voila!
--Chris Nebel
AppleScript Engineering