Re: RegEx question

Subject: Re: RegEx question
From: Christopher Nebel <email@hidden>
Date: Tue, 20 Apr 2004 10:51:16 -0700

On Apr 20, 2004, at 2:22 AM, Wim Melis wrote:

> This question applies to a 'do shell script' that calls Perl for some RegEx work. All works fine, but there's one thing I couldn't find.
>
> Is it possible to set RegEx so that when you search for a string, it also finds all the variations with diacriticals?
>
> For instance: from a search string '/xanax/', I would like Perl to also return spellings where the letters in it have accents, tildes, umlauts, etc.
>
> (I guess you can tell what it's for... we're now receiving over a thousand spams a day)
>
> Is this possible?

First off, you can do this test directly in AppleScript much more easily than you can do it in Perl:

    ignoring case and diacriticals -- could include punctuation for good measure...
        return x is "xanax" -- x holds the string being tested
    end ignoring

Done. That said, this was an interesting problem, so here's the Perl answer -- yes. (Though only because modern Perl is fairly Unicode-savvy.) Perl's regular expressions have a nifty extension, \p (negated as \P), that matches characters that have (or don't have) a particular Unicode property. Combine this with the Unicode::Normalize module, and we get something like this:

    set n to {"xanax", "xañaX", "xánax"}
    -- probably won't survive the list server, but that's a normal "xanax",
    -- one with n-tilde, and one with a-acute.
    repeat with i in n
        -- all one line...
        do shell script "perl -e 'use utf8; use Unicode::Normalize; print \"yes\" if NFD(\"" & i & "\") =~ /x\\pM*a\\pM*n\\pM*a\\pM*x\\pM*/i'"
        -- returns "yes" for all the above strings.
    end repeat

Here's what this does: first, we assume that the string will be, essentially, "xanax" with maybe a bunch of diacritical marks. Second, we use NFD (see the Unicode::Normalize man page) to turn the original string into the "decomposed" form (technically, Normalization Form D, hence the function name), so the marks are separate from the base letters. Finally, we match (case-insensitively) the pattern "xanax", but with "\pM*" after each letter -- that is, zero or more of any character that has the "mark" property, i.e., it's a combining mark, which all the accents are. Voila!
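
For readers who want to try the same idea outside the AppleScript/Perl one-liner, here is a minimal sketch in Python. It is not the code from this message, only an illustration of the decompose-then-match steps described above. The function names are hypothetical, and because Python's standard `re` module has no `\pM`, the sketch approximates it with the Combining Diacritical Marks block (U+0300 to U+036F), which is narrower than Perl's full "mark" property.

```python
import re
import unicodedata

# Assumption: stand-in for Perl's \pM*. This covers only the Combining
# Diacritical Marks block (U+0300-U+036F), not every Unicode mark.
MARKS = "[\u0300-\u036f]*"


def mark_tolerant_pattern(word):
    """Build a case-insensitive regex for `word` that allows any number
    of combining marks after each letter (hypothetical helper)."""
    body = "".join(re.escape(ch) + MARKS for ch in word)
    return re.compile(body, re.IGNORECASE)


def matches(word, text):
    """Return True if `text` contains `word`, ignoring case and diacriticals."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return mark_tolerant_pattern(word).search(decomposed) is not None


if __name__ == "__main__":
    # Plain, n-tilde, a-acute, and an unrelated word for contrast.
    for candidate in ["xanax", "xa\u00f1aX", "x\u00e1nax", "xanadu"]:
        print(candidate, matches("xanax", candidate))
```

Building the pattern from the search word avoids typing the mark class by hand after every letter, which is the part of the one-liner that gets tedious for longer words.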

--Chris Nebel
AppleScript Engineering