Re: Inverse Regex Library?
Re: Inverse Regex Library?
- Subject: Re: Inverse Regex Library?
- From: John Engelhart <email@hidden>
- Date: Thu, 17 Jul 2008 17:16:50 -0400
On Jul 17, 2008, at 10:09 AM, Philip Mötteli wrote:
Hi,
Does anybody know of a library, that takes a bunch of strings and
produces a regex-string from them?
E. g:
"Word1"
"Word2"
"Word5"
"Word8"
"Word11"
"Word19"
"Word23"
"Word45"
"Word77"
should give "Word[0-9]{1,2}". Or I would even be more happy with
"Word[0-9]+".
I've heard of Grail+. But are there any other options?
This is sort of a tricky problem. Finding and building a regular
expression that only matches a set of words is actually trivial. For
your particular example you could probably put something together in
the shell like perl -e 'while(<>) { } ....' to go through a list of
words and spit out a regex that would match only those words.
The real problem is in what you haven't said. How you're going to use
it. Do you need to build in dynamically at run time, or are all the
words known in advance and static from that point on? How fast do you
need it to be? Does it need to be able to handle UTF-16, or will it
only match ASCII? Do you want a regex that matches the list of words /
exactly/? Or is the list of words a 'hint' and you want the 'obvious'
regex (which would be the regexes you suggested)? Answers to those
questions would really change which approach I would take in trying to
solve the problem.
If you're looking for something that can construct the 'obvious' regex
from an example 'hint' list, you're probably much better off just
doing this by hand. If you need to build the 'obvious' regex
dynamically at runtime, I would suggest you rethink the problem you're
trying to solve. You might be able to get something that can
heuristically get things right most of the time, but it's likely to be
very fragile and break in unexpected ways.
Possible ideas:
Raw speed, exact match of a list of words: You could probably throw
something together quickly that will construct a DFA for the given
list of words. Performance from this is O(1), but if your list of
words is 'complicated' and large, the state tables will be huge (like,
really really huge. A naive implementation can easily blow through
megabytes of state tables for complicated patterns.).
If you need to match ASCII like text, know everything in advance at
compile time, and need blazing speed, take a look at `flex` (http://flex.sourceforge.net
) (installed with dev tools). It should be trivial to write a perl
script that takes in your lists of known words and builds a .l file
that returns a constant for each 'family' of words.
If you have a list of words, and you need to use a regex at run time,
you could use something like 'http://search.cpan.org/~dland/Regexp-Assemble-0.34/Assemble.pm'
You could take your list of words, feed them in to that module, and
it should spit out something that will a single regex that will match
your given list of words exactly. This is probably the closest thing
to a /literal/ regex inversion you're likely to find. But the regex
output will match the input list and only the input list.
Using the mentioned perl module, you may need to do some
'preprocessing' in the input to get more reasonable results. You
would probably need to do a quick scan of the list going in and spot
by eye the 'obvious' pattern candidates. In your example, you could
do a simple regex search and replace of 's/[0-9]/\d/g', which would
squash everything in to a 'Word\d' or 'Word\d\d', and the regex
assemble package would (probably) just discard the duplicates and
crank out something like 'Word\d{1,2}'.
Like I said, if you need to be able to generate the 'obvious' regex
from a sample list dynamically on the fly at run time, you're probably
going to be in for some pain. :( If everything is known in advance,
at compile time, and the list of words you need to match is relatively
small (<~30) and fairly obvious and simple, I'd just do it by hand as
you'll probably spend more time 'automating' the process than it would
have taken you to just do it._______________________________________________
Cocoa-dev mailing list (email@hidden)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden