On Jul 17, 2008, at 10:09 AM, Philip Mötteli wrote:

Hi,


Does anybody know of a library, that takes a bunch of strings and produces a regex-string from them?
E. g:

"Word1"
"Word2"
"Word5"
"Word8"
"Word11"
"Word19"
"Word23"
"Word45"
"Word77"

should give "Word[0-9]{1,2}". Or I would even be more happy with "Word[0-9]+".
I've heard of Grail+. But are there any other options?


This is sort of a tricky problem. Finding and building a regular expression that only matches a set of words is actually trivial. For your particular example you could probably put something together in the shell like perl -e 'while(<>) { } ....' to go through a list of words and spit out a regex that would match only those words.

The real problem is in what you haven't said. How you're going to use it. Do you need to build in dynamically at run time, or are all the words known in advance and static from that point on? How fast do you need it to be? Does it need to be able to handle UTF-16, or will it only match ASCII? Do you want a regex that matches the list of words / exactly/? Or is the list of words a 'hint' and you want the 'obvious' regex (which would be the regexes you suggested)? Answers to those questions would really change which approach I would take in trying to solve the problem.

If you're looking for something that can construct the 'obvious' regex from an example 'hint' list, you're probably much better off just doing this by hand. If you need to build the 'obvious' regex dynamically at runtime, I would suggest you rethink the problem you're trying to solve. You might be able to get something that can heuristically get things right most of the time, but it's likely to be very fragile and break in unexpected ways.

Possible ideas:

Raw speed, exact match of a list of words: You could probably throw something together quickly that will construct a DFA for the given list of words. Performance from this is O(1), but if your list of words is 'complicated' and large, the state tables will be huge (like, really really huge. A naive implementation can easily blow through megabytes of state tables for complicated patterns.).

If you need to match ASCII like text, know everything in advance at compile time, and need blazing speed, take a look at `flex` (http://flex.sourceforge.net ) (installed with dev tools). It should be trivial to write a perl script that takes in your lists of known words and builds a .l file that returns a constant for each 'family' of words.

If you have a list of words, and you need to use a regex at run time, you could use something like 'http://search.cpan.org/~dland/Regexp-Assemble-0.34/Assemble.pm' You could take your list of words, feed them in to that module, and it should spit out something that will a single regex that will match your given list of words exactly. This is probably the closest thing to a /literal/ regex inversion you're likely to find. But the regex output will match the input list and only the input list.

Using the mentioned perl module, you may need to do some 'preprocessing' in the input to get more reasonable results. You would probably need to do a quick scan of the list going in and spot by eye the 'obvious' pattern candidates. In your example, you could do a simple regex search and replace of 's/[0-9]/\d/g', which would squash everything in to a 'Word\d' or 'Word\d\d', and the regex assemble package would (probably) just discard the duplicates and crank out something like 'Word\d{1,2}'.

Like I said, if you need to be able to generate the 'obvious' regex from a sample list dynamically on the fly at run time, you're probably going to be in for some pain. :( If everything is known in advance, at compile time, and the list of words you need to match is relatively small (<~30) and fairly obvious and simple, I'd just do it by hand as you'll probably spend more time 'automating' the process than it would have taken you to just do it._______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to [EMAIL PROTECTED]

Reply via email to