On 19/06/2006 6:30 AM, Paddy wrote: > Kay Schluehr wrote: >> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a >> regular expression sx from it, such that sx.match(s) yields a SRE_Match >> object when s starts with an s_i for one i in [0,...,n]. There might >> be relations between those strings: s_k.startswith(s_1) -> True or >> s_k.endswith(s_1) -> True.
Kay, what is the relevance of one string being a suffix of another? I don't see how that could affect the outcome. An extreme case would be ls = ['a', 'aa', >> ...,'aaaa...ab']. For this reason SRE_Match should provide the longest >> possible match. >> >> Is there a Python module able to create an optimized regex rx from ls >> for the given constraints? Optimised with respect to what? speed? ease of construction? I presume that you will ensure that the members of the list are unique. Note that the Python regex engine will consider each candidate in Paddy's solution left to right until it gets a match or reaches the end (that's why the reverse sort is needed to get longest possible match). This is worst-case O(N) where N is the total of the lengths of all the strings in your list. As far as I can see, this is the only basic solution (modulo one or two twiddles -- see below) using Python's re. You could possibly consider producing "zzz|foo(?:bar)?|aaa" instead of "zzz|foobar|foo|aaa" -- but whether that would run sufficiently faster to offset the cost of construction is anybody's guess. How many strings in your list? Average/maximum length? Likelihood of ls[i].startswith(ls[j]) == True? unicode or str? Your requirements are rather constrained: "sx.match(s) yields a SRE_Match object" ... why do you need this? Surely all you need is matched_len (which may be zero) such that s[:matched_len] is the matched prefix. I would have thought the way to approach this would be a simple character-by-character tree/trie-driven lookup. This would be worst case O(n) where n is the length of the longest string in your list. There may well be a Python-callable gadget for this on the net somewhere. Google "Danny Yoo ahocorasick" for a Python-callable solution to a similar but more complex problem. A cheap hack using Python's re: divide the problem according to first character: prefix_dict_match = { 'a': re.compile('alpaca|alligator').match, 'f': re.compile('foobar|foo').match, 'z': re.compile('zoo|zebra').match, } if s and s[0] in prefix_dict_match: match_obj = prefix_dict_match[s[0]](s) else: match_obj = None >> >> Regards, >> Kay > > A start would be: > regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")" > But the above does not work if you have special characters in your > strings. Paddy, fixing that problem, and "optimising" by removing the redundant ^() metacharacters: regexp = "|".join(map(re.escape, sorted(ls, reverse=True))) Hoping some of this helps, Regards, John -- http://mail.python.org/mailman/listinfo/python-list