Iain King wrote:
On Jan 21, 2:18 pm, Wilbert Berendsen <wbs...@xs4all.nl> wrote:
On Monday, 18 January 2010, Adi wrote:

keys = [(len(key), key) for key in mapping.keys()]
keys.sort(reverse=True)
keys = [key for (_, key) in keys]
pattern = "(%s)" % "|".join(keys)
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"
re.subn(pattern, repl, s)
I managed to make it even shorter by using the key argument for sorted, not
putting the whole regexp inside parentheses, and pre-compiling the regular
expression:

import re

mapping = {
        "foo" : "bar",
        "baz" : "quux",
        "quuux" : "foo"

}

# sort the keys, longest first, so 'aa' gets matched before 'a', because
# in Python regexps the first match (going from left to right) in a
# |-separated group is taken
keys = sorted(mapping.keys(), key=len, reverse=True)

rx = re.compile("|".join(keys))
repl = lambda x: mapping[x.group()]
s = "fooxxxbazyyyquuux"
rx.sub(repl, s)

One thing remaining: if the replacement keys could contain non-alphanumeric
characters, they should be escaped using re.escape:

rx = re.compile("|".join(re.escape(key) for key in keys))
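
Putting the pieces together, a minimal sketch might look like the following. The keys "a" and "a.b" and the helper name replace_all are made up for illustration; they are there only to show why both the longest-first ordering and re.escape matter.

```python
import re

# Hypothetical mapping: "a.b" contains a regex metacharacter and
# also starts with the shorter key "a".
mapping = {
    "a": "1",
    "a.b": "2",
    "foo": "bar",
}

# Longest keys first, so "a.b" is tried before "a".
keys = sorted(mapping, key=len, reverse=True)

# re.escape makes the "." in "a.b" match a literal dot only.
rx = re.compile("|".join(re.escape(key) for key in keys))

def replace_all(s):
    # Look up each matched key in the mapping.
    return rx.sub(lambda m: mapping[m.group()], s)

print(replace_all("a.b axb foo"))  # prints: 2 1xb bar
```

Without re.escape, the unescaped pattern a.b would also match "axb" and replace it with "2".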

Kind regards,
Wilbert Berendsen

--http://www.wilbertberendsen.nl/
"You must be the change you wish to see in the world."
        -- Mahatma Gandhi

Sorting it isn't the right solution: easier to hold the subs as tuple
pairs and by doing so let the user specify order.  Think of the
following subs:

"fooxx" -> "baz"
"oxxx" -> "bar"

does the user want "bazxbazyyyquuux" or "fobarbazyyyquuux"?
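
One possible reading of "tuple pairs in user order" is to apply the substitutions one after another, in the order given. The sketch below assumes that reading; apply_in_order is a hypothetical helper, not code from this thread. Note that it behaves differently from the single-pass regex: text produced by an earlier substitution can be hit again by a later one.

```python
def apply_in_order(pairs, s):
    # pairs: sequence of (target, replacement) tuples in the caller's order.
    # Each pair is applied to the whole string before the next one runs.
    for target, replacement in pairs:
        s = s.replace(target, replacement)
    return s

s = "fooxxxbazyyyquuux"

# "fooxx" first: the "oxxx" target no longer exists afterwards.
print(apply_in_order([("fooxx", "baz"), ("oxxx", "bar")], s))  # prints: bazxbazyyyquuux

# "oxxx" first: the "fooxx" target no longer exists afterwards.
print(apply_in_order([("oxxx", "bar"), ("fooxx", "baz")], s))  # prints: fobarbazyyyquuux
```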

Iain
There is no way you can automate a user's choice. If he wants the second choice (oxxx->bar), he would have to add a third pattern: fooxxx -> fobar. In general, the rules 'upstream over downstream' and 'long over short' make sense in practically all cases. Except for simple substitution runs whose behavior is obvious, the result needs to be checked for unintended hits. Here is an example from my SE manual, which runs a (whimsical) text through a set of substitutions with deliberately overlapping targets:

>>> substitutions = [['be', 'BE'], ['being', 'BEING'], ['been', 'BEEN'], ['bee', 'BEE'], ['belong', 'BELONG'], ['long', 'LONG'], ['longer', 'LONGER']]
>>> T = Translator (substitutions) # Code further up in this thread handling precedence by the two rules mentioned
>>> text = "There was a bee named Mabel belonging to hive nine longing to be a beetle and thinking that being a bee was okay, but she had been a bee long enough and wouldn't be one much longer."
>>> print T (text)
There was a BEE named MaBEl BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG enough and wouldn't BE one much LONGER.

All word-length substitutions resolve correctly. There are four unintended translations, though: MaBEl, BELONGing, LONGing and BEEtle. Adding the substitution Mabel->Mabel would prevent the first miss. The others could be taken care of similarly by replacing the target with itself. With large substitution sets and extensive data, this amounts to an iterative process of running, checking and fixing, many times over. That just isn't practical, and the approach may have to be abandoned when the substitutions catalog grows beyond reasonable bounds. Runs are dependable where the targets are predictably unique, such as long ID numbers that cannot possibly match anything but ID numbers.
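
The identity-substitution fix can be sketched as follows. The Translator class here is a hypothetical stand-in built on the regex approach from earlier in the thread, not the SE implementation; the names shields and T_shielded are likewise made up for illustration.

```python
import re

class Translator(object):
    """Hypothetical stand-in: single-pass, longest target first."""

    def __init__(self, substitutions):
        self.table = dict(substitutions)
        # 'long over short': longer targets come first in the alternation.
        targets = sorted(self.table, key=len, reverse=True)
        # 'upstream over downstream' follows from the left-to-right scan.
        self.rx = re.compile("|".join(re.escape(t) for t in targets))

    def __call__(self, text):
        return self.rx.sub(lambda m: self.table[m.group()], text)

substitutions = [['be', 'BE'], ['being', 'BEING'], ['been', 'BEEN'],
                 ['bee', 'BEE'], ['belong', 'BELONG'], ['long', 'LONG'],
                 ['longer', 'LONGER']]

# Identity substitutions: each word maps to itself, which shields it
# from the shorter targets it contains.
shields = [[w, w] for w in ('Mabel', 'belonging', 'longing', 'beetle')]

text = ("There was a bee named Mabel belonging to hive nine longing to be "
        "a beetle and thinking that being a bee was okay, but she had been "
        "a bee long enough and wouldn't be one much longer.")

T_shielded = Translator(substitutions + shields)
print(T_shielded(text))
```

With the four shields in place, Mabel, belonging, longing and beetle come through unchanged, while the word-length targets are still translated. Every new unintended hit needs another shield, which is the iterative cost described above.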

Frederic




