Re: substitution

Anthra Norell Thu, 21 Jan 2010 23:31:52 -0800

Iain King wrote:

On Jan 21, 2:18 pm, Wilbert Berendsen <wbs...@xs4all.nl> wrote:

On Monday 18 January 2010, Adi wrote:

keys = [(len(key), key) for key in mapping.keys()]
keys.sort(reverse=True)
keys = [key for (_, key) in keys]

pattern = "(%s)" % "|".join(keys)

repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"

re.subn(pattern, repl, s)

I managed to make it even shorter, using the key argument for sorted, not
putting the whole regexp inside parentheses and pre-compiling the regular
expression:

import re

mapping = {
        "foo" : "bar",
        "baz" : "quux",
        "quuux" : "foo"

}

# sort the keys, longest first, so 'aa' gets matched before 'a', because
# in Python regexps the first match (going from left to right) in a
# |-separated group is taken
keys = sorted(mapping.keys(), key=len, reverse=True)

rx = re.compile("|".join(keys))
repl = lambda x: mapping[x.group()]
s = "fooxxxbazyyyquuux"
rx.sub(repl, s)

One thing remaining: if the replacement keys could contain non-alphanumeric
characters, they should be escaped using re.escape:

rx = re.compile("|".join(re.escape(key) for key in keys))
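A small sketch of the two points above, using made-up patterns ("a|aa" and "a.c" are illustrative only and not from the thread): the first alternative that matches at a position wins, whatever its length, and an unescaped key is read as a regular expression rather than as literal text.

```python
import re

# Alternation picks the first alternative that matches at a position,
# not the longest one, so the order of the keys matters.
short_first_hits = re.findall("a|aa", "aaa")
long_first_hits = re.findall("aa|a", "aaa")

# An unescaped key is interpreted as a pattern: '.' matches any character.
key = "a.c"  # hypothetical key containing a regex metacharacter
unescaped_hits = re.findall(key, "abc a.c")
escaped_hits = re.findall(re.escape(key), "abc a.c")

print(short_first_hits)  # ['a', 'a', 'a']
print(long_first_hits)   # ['aa', 'a']
print(unescaped_hits)    # ['abc', 'a.c']
print(escaped_hits)      # ['a.c']
```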

Kind regards,
Wilbert Berendsen

--http://www.wilbertberendsen.nl/
"You must be the change you wish to see in the world."
        -- Mahatma Gandhi


Sorting it isn't the right solution: easier to hold the subs as tuple
pairs and by doing so let the user specify order.  Think of the
following subs:

"fooxx" -> "baz"
"oxxx" -> "bar"

does the user want "bazxbazyyyquuux" or "fobarbazyyyquuux"?
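One possible reading of the tuple-pairs idea, as a rough sketch: apply the pairs one after another, in the order given, with plain str.replace. The function name apply_in_order is invented for illustration. Note that this is sequential replacement, not the single-pass regex above, so a later pair can also hit text produced by an earlier one.

```python
def apply_in_order(text, pairs):
    """Apply each (target, replacement) pair in turn, in the given order."""
    for target, replacement in pairs:
        text = text.replace(target, replacement)
    return text

s = "fooxxxbazyyyquuux"

# Same two substitutions, two different user-chosen orders.
fooxx_first = apply_in_order(s, [("fooxx", "baz"), ("oxxx", "bar")])
oxxx_first = apply_in_order(s, [("oxxx", "bar"), ("fooxx", "baz")])

print(fooxx_first)  # bazxbazyyyquuux
print(oxxx_first)   # fobarbazyyyquuux
```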

Iain

There is no way you can automate a user's choice. If he wants the second choice (oxxx->bar) he would have to add a third pattern: fooxxx -> fobar. In general, the rules 'upstream over downstream' and 'long over short' make sense in practically all cases. With all but simple substitution runs whose functionality is obvious, the result needs to be checked for unintended hits. To use an example from my SE manual which runs a (whimsical) text through a set of substitutions concentrating overlapping targets:

>>> substitutions = [['be', 'BE'], ['being', 'BEING'], ['been', 'BEEN'],
...                  ['bee', 'BEE'], ['belong', 'BELONG'], ['long', 'LONG'],
...                  ['longer', 'LONGER']]
>>> T = Translator (substitutions)  # Code further up in this thread, handling precedence by the two rules mentioned
>>> text = "There was a bee named Mabel belonging to hive nine longing to be a beetle and thinking that being a bee was okay, but she had been a bee long enough and wouldn't be one much longer."

>>> print T (text)

There was a BEE named MaBEl BELONGing to hive nine LONGing to BE a BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE LONG enough and wouldn't BE one much LONGER.

All word-length substitutions resolve correctly. There are four unintended translations, though: MaBEl, BELONGing, LONGing and BEEtle. Adding the substitution Mabel->Mabel would prevent the first miss. The others could be taken care of similarly by replacing the target with itself. With large substitution sets and extensive data, this amounts to an iterative process of running, checking and fixing, many times over. That just isn't practical and may have to be abandoned when the substitutions catalog grows out of reasonable bounds. Dependable are runs where the targets are predictably singular, such as long id numbers that cannot possibly match anything but id numbers.
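For readers without the earlier posts at hand, here is a minimal stand-in for the two rules, written with the re module. It is not the Translator class from further up the thread; make_translator is a hypothetical name, and the sketch only assumes that 'upstream over downstream' corresponds to the regex engine's left-to-right scan and 'long over short' to sorting the targets longest first. It also shows the Mabel->Mabel guard.

```python
import re

def make_translator(substitutions):
    """Return a function that applies all substitutions in one pass."""
    mapping = dict(substitutions)
    # Long over short: at any position, longer targets are tried first.
    targets = sorted(mapping, key=len, reverse=True)
    # Upstream over downstream: the engine scans left to right, so the
    # earliest match in the text wins over any later, overlapping one.
    rx = re.compile("|".join(re.escape(t) for t in targets))
    return lambda text: rx.sub(lambda m: mapping[m.group()], text)

substitutions = [['be', 'BE'], ['being', 'BEING'], ['been', 'BEEN'],
                 ['bee', 'BEE'], ['belong', 'BELONG'], ['long', 'LONG'],
                 ['longer', 'LONGER']]
text = ("There was a bee named Mabel belonging to hive nine longing "
        "to be a beetle and thinking that being a bee was okay, but she "
        "had been a bee long enough and wouldn't be one much longer.")

result = make_translator(substitutions)(text)

# Guard against the unintended hit by replacing the target with itself.
guarded = make_translator(substitutions + [['Mabel', 'Mabel']])
guarded_result = guarded(text)

print(result)
print(guarded_result)
```

The guard works because 'Mabel' starts two characters upstream of the 'be' inside it, so it is matched first and consumed whole. The other three misses would each need their own guard entry, which is the iterative fixing described above.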


Frederic





--
http://mail.python.org/mailman/listinfo/python-list
