> From: Steven D'Aprano <st...@remove-this-cybersource.com.au>
> To: python-l...@python.org
> Date: 18 Jan 2010 16:26:48 GMT
> Subject: Re: substitution
>
> On Mon, 18 Jan 2010 06:23:44 -0800, Iain King wrote:
>
>> On Jan 18, 2:17 pm, Adi Eyal <a...@digitaltrowel.com> wrote:
> [...]
>>> Using regular expressions the answer is short (and sweet)
>>>
>>> mapping = {
>>>     "foo" : "bar",
>>>     "baz" : "quux",
>>>     "quuux" : "foo"
>>> }
>>>
>>> pattern = "(%s)" % "|".join(mapping.keys())
>>> repl = lambda x : mapping.get(x.group(1), x.group(1))
>>> s = "fooxxxbazyyyquuux"
>>> re.subn(pattern, repl, s)
>>
>> Winner! :)
>
> What are the rules for being declared "Winner"?  For the simple case
> given, calling s.replace three times is much faster: more than twice as
> fast.
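Steven's chained str.replace code isn't shown in the quote above, but
presumably it looks something like the sketch below (the loop and variable
names are my guess, not his posted code):

mapping = {
    "foo" : "bar",
    "baz" : "quux",
    "quuux" : "foo"
}

s = "fooxxxbazyyyquuux"
result = s
for old, new in mapping.items():
    # apply each replacement in turn on the running result
    result = result.replace(old, new)
print(result)

Note that the output depends on the order in which the keys are visited and
on whether an earlier replacement happens to produce a later key - which is
exactly the weakness discussed below.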
:) - The regexp solution above is short and easy to read and digest.  It is
also easy to maintain - to handle more strings you simply add entries to the
mapping dict.

Calling replace three times in a row doesn't work correctly when one
replacement value is itself a key, e.g. with foo => bar and bar => baz:

"foo".replace("foo", "bar").replace("bar", "baz")

returns "baz" rather than "bar", because the second replace re-processes the
output of the first.

> But a bigger problem is that the above "winner" may not work correctly if
> there are conflicts between the target strings (e.g. 'a'->'X',
> 'aa'->'Y'). The problem is that the result you get depends on the order
> of the searches, BUT as given, that order is non-deterministic.
> dict.keys() returns in an arbitrary order, which means the caller can't
> specify the order except by accident. For example:
>
>>>> repl = lambda x : m[x.group(1)]
>>>> m = {'aa': 'Y', 'a': 'X'}
>>>> pattern = "(%s)" % "|".join(m.keys())
>>>> subn(pattern, repl, 'aaa')  # expecting 'YX'
> ('XXX', 3)
>
> The result that you get using this method will be consistent but
> arbitrary and unpredictable.

That has a simple fix: sort the keys from longest to shortest before
building the pattern, so that longer alternatives are tried first.

import re

mapping = {
    "foo" : "bar",
    "baz" : "quux",
    "quuux" : "foo"
}

keys = [(len(key), key) for key in mapping.keys()]
keys.sort(reverse=True)
keys = [key for (_, key) in keys]
pattern = "(%s)" % "|".join(keys)
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"
re.subn(pattern, repl, s)

With regard to timing - I suspect that as the input string grows, the ratio
of the regexp solution's running time to the chained-replace solution's will
grow.  If the map grows, however, the regexp solution seems to be faster
(see below).  Python's regular expression engine is quite inefficient with
large alternations because it backtracks through the branches; the pattern
can be hand-optimised (for example by factoring out common prefixes) for a
much greater speed improvement.

from random import sample
from timeit import Timer

N = 1000
alphabet = "abcdefghijklmnopqrstuvwxyz"
make_token = lambda : "".join(sample(alphabet, 5))
mapping = dict([(make_token(), make_token()) for i in range(N)])

keys = [(len(key), key) for key in mapping.keys()]
keys.sort(reverse=True)
keys = [key for (_, key) in keys]
pattern = "(%s)" % "|".join(keys)

replace_strs = ["replace('%s', '%s')" % (key, mapping[key]) for key in keys]
command = "s." + ".".join(replace_strs)

setup = """
import re
from __main__ import mapping   # repl needs the mapping in timeit's namespace
regexp = re.compile("%s")
repl = lambda x : mapping[x.group(1)]
s = "fooxxxbazyyyquuux"
""" % pattern

t1 = Timer("regexp.subn(repl, s)", setup)
t2 = Timer(command, "s = 'fooxxxbazyyyquuux'")
res1 = sum(t1.repeat(number=1000))
res2 = sum(t2.repeat(number=1000))
print res1 / res2

On my machine this comes to

0.316323109737

i.e. with a 1000-entry map the regexp version takes roughly a third of the
time of the chained replaces.
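Finally, the fix can be wrapped into a small helper.  This is my own
illustration rather than anything posted in the thread; the name
multi_replace and the re.escape call are my additions:

import re

def multi_replace(mapping, text):
    # Longest keys first, so 'aa' is tried before its prefix 'a'.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape guards against keys that contain regex metacharacters.
    pattern = "(%s)" % "|".join(re.escape(key) for key in keys)
    repl = lambda match: mapping[match.group(1)]
    return re.subn(pattern, repl, text)

# Steven's conflicting-keys example now gives the expected answer:
print(multi_replace({'aa': 'Y', 'a': 'X'}, 'aaa'))    # ('YX', 2)

The only behavioural difference from the sorted-keys fix above is the
re.escape call; without it, keys containing characters such as '.' or '+'
would be interpreted as regex syntax.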