superpollo wrote:
hi.
what is the most pythonic way to substitute substrings?
eg: i want to apply:
foo --> bar
baz --> quux
quuux --> foo
so that:
fooxxxbazyyyquuux --> barxxxquuxyyyfoo
bye
Third attempt. Clearly something isn't working right: my code gets
clipped on the way up, so I have to send it as an attachment. Here again
is what it does:
>>> substitutions = (('foo', 'bar'), ('baz', 'quux'), ('quuux', 'foo'))  # Sequence of doublets
>>> T = Translator (substitutions) # Compile substitutions -> translator
>>> s = 'fooxxxbazyyyquuux' # Your source string
>>> d = 'barxxxquuxyyyfoo' # Your destination string
>>> print T (s)
barxxxquuxyyyfoo
>>> print T (s) == d
True
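
A note on why a single pass matters here: chaining str.replace calls applies each rule to the output of the previous one, so the result depends on the order of the rules. The following is only a minimal sketch of that pitfall; the helper name chained_replace is hypothetical and not part of the attached code.

```python
def chained_replace(s, pairs):
    """Naive approach: apply each (target, substitute) pair in turn."""
    for target, substitute in pairs:
        s = s.replace(target, substitute)
    return s

s = 'fooxxxbazyyyquuux'

# In this order the result happens to be the wanted one ...
in_order = chained_replace(
    s, [('foo', 'bar'), ('baz', 'quux'), ('quuux', 'foo')])

# ... but with 'quuux' -> 'foo' applied first, the freshly inserted 'foo'
# is itself replaced by 'bar' in the next step.
reordered = chained_replace(
    s, [('quuux', 'foo'), ('foo', 'bar'), ('baz', 'quux')])
```

Here in_order is 'barxxxquuxyyyfoo', while reordered comes out as 'barxxxquuxyyybar'. A translator that scans the source once never looks at its own output, so it does not have this problem.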
Code attached
Regards
Frederic
class Translator:
    r"""
    Will translate any number of targets, handling them correctly
    if some overlap.

    Making Translator
        T = Translator (definitions, [eat = 1])
        'definitions' is a sequence of pairs: ((target, substitute), (t2, s2), ...)
        'eat = True' will make an extraction filter that lets only the
        replaced targets pass.
        Definitions example:
            (('a','A'),('b','B'),('ab','ab'),('abc','xyz'),
             ('\x0c', 'page break'), ('\r\n','\n'), (' ','\t'))
            # ('ab','ab') see Tricks.
        Order doesn't matter.

    Running
        translation = T (source)

    Tricks
        Deletion:  ('target', '')
        Exception: (('\n',''), ('\n\n','\n\n'))        # Eat LF except paragraph breaks.
        Exception: (('\n', '\r\n'), ('\r\n','\r\n'))   # Unix to DOS, would leave DOS unchanged
        Translation cascade:
            # Unwrap paragraphs, Unix or DOS, restoring inter-word space if missing.
            Mark_LF = Translator ((('\n','+LF+'), ('\r\n','+LF+'),
                                   ('\n\n','\n\n'), ('\r\n\r\n','\r\n\r\n')))
            # Pick any positively identifiable mark for end of lines in either Unix or MS-DOS.
            Single_Space_Mark = Translator (((' +LF+', ' '), ('+LF+', ' '), ('-+LF+', '')))
            no_lf_text = Single_Space_Mark (Mark_LF (text))
        Translation cascade:
            # Nested calls
            reptiles = T_latin_english (T_german_latin (reptilien))

    Limitations
        1. The number of substitutions and the maximum size of input
           depend on the respective capabilities of the Python re module.
        2. Regular expressions will not work as such: targets are
           escaped and matched literally.

    Author:
        Frederic Rentsch (i...@anthra-norell.ch).
    """
    def __init__ (self, definitions, eat = 0):
        '''
        definitions: a sequence of pairs of strings. ((target, substitute), (t, s), ...)
        eat: False (0) means translate: unaffected data passes unaltered.
             True (1) means extract: unaffected data doesn't pass (gets eaten).
             Extraction filters typically require substitutes to end with some
             separator, else they fuse together. (E.g. ' ', '\t' or '\n')
        'eat' is an attribute that can be switched anytime.
        '''
        self.eat = eat
        self.compile_sequence_of_pairs (definitions)
    def compile_sequence_of_pairs (self, definitions):
        '''
        Argument 'definitions' is a sequence of pairs:
        (('target 1', 'substitute 1'), ('t2', 's2'), ...)
        Order doesn't matter.
        '''
        import re
        self.definitions = definitions
        targets, substitutes = zip (*definitions)
        re_targets = [re.escape (item) for item in targets]
        # Reverse order puts a longer target ahead of any shorter target
        # that is its prefix, so the longer one is tried first.
        re_targets.sort (reverse = True)
        self.targets_set = set (targets)
        self.table = dict (definitions)
        regex_string = '|'.join (re_targets)
        self.regex = re.compile (regex_string, re.DOTALL)
    def __call__ (self, s):
        hits = self.regex.findall (s)
        nohits = self.regex.split (s)
        valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
        if valid_hits:
            substitutes = [self.table [item] for item in hits if item in valid_hits]
            if self.eat:
                return ''.join (substitutes)
            else:
                # Interleave the untouched stretches with the substitutes.
                # 'nohits' has one more item than 'hits'; it is appended last.
                pieces = []
                for nohit, substitute in zip (nohits, substitutes):
                    pieces.append (nohit)
                    pieces.append (substitute)
                return ''.join (pieces) + nohits [-1]
        else:
            if self.eat:
                return ''
            else:
                return s
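
For comparison, the same single-pass idea can be sketched more compactly with re.sub and a replacement function. This is not the attached class, only a minimal sketch under some assumptions: the name make_translator is hypothetical, targets are assumed to be non-empty strings, and the targets are ordered by length (longest first) rather than by the reverse sort used above.

```python
import re

def make_translator(definitions, eat=False):
    """Return a function applying all (target, substitute) pairs in one pass.

    Hypothetical helper for illustration; assumes non-empty target strings.
    """
    table = dict(definitions)
    # Longest targets first, so 'quuux' is tried before any shorter
    # target that is a prefix of it.
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(t) for t in targets))

    def translate(s):
        if eat:
            # Extraction filter: keep only the substitutes of the hits.
            return ''.join(table[hit] for hit in regex.findall(s))
        # Translation: replace every hit, leave everything else untouched.
        return regex.sub(lambda match: table[match.group(0)], s)

    return translate

pairs = (('foo', 'bar'), ('baz', 'quux'), ('quuux', 'foo'))
translated = make_translator(pairs)('fooxxxbazyyyquuux')
extracted = make_translator(pairs, eat=True)('fooxxxbazyyyquuux')

# The "exception" trick from the docstring: '\n\n' maps to itself, so
# paragraph breaks survive while single line feeds are deleted.
eat_lf = make_translator((('\n', ''), ('\n\n', '\n\n')))
unwrapped = eat_lf('one\ntwo\n\nthree')
```

With these inputs, translated is 'barxxxquuxyyyfoo', extracted is 'barquuxfoo', and unwrapped is 'onetwo\n\nthree'. Because re.sub scans the source once and never rescans its own output, the 'foo' produced from 'quuux' is not turned into 'bar'.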