Hi Friedrich,

On 03/09/15 16:40, Friedrich Rentsch wrote:
>
> On 09/02/2015 04:03 AM, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>     dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>     dataIn = dataIn.replace('<','<') # Tidy up < character
>>     dataIn = dataIn.replace('>','>') # Tidy up < character
>>     dataIn = dataIn.replace('o','o') # No idea why but lots of these: convert to 'o' character
>>     dataIn = dataIn.replace('f','f') # .. and these: convert to 'f' character
>>     dataIn = dataIn.replace('e','e') # .. 'e'
>>     dataIn = dataIn.replace('O','O') # .. 'O'
>>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>     with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>         for line in mapFile:
>>             line = line.strip()
>>             try:
>>                 if (line) and not line.startswith('#'):
>>                     line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                     name, value = line.split('=')
>>                     name = name.strip()
>>                     self.filterMap[name]=value.strip()
>>             except:
>>                 self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                 raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>     def filter(self, dataIn):
>>         if dataIn:
>>             for token, replacement in self.filterMap.items():
>>                 dataIn = dataIn.replace(token, replacement)
>>         return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>>     \r = \\n
>>     â = "
>>     < = <
>>     > = >
>>     ' = '
>>     F = F
>>     o = o
>>     f = f
>>     e = e
>>     O = O
>>
>> This all works "as advertised" */except/* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
>> TIA,
>>
>>
>
> I have had this problem too and can propose a solution ready to run
> out of my toolbox:
>
>
> class editor:
>
>     def compile (self, replacements):
>         targets, substitutes = zip (*replacements)
>         re_targets = [re.escape (item) for item in targets]
>         re_targets.sort (reverse = True)
>         self.targets_set = set (targets)
>         self.table = dict (replacements)
>         regex_string = '|'.join (re_targets)
>         self.regex = re.compile (regex_string, re.DOTALL)
>
>     def edit (self, text, eat = False):
>         hits = self.regex.findall (text)
>         nohits = self.regex.split (text)
>         valid_hits = set (hits) & self.targets_set # Ignore targets with illegal re modifiers.
>         if valid_hits:
>             substitutes = [self.table [item] for item in hits if item in valid_hits] + [] # Make lengths equal for zip to work right
>             if eat:
>                 output = ''.join (substitutes)
>             else:
>                 zipped = zip (nohits, substitutes)
>                 output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>         else:
>             if eat:
>                 output = ''
>             else:
>                 output = input
>         return output
>
> >>> substitutions = (
>     ('\r', '\n'),
>     ('<', '<'),
>     ('>', '>'),
>     ('o', 'o'),
>     ('f', 'f'),
>     ('e', 'e'),
>     ('O', 'O'),
>     )
>
> Order doesn't matter. Add new ones at the end.
>
> >>> e = editor ()
> >>> e.compile (substitutions)
>
> A simple way of testing is running the substitutions through the editor
>
> >>> print e.edit (repr (substitutions))
> (('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e',
> 'e'), ('O', 'O'))
>
> The escapes need to be tested separately
>
> >>> print e.edit ('abc\rdef')
> abc
> def
>
> Note: This editor's compiler compiles the substitution list to a
> regular expression which the editor uses to find all matches in the
> text passed to edit. There has got to be a limit to the size of a text
> which a regular expression can handle. I don't know what this limit
> is. To be on the safe side, edit a large text line by line or at least
> in sensible chunks.
>
> Frederic
>
Thanks for the suggestion. I had originally written a simple set of
hard-coded str.replace() calls, which worked fine and are fast enough
that I don't have to delve into the complexity and obscurity of regex.
I had also contemplated simply declaring my replacement dict in its own
.py file and then importing it. I ended up stubbornly pursuing the idea
of loading everything from a text file just because I didn't understand
why it wasn't working.

Cheers,

-- 
Rob Hills
Waikiki, Western Australia
-- 
https://mail.python.org/mailman/listinfo/python-list
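
For anyone who lands on this thread with the same text-file problem: a line such as "\r = \\n" read from a file holds a literal backslash followed by "r", not a carriage return, because escape sequences are only interpreted in Python source code. The following is a minimal sketch, not the code from this thread, of one way to translate a small set of escapes at load time. The names unescape, parse_mapping and apply_replacements are hypothetical, and the supported escapes (\\, \n, \r, \t) are an assumption about what the mapping file needs. It keeps the pairs in a list so that replacements run in file order.

```python
import re

# Assumption: only these four escapes are meaningful in the mapping file.
_ESCAPES = {'\\\\': '\\', '\\n': '\n', '\\r': '\r', '\\t': '\t'}
_ESCAPE_RE = re.compile(r'\\[\\nrt]')


def unescape(text):
    """Turn the two-character sequences \\\\, \\n, \\r, \\t into real characters."""
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0)], text)


def parse_mapping(lines):
    """Parse 'token = replacement' lines into a list of (token, replacement)."""
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and full-line comments
        # Trim a trailing comment, as the original loader does; note that
        # this also truncates any token that itself contains '#'.
        line = line.split('#', 1)[0]
        name, sep, value = line.partition('=')
        name = name.strip()
        if not sep or not name:
            raise ValueError('cannot parse mapping line: {0!r}'.format(raw))
        pairs.append((unescape(name), unescape(value.strip())))
    return pairs


def apply_replacements(text, pairs):
    """Apply each replacement in order using plain str.replace()."""
    for token, replacement in pairs:
        text = text.replace(token, replacement)
    return text


if __name__ == '__main__':
    sample = ['# mapping file', '\\r = \\\\n', 'abc = x  # trailing comment']
    pairs = parse_mapping(sample)
    print(repr(apply_replacements('one\rtwo abc', pairs)))
```

With a real file, parse_mapping can be given the open file object directly, since it only iterates over lines. A list of pairs is used instead of a dict because dict iteration order is arbitrary in Python 3.4, and replacement order can matter when one token is a substring of another.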