MRAB wrote:

> On 2015-09-02 03:03, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>     dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>     dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>>     dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>>     dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
>>     dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
>>     dataIn = dataIn.replace('&#101;','e') # .. 'e'
>>     dataIn = dataIn.replace('&#079;','O') # .. 'O'
>>
> The problem with this approach is that the order of the replacements
> matters. For example, changing '&lt;' to '<' and then '&amp;' to '&'
> can give a different result to changing '&amp;' to '&' and then '&lt;'
> to '<'. If you started with the string '&amp;lt;', then the first order
> would go '&amp;lt;' => '&amp;lt;' => '&lt;', whereas the second order
> would go '&amp;lt;' => '&lt;' => '<'.
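
To see the ordering effect in isolation, here is a small sketch. The helper apply_in_order() and the two lists are names made up for this illustration; they are not part of the original code.

```python
def apply_in_order(text, replacements):
    """Apply (old, new) str.replace() pairs one after another."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text

# The same two replacements, in the two possible orders.
lt_then_amp = [('&lt;', '<'), ('&amp;', '&')]
amp_then_lt = [('&amp;', '&'), ('&lt;', '<')]

print(apply_in_order('&amp;lt;', lt_then_amp))  # &lt;
print(apply_in_order('&amp;lt;', amp_then_lt))  # <
```

Note that a plain dict in Python 3.4 has no guaranteed iteration order, so replacements kept in a dict may be applied in either order. A list of pairs (or a collections.OrderedDict) keeps the order you wrote them in.
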
>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>     with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>         for line in mapFile:
>>             line = line.strip()
>>             try:
>>                 if (line) and not line.startswith('#'):
>>                     line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                     name, value = line.split('=')
>>                     name = name.strip()
>>                     self.filterMap[name]=value.strip()
>>             except:
>>                 self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                 raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>     def filter(self, dataIn):
>>         if dataIn:
>>             for token, replacement in self.filterMap.items():
>>                 dataIn = dataIn.replace(token, replacement)
>>         return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> â = "
>> &lt; = <
>> &gt; = >
>> &apos; = '
>> &#070; = F
>> &#111; = o
>> &#102; = f
>> &#101; = e
>> &#079; = O
>>
>> This all works "as advertised" *except* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>   * change the way I represent '\r' and '\\n' in my mapping file; or
>>   * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
> Try ast.literal_eval, although you'd need to make it look like a string
> literal first:
>
> >>> import ast
> >>> line = r'\r = \\n'
> >>> print(line)
> \r = \\n
> >>> old, sep, new = line.partition(' = ')
> >>> print(old)
> \r
> >>> print(new)
> \\n
> >>> ast.literal_eval('"%s"' % old)
> '\r'
> >>> ast.literal_eval('"%s"' % new)
> '\\n'
> >>>
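
Applied to the lines of a mapping file, that suggestion might look roughly like this. It is only a sketch: parse_mapping_line() is a made-up name, it assumes every entry uses ' = ' as the separator, and it does not try to strip trailing comments, because '#' also occurs inside tokens such as '&#070;'.

```python
import ast

def parse_mapping_line(line):
    """Return (token, replacement) for one mapping line, or None to skip it."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None  # blank line or whole-line comment
    old, sep, new = line.partition(' = ')
    if not sep:
        raise ValueError('no " = " separator in %r' % line)

    def decode(text):
        text = text.strip()
        if '\\' not in text:
            # Nothing to unescape; this also keeps a bare " value working,
            # which would not survive being wrapped in double quotes.
            return text
        # Make it look like a string literal, then let Python decode it.
        return ast.literal_eval('"%s"' % text)

    return decode(old), decode(new)

print(parse_mapping_line(r'\r = \\n'))  # ('\r', '\\n')
print(parse_mapping_line('&lt; = <'))   # ('&lt;', '<')
```

An entry that contains both a backslash and a double quote would still fail in literal_eval(), so treat this as a starting point only.
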
There's also codecs.decode():

>>> import codecs
>>> codecs.decode(r"\r = \\n", "unicode-escape")
'\r = \\n'

> I wouldn't put the &#...; forms into the mappings file (except for the
> &apos; one) because they can all be recognised and done in code
> ('&#070;' is chr(int('070')), for example).

Or

>>> import html
>>> html.unescape("&lt; &ouml; &#070;")
'< ö F'

Even if you cannot use unescape() directly you might steal the
implementation.

-- 
https://mail.python.org/mailman/listinfo/python-list