Re: Reading \n unescaped from a file

Friedrich Rentsch Thu, 03 Sep 2015 01:44:54 -0700


On 09/02/2015 04:03 AM, Rob Hills wrote:

Hi,

I am developing code (Python 3.4) that transforms text data from one
format to another.

As part of the process, I had a set of hard-coded str.replace(...)
functions that I used to clean up the incoming text into the desired
output format, something like this:

     dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
     dataIn = dataIn.replace('&lt;','<') # Tidy up < character
     dataIn = dataIn.replace('&gt;','>') # Tidy up > character
     dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
     dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
     dataIn = dataIn.replace('&#101;','e') # ..  'e'
     dataIn = dataIn.replace('&#079;','O') # ..  'O'

These statements transform my data correctly, but the list of statements
grows as I test the data so I thought it made sense to store the
replacement mappings in a file, read them into a dict and loop through
that to do the cleaning up, like this:

         with open(fileName, 'r+t', encoding='utf-8') as mapFile:
             for line in mapFile:
                 line = line.strip()
                 try:
                     if (line) and not line.startswith('#'):
                         line = line.split('#')[:1][0].strip() # trim any trailing comments
                         name, value = line.split('=')
                         name = name.strip()
                         self.filterMap[name]=value.strip()
                 except:
                     self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
                     raise

Elsewhere, I use the following code to do the actual cleaning up:

     def filter(self, dataIn):
         if dataIn:
             for token, replacement in self.filterMap.items():
                 dataIn = dataIn.replace(token, replacement)
         return dataIn


My mapping file contents look like this:

\r = \\n
â = &quot;
&lt; = <
&gt; = >
&#039; = &apos;
&#070; = F
&#111; = o
&#102; = f
&#101; = e
&#079; = O

This all works "as advertised" except for the '\r' => '\\n'
replacement. Debugging the code, I see that my '\r' character is
"escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
the file.

I've been googling hard and reading the Python docs, trying to get my
head around character encoding, but I just can't figure out how to get
these bits of code to do what I want.

It seems to me that I need to either:

   * change the way I represent '\r' and '\\n' in my mapping file; or
   * transform them somehow when I read them in

However, I haven't figured out how to do either of these.

TIA,

I have had this problem too and can propose a solution ready to run out of my toolbox:



import re
from functools import reduce   # built in on Python 2, needs this import on Python 3

class editor:

    def compile (self, replacements):
        targets, substitutes = zip (*replacements)
        re_targets = [re.escape (item) for item in targets]
        re_targets.sort (reverse = True)   # A longer target sorts ahead of any target that is its prefix
        self.targets_set = set (targets)
        self.table = dict (replacements)
        regex_string = '|'.join (re_targets)
        self.regex = re.compile (regex_string, re.DOTALL)

    def edit (self, text, eat = False):
        hits = self.regex.findall (text)
        nohits = self.regex.split (text)
        valid_hits = set (hits) & self.targets_set   # Ignore targets with illegal re modifiers.
        if valid_hits:
            substitutes = [self.table [item] for item in hits if item in valid_hits]
            if eat:
                output = ''.join (substitutes)
            else:
                # nohits has one more item than substitutes: zip pairs them up, the last nohit is appended
                zipped = zip (nohits, substitutes)
                output = ''.join (list (reduce (lambda a, b: a + b, zipped))) + nohits [-1]
        else:
            if eat:
                output = ''
            else:
                output = text   # nothing matched: return the text unchanged
        return output

>>> substitutions = (
    ('\r', '\n'),
    ('&lt;', '<'),
    ('&gt;', '>'),
    ('&#111;', 'o'),
    ('&#102;', 'f'),
    ('&#101;', 'e'),
    ('&#079;', 'O'),
    )

Order doesn't matter. Add new ones at the end.

>>> e = editor ()
>>> e.compile (substitutions)

A simple way of testing is running the substitutions through the editor

>>> print (e.edit (repr (substitutions)))

(('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e', 'e'), ('O', 'O'))


The escapes need to be tested separately

>>> print (e.edit ('abc\rdef'))
abc
def
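
As for getting '\r' and '\\n' out of a mapping file: text read from a file is never run through Python's string-literal escape processing, so a backslash followed by r arrives as exactly those two characters. Here is a minimal sketch of translating them on the way in, assuming a small fixed set of escapes. The names ESCAPES, unescape and parse_mapping are made up for illustration and are not part of the editor class above. Unlike the original loader, it does not strip trailing '#' comments, because targets such as '&#111;' contain a '#' themselves.

```python
import re

# Escape sequences recognised in the mapping file (an assumed, minimal set).
ESCAPES = {'r': '\r', 'n': '\n', 't': '\t', '\\': '\\'}

def unescape(s):
    """Turn backslash sequences typed in a text file into the characters they name."""
    # Unknown sequences are left exactly as written.
    return re.sub(r'\\(.)', lambda m: ESCAPES.get(m.group(1), m.group(0)), s)

def parse_mapping(lines):
    """Return a list of (target, replacement) pairs from 'target = replacement' lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):   # skip blank lines and whole-line comments
            continue
        name, value = line.split('=', 1)
        pairs.append((unescape(name.strip()), unescape(value.strip())))
    return pairs

# The raw string stands in for a line exactly as it would be read from the file.
pairs = parse_mapping([r'\r = \\n', '&lt; = <', '&#111; = o'])
```

The resulting list of pairs can be passed straight to e.compile ().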

Note: This editor's compiler compiles the substitution list to a regular expression which the editor uses to find all matches in the text passed to edit. There has got to be a limit to the size of a text which a regular expression can handle. I don't know what this limit is. To be on the safe side, edit a large text line by line or at least in sensible chunks.
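
One way to follow that advice, sketched with re.sub and a callback rather than the findall/split approach of the editor class. The names make_replacer and edit_in_lines are hypothetical. The sketch assumes that no target spans a line break, since a match straddling two chunks would be missed.

```python
import re

def make_replacer(pairs):
    """Build a function that applies all replacements in a single pass."""
    table = dict(pairs)
    # Longest targets first, so a longer target wins over a shorter one it starts with.
    targets = sorted(table, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(t) for t in targets))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)

def edit_in_lines(replace, text):
    """Apply replace() to one line at a time, keeping the line endings."""
    # splitlines(True) keeps '\r', '\n' and '\r\n' attached to their lines,
    # so a '\r' target is still seen by the replacer.
    return ''.join(replace(line) for line in text.splitlines(True))

replace = make_replacer([('\r', '\n'), ('&lt;', '<'), ('&gt;', '>')])
result = edit_in_lines(replace, 'a &lt; b\rc &gt; d\n')
```

When reading from a file, the same replace function can be applied to each line as it is read, so the whole text never has to be held in one string.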


Frederic

--
https://mail.python.org/mailman/listinfo/python-list
