On 09/03/2015 06:12 PM, Rob Hills wrote:
Hi Friedrich,

On 03/09/15 16:40, Friedrich Rentsch wrote:
On 09/02/2015 04:03 AM, Rob Hills wrote:
Hi,

I am developing code (Python 3.4) that transforms text data from one
format to another.

As part of the process, I had a set of hard-coded str.replace(...)
functions that I used to clean up the incoming text into the desired
output format, something like this:

      dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
      dataIn = dataIn.replace('&lt;','<') # Tidy up < character
      dataIn = dataIn.replace('&gt;','>') # Tidy up > character
      dataIn = dataIn.replace('&#111;','o') # No idea why but lots of these: convert to 'o' character
      dataIn = dataIn.replace('&#102;','f') # .. and these: convert to 'f' character
      dataIn = dataIn.replace('&#101;','e') # ..  'e'
      dataIn = dataIn.replace('&#079;','O') # ..  'O'

These statements transform my data correctly, but the list of statements
grows as I test the data so I thought it made sense to store the
replacement mappings in a file, read them into a dict and loop through
that to do the cleaning up, like this:

          with open(fileName, 'r+t', encoding='utf-8') as mapFile:
              for line in mapFile:
                  line = line.strip()
                  try:
                      if (line) and not line.startswith('#'):
                          line = line.split('#')[:1][0].strip() # trim any trailing comments
                          name, value = line.split('=')
                          name = name.strip()
                          self.filterMap[name]=value.strip()
                  except:
                      self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
                      raise

Elsewhere, I use the following code to do the actual cleaning up:

      def filter(self, dataIn):
          if dataIn:
              for token, replacement in self.filterMap.items():
                  dataIn = dataIn.replace(token, replacement)
          return dataIn


My mapping file contents look like this:

\r = \\n
“ = &quot;
&lt; = <
&gt; = >
&#039; = &apos;
&#070; = F
&#111; = o
&#102; = f
&#101; = e
&#079; = O

This all works "as advertised" *except* for the '\r' => '\\n'
replacement. Debugging the code, I see that my '\r' character is
"escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
the file.

I've been googling hard and reading the Python docs, trying to get my
head around character encoding, but I just can't figure out how to get
these bits of code to do what I want.

It seems to me that I need to either:

    * change the way I represent '\r' and '\\n' in my mapping file; or
    * transform them somehow when I read them in

However, I haven't figured out how to do either of these.
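
A minimal sketch of the second option: text read from a file is never processed for escape sequences the way a Python string literal is, so the two characters backslash and r stay two characters until something decodes them. The helper names unescape and parse_mapping below are hypothetical, and the latin-1/unicode_escape round trip is only one assumed way of doing the decoding, not something taken from the code above. The sketch also leaves out the trailing-comment stripping, since several targets (such as &#111;) themselves contain '#'.

```python
def unescape(text):
    # Hypothetical helper: interpret backslash escapes in text read from a file.
    # Non-Latin-1 characters are first turned into backslash-u escapes so that
    # they survive the round trip through the unicode_escape codec.
    return text.encode('latin-1', 'backslashreplace').decode('unicode_escape')


def parse_mapping(lines):
    # Hypothetical helper: build a {target: replacement} dict from "a = b" lines.
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and whole-line comments
        name, value = line.split('=', 1)
        mapping[unescape(name.strip())] = unescape(value.strip())
    return mapping


# Raw strings stand in for lines exactly as they would appear in the file.
sample = [r'\r = \\n', '&lt; = <', '\u201c = &quot;']
mapping = parse_mapping(sample)
print(mapping)
```

With this, the key read from the file becomes a real carriage return and the value becomes a backslash followed by n, which is what the hard-coded replace('\r', '\\n') used.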

TIA,


I have had this problem too and can propose a solution ready to run
out of my toolbox:


import re

class editor:

     def compile (self, replacements):
         targets, substitutes = zip (*replacements)
         re_targets = [re.escape (item) for item in targets]
         re_targets.sort (reverse = True)
         self.targets_set = set (targets)
         self.table = dict (replacements)
         regex_string = '|'.join (re_targets)
         self.regex = re.compile (regex_string, re.DOTALL)

     def edit (self, text, eat = False):
         hits = self.regex.findall (text)
         nohits = self.regex.split (text)
         valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
         if valid_hits:
             substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right
             if eat:
                 output = ''.join (substitutes)
             else:
                 zipped = zip (nohits, substitutes)
                 output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
         else:
             if eat:
                 output = ''
             else:
                 output = input
         return output

substitutions = (
     ('\r', '\n'),
     ('&lt;', '<'),
     ('&gt;', '>'),
     ('&#111;', 'o'),
     ('&#102;', 'f'),
     ('&#101;', 'e'),
     ('&#079;', 'O'),
     )

Order doesn't matter. Add new ones at the end.

e = editor ()
e.compile (substitutions)
A simple way of testing is running the substitutions through the editor

print e.edit (repr (substitutions))
(('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e', 'e'), ('O', 'O'))

The escapes need to be tested separately

print e.edit ('abc\rdef')
abc
def

Note: This editor's compiler compiles the substitution list to a
regular expression which the editor uses to find all matches in the
text passed to edit. There has got to be a limit to the size of a text
which a regular expression can handle. I don't know what this limit
is. To be on the safe side, edit a large text line by line or at least
in sensible chunks.

Frederic

Thanks for the suggestion.  I had originally done a simple set of
hard-coded str.replace() functions which worked fine and are fast enough
for me not to have to delve into the complexity and obscurity of regex.

I had also contemplated simply declaring my replacement dict in its own
.py file and then importing it.

I ended up stubbornly pursuing the idea of loading everything from a
text file just because I didn't understand why it wasn't working.

Cheers,

I'm sure you can do it with replace, but once your replacements add up
into the dozens it gets awkward, because each substitution is a separate
pass over the whole text. The real problem with this approach, though,
is that you are responsible for doing the replacements in the right
sequence. It has been pointed out that order matters. With overlapping
targets, what you want is for upstream to take precedence over
downstream and longer over shorter. My suggestion automates this, just
as it automates the construction of the regex.
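
To illustrate the ordering point, here is a small sketch comparing chained replace calls with a single regex pass. It uses re.sub with a callback rather than the findall/split approach of the editor class, and the function names and the '&amp;' sample pair are made up for the demonstration.

```python
import re


def replace_sequential(text, pairs):
    # Each replace sees the output of the previous one, so order matters.
    for target, substitute in pairs:
        text = text.replace(target, substitute)
    return text


def replace_single_pass(text, pairs):
    table = dict(pairs)
    # Longer targets first, so a long target wins over its own prefix.
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(t) for t in targets))
    # Every match is looked up once; replaced text is never rescanned.
    return regex.sub(lambda match: table[match.group(0)], text)


pairs = [('&amp;', '&'), ('&lt;', '<')]
text = 'a &amp;lt; b'
print(replace_sequential(text, pairs))   # a < b
print(replace_single_pass(text, pairs))  # a &lt; b
```

The chained version turns '&amp;lt;' into '<' because the second replace acts on the result of the first; the single pass leaves '&lt;' in place.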

Peter Otten found one rough spot and one mistake:

substitutes = [self.table [item] for item in hits if item in valid_hits] + []

Adding an empty list is totally useless and can be omitted.

   output = input

The second-to-last line should be: output = text
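
For anyone on Python 3, here is a sketch of the class with both corrections applied. The capitalised class name and the generator expression in place of reduce are my adaptations for this sketch; the logic is otherwise meant to be the same.

```python
import re


class Editor:

    def compile(self, replacements):
        targets, substitutes = zip(*replacements)
        re_targets = [re.escape(item) for item in targets]
        re_targets.sort(reverse=True)  # a longer target sorts before its own prefix
        self.targets_set = set(targets)
        self.table = dict(replacements)
        self.regex = re.compile('|'.join(re_targets), re.DOTALL)

    def edit(self, text, eat=False):
        hits = self.regex.findall(text)
        nohits = self.regex.split(text)  # always one more piece than hits
        valid_hits = set(hits) & self.targets_set
        if valid_hits:
            substitutes = [self.table[item] for item in hits if item in valid_hits]
            if eat:
                output = ''.join(substitutes)
            else:
                # Interleave the unmatched pieces with the substitutes.
                output = ''.join(a + b for a, b in zip(nohits, substitutes)) + nohits[-1]
        else:
            output = '' if eat else text  # was: output = input
        return output


substitutions = (
    ('\r', '\n'),
    ('&lt;', '<'),
    ('&gt;', '>'),
    ('&#111;', 'o'),
)

e = Editor()
e.compile(substitutions)
print(e.edit('&lt;b&gt;'))
```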


Regards

Frederic


--
https://mail.python.org/mailman/listinfo/python-list
