On Sat, May 10, 2014 at 5:51 AM, <scottca...@gmail.com> wrote:
> But the replaces are not having any effect. Obviously a syntax
> problem....what silly thing am I doing wrong?
>
> Thanks!
>
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'‒','-',fStr)
>     re.sub(b'–','-',fStr)
>     re.sub(b'—','-',fStr)
>     re.sub(b'―','-',fStr)
>     re.sub(b'⸺','-',fStr)
>     re.sub(b'⸻','-',fStr)
>     re.sub(b'-','-',fStr)
>     re.sub(b'­','-',fStr)

I can see several things that might be wrong, but it's hard to say
what *is* wrong without trying it.

1) Is the file close enough to text that you can even do this sort of
parsing? You say it's an MS Word file; that, unfortunately, could mean
a lot of things. Some of the newer formats are basically zipped XML,
so translations like this won't work. Other forms of Word document may
be closer to text, but you seriously risk corrupting the binary
content.

2) How are characters represented? Are they actually stored in the
file with ampersands, hashes, etc? Your source strings are all seven
bytes long, and will look for exactly those bytes. There must be some
form of character encoding used; possibly, instead of the &#x
notation, you need to UTF-8 or UTF-16LE encode the characters to look
for.

3) You're doing simple string replacements using regular expressions.
I don't think any of your symbols here is a metacharacter, but I might
be wrong. If you're simply replacing one stream of bytes with another,
don't use regex at all, just use string replacement.

4) There's nothing in your current code to actually write the contents
anywhere. You do all the changes and then do nothing with it. Note too
that re.sub() returns the new value rather than modifying fStr in
place, so each result here is simply discarded. Or is this just part
of the code?

5) Similarly, there's nothing in this fragment that actually calls
processdoc(). Did you elide that? The fragment you wrote will do a
whole lot of nothing, on its own.

6) There's no file extension on your input file name; be sure you
really have the file you want, and not (for instance) a directory. Or
if you need to iterate over all the files in a directory, you'll need
to do that explicitly.

7) This one isn't technically a problem, but it's a risk. The string
'z:\Documentation\Software' has two backslash escapes, \D and \S,
which the parser fails to recognize, and therefore passes through
literally. So it works, currently. However, if you were to change the
path to, say, 'z:\Documentation\backups', then it would suddenly fail,
because \b is a recognized escape (backspace).
There are several solutions to this:

7a) fn = r'z:\Documentation\Software'
7b) fn = 'z:\\Documentation\\Software'
7c) fn = 'z:/Documentation/Software'

Hope that helps some, at least! A more complete program would be
easier to work with.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
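
P.S. Here is a rough sketch of what points 2 to 4 might look like in practice. It assumes the dashes are stored in the file as encoded text (UTF-8 by default), which may well not be true of a real Word document, so treat it as an illustration rather than a fix. The names DASHES and replace_dashes are made up for this example; processdoc keeps the name from the original post but gains a hypothetical encoding parameter.

```python
# Sketch only: assumes the dash characters appear in the file as plain
# encoded text. That may not hold for a real Word document (see point 1).

# The characters the original post was trying to match, written as
# Unicode escapes: U+2012..U+2015, U+2E3A, U+2E3B and the soft hyphen.
DASHES = '\u2012\u2013\u2014\u2015\u2e3a\u2e3b\u00ad'


def replace_dashes(data, encoding='utf-8'):
    """Return a copy of `data` (bytes) with each dash variant replaced
    by an ordinary hyphen, all encoded with `encoding`."""
    hyphen = '-'.encode(encoding)
    for ch in DASHES:
        # bytes.replace() returns a new object, so the result has to be
        # assigned back; calling it and ignoring the result does nothing.
        data = data.replace(ch.encode(encoding), hyphen)
    return data


def processdoc(fn, outfile, encoding='utf-8'):
    """Read `fn` as bytes, replace the dashes, write the result to `outfile`."""
    with open(fn, 'rb') as f:
        data = f.read()
    with open(outfile, 'wb') as f:
        f.write(replace_dashes(data, encoding))
```

You would then call something like processdoc(r'z:\Documentation\Software\in.txt', r'z:\Documentation\Software\out.txt'), where both file names are placeholders. One caveat: with a two-byte encoding such as 'utf-16-le', a plain byte-level replace can in principle match a pair of bytes that straddles two characters, so for genuine text files it is safer to decode, use str.replace(), and encode again.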