On 2014-05-09 12:51, scottca...@gmail.com wrote:
> here is a snippet of code that opens a file (fn contains the
> path\name) and first tried to replace all endash, emdash etc
> characters with simple dash characters, before doing a search. But
> the replaces are not having any effect. Obviously a syntax
> problem....what silly thing am I doing wrong?
>
> fn = 'z:\Documentation\Software'
> def processdoc(fn,outfile):
>     fStr = open(fn, 'rb').read()
>     re.sub(b'‒','-',fStr)
>     re.sub(b'–','-',fStr)
>     re.sub(b'—','-',fStr)
>     re.sub(b'―','-',fStr)
>     re.sub(b'⸺','-',fStr)
>     re.sub(b'⸻','-',fStr)
>     re.sub(b'-','-',fStr)
>     re.sub(b'­','-',fStr)

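As a side note on the quoted snippet itself: re.sub() returns a new object rather than changing fStr in place, so calls whose result is thrown away have no visible effect. With a bytes pattern the replacement must also be bytes (b'-', not '-'). Below is a minimal sketch of the replace step, assuming the data is plain UTF-8 text (which a Word file is not, as discussed below). The names DASHES and normalize_dashes are made up for illustration.

```python
import re

# Hypothetical list of dash-like characters, written as escapes so the
# source file stays ASCII; each is encoded to UTF-8 before matching.
DASHES = ["\u2012", "\u2013", "\u2014", "\u2015", "\u2e3a", "\u2e3b", "\u00ad"]

def normalize_dashes(data: bytes) -> bytes:
    """Replace each dash-like character in UTF-8 encoded bytes with b'-'."""
    for ch in DASHES:
        pattern = re.escape(ch.encode("utf-8"))
        # re.sub returns a new bytes object; assign it back to keep it.
        data = re.sub(pattern, b"-", data)
    return data

sample = "a\u2013b\u2014c".encode("utf-8")
result = normalize_dashes(sample)
```

For fixed strings like these, data.replace(old, new) would do the same job without the re module.
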
A Word doc (as your subject mentions) is a binary format. There's
the older .doc and the newer .docx (which is actually a .zip file
with a particular content structure, renamed to .docx). Your example
doesn't show the extension, so it's hard to tell whether you're
working with the old format or the new one.

That said, a simple replacement *certainly* won't work for a .docx
file: you'd have to uncompress the contents, open up the various
files inside, perform the replacements, then zip everything back up
and save the result back out.

For the older .doc file, it's a binary format, so even if you can
successfully find & swap out sequences of 7 chars for a single char,
it might screw up the internal offsets, breaking your file.
Additionally, I vaguely remember wrestling with .doc files that used
16-bit wide characters, so you might have to search for atrocious
things like

  b"\x00&\x00#\x00x\x002\x000\x001\x002"

(each character being prefixed with "\x00").

-tkc

-- 
https://mail.python.org/mailman/listinfo/python-list
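
For the .docx case, the uncompress / replace / re-zip round trip described above could be sketched roughly as follows with the standard zipfile module. This is only an illustration under assumptions: the function name replace_in_docx and the file names in the usage comment are hypothetical, the XML parts are assumed to be UTF-8 encoded, and every member ending in ".xml" is treated as text to be searched.

```python
import zipfile

def replace_in_docx(src, dst, replacements):
    """Copy the .docx at src to dst, applying text replacements to XML parts.

    replacements maps each string to find to the string that replaces it.
    """
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            # Assumption: XML parts are UTF-8; other members (images
            # and so on) are copied through untouched.
            if item.filename.endswith(".xml"):
                text = data.decode("utf-8")
                for old, new in replacements.items():
                    text = text.replace(old, new)
                data = text.encode("utf-8")
            zout.writestr(item.filename, data)

# Example call (hypothetical file names):
# replace_in_docx("in.docx", "out.docx", {"\u2013": "-", "\u2014": "-"})
```

Writing to a separate output file keeps the original intact if anything goes wrong. Single-character replacements like these are the easy case; longer phrases may be split across several XML elements inside the document and would need more care.
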