On Mon, 12 May 2014 10:35:53 -0700, scottcabit wrote:

> On Friday, May 9, 2014 8:12:57 PM UTC-4, Steven D'Aprano wrote:
>
>> Good:
>>
>>     fStr = re.sub(b'‒', b'-', fStr)
>
> Doesn't work...the document has been verified to contain endash and
> emdash characters, but this does NOT replace them.

You may have missed my follow-up post, where I said I had not noticed
you were operating on a binary .doc file. The text content of your doc
file might look like:

    This – is an n-dash.

when viewed in Microsoft Word, but that is not what is stored on disk.
Word .doc files are a proprietary binary format. Apart from the rest of
the document structure and metadata, the text itself could be stored any
old way, and I don't know offhand how. Microsoft has published a
specification for the format, but it is long and complicated. A few open
source projects like OpenOffice, LibreOffice and Abiword have
reverse-engineered the file format.

Taking a wild guess, I think it could be something like:

    This \xe2\x80\x93 is an n-dash.

or possibly:

    \x00T\x00h\x00i\x00s\x00  \x13\x00 \x00i\x00s\x00 \x00a
    \x00n\x00 \x00n\x00-\x00d\x00a\x00s\x00h\x00.

or:

    This {EN DASH} is an n-dash.

or:

    x\x9c\x0b\xc9\xc8,V\xa8v\xf5Spq\x0c\xf6\xa8U\x00r\x12
    \xf3\x14\xf2tS\x12\x8b3\xf4\x00\x82^\x08\xf8

(that last one is the text passed through the zlib compressor), but
really I'm just making up vaguely conceivable possibilities.

If you're not willing or able to use a full-blown doc parser, say by
controlling Word or LibreOffice, the alternative is to do something
quick and dirty that might work most of the time. Open a doc file, or
several doc files, in a hex editor and *hopefully* you will be able to
see chunks of human-readable text where you can identify how en-dashes
and similar characters are stored.

-- 
Steven D'Aprano
-- 
https://mail.python.org/mailman/listinfo/python-list
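
As a companion to the hex-editor suggestion above, here is a minimal
sketch that counts how often an en-dash or em-dash appears in a blob of
bytes under a few candidate encodings. The encodings listed are guesses
in the same spirit as the ones above, not a statement of how Word stores
text, and the names candidate_byte_patterns and count_patterns are made
up for illustration.

```python
# Sketch only: the candidate encodings are guesses, not the .doc format.
EN_DASH = "\u2013"
EM_DASH = "\u2014"

# Encodings to try (assumption: one of these might be in use).
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "cp1252")


def candidate_byte_patterns(char):
    """Return {encoding: bytes} for each encoding that can represent char."""
    patterns = {}
    for enc in CANDIDATE_ENCODINGS:
        try:
            patterns[enc] = char.encode(enc)
        except UnicodeEncodeError:
            # This encoding has no representation for the character.
            pass
    return patterns


def count_patterns(data, char):
    """Count occurrences of each candidate byte pattern in data.

    Short patterns can match by coincidence in binary data, so a
    non-zero count is a hint to inspect further, not proof.
    """
    return {
        enc: data.count(pattern)
        for enc, pattern in candidate_byte_patterns(char).items()
    }


if __name__ == "__main__":
    # Stand-in for open(path, "rb").read() on a real file.
    sample = "This \u2013 is an n-dash.".encode("utf-16-le")
    for enc, n in count_patterns(sample, EN_DASH).items():
        print(enc, n)
```

If one encoding clearly dominates across several documents, that is a
reasonable hint about where to look in the hex editor. Rewriting the
bytes in place is a separate and riskier step: changing the length of
the data in a binary file may well corrupt it.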