Puzzling PDF

F.R. Sun, 16 Feb 2014 06:03:08 -0800

Hi all,

Struggling to parse bank statements that are not available in sensible data-transfer formats, I use pdftotext, which solves part of the problem. The other day I encountered something strange: one single figure out of many was erroneously converted into letters. Adobe Reader displays the figure 50'000 correctly, but pdftotext turns it into "SO'OOO" (the letters "S" as in Susan and "O" as in Otto). One would expect such a mistake from OCR. However, the statement is not a scan but is made up of text. Because malfunctions like this put a damper on the hope of ever having a reliable reader that doesn't require time-consuming manual verification, I played around a bit and ended up even more confused: when I lift the figure off the Adobe display (mark, copy) and paste it into a Python IDLE window, it is again letters (ASCII 83 and 79), even though the Adobe display shows it correctly as digits. How can that be?
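
For anyone who wants to reproduce the check, here is a minimal sketch of how the pasted text can be inspected. The string literal is only a stand-in for what comes out of the clipboard, and the helper names code_points and looks_like_disguised_number are made up for illustration; the second one is a rough heuristic, not a reliable detector.

```python
# Stand-in for the text pasted from the PDF viewer (assumption: the
# clipboard delivers exactly these characters).
pasted = "SO'OOO"


def code_points(text):
    """Return (character, code point) pairs for each character in text."""
    return [(ch, ord(ch)) for ch in text]


def looks_like_disguised_number(token):
    """Rough heuristic: the token has at least one letter that resembles
    a digit and otherwise only such letters or thousands/decimal separators."""
    lookalikes = set("SOIlB")   # letters easily mistaken for 5, 0, 1, 1, 8
    separators = set("'.,")
    has_lookalike = any(ch in lookalikes for ch in token)
    only_allowed = all(ch in lookalikes or ch in separators for ch in token)
    return has_lookalike and only_allowed


# Show what the characters really are, regardless of how they are drawn.
for ch, cp in code_points(pasted):
    print(repr(ch), cp)

# Flag tokens that should probably be verified by hand.
print(looks_like_disguised_number(pasted))
```

A check like this only flags suspicious tokens for manual review; it cannot tell which digits were meant.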


Frederic








--
https://mail.python.org/mailman/listinfo/python-list
