On Oct 12, 7:05 am, Kurt Peters <[EMAIL PROTECTED]> wrote:
> I'm using the code below to read a pdf document, and it has no line feeds
> or carriage returns in the imported text. I'm therefore trying to just
> replace the symbol that looks like it would be an end of line (found by
> examining the characters in the "for loop") unichr(167).
> Unfortunately, the replace isn't working, does anyone know what I'm
> doing wrong? I tried a number of things so I left comments in place as a
> subset of the bunch of things I tried to no avail.
This is the first time I've ever looked inside a PDF file, and *only* one file, but:

    import pyPdf, sys
    filename = sys.argv[1]
    doc = pyPdf.PdfFileReader(open(filename, "rb"))
    for pageno in range(doc.getNumPages()):
        page = doc.getPage(pageno)
        textu = page.extractText()
        print "pageno", pageno
        print type(textu)
        print repr(textu)

gives me <type 'unicode'> and text with lots of \n at places where you'd expect them.

The only problem I can see is that where I see (and expect) quotation marks (U+201C and U+201D) when viewing the file with Acrobat Reader, the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes and apostrophes.

I had a bit of a poke around:

1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and \x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates into \x93 and \x94).

2. Then pyPdf appears to push these through a fixed transformation table (_pdfDocEncoding in generic.py) and they become \ufb01 and \ufb02.

3. However:

|>>> '\x93\x94'.decode('cp1252') # as suspected
|u'\u201c\u201d' # as expected
|>>>

AFAICT there is only one reference to encoding in the pyPdf docs: "if pyPdf was unable to decode the string's text encoding" ...

Cheers,
John
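
For what it's worth, point 3 suggests a rough workaround: take the characters that came out of the fixed table and swap them for what cp1252 gives for the same byte. What follows is only a minimal sketch under loud assumptions. It is written in Python 3 syntax (the session above is Python 2), the name redecode_as_cp1252 and the table OBSERVED_PDFDOC_TO_BYTE are hypothetical, and the table holds only the two pairs I actually saw in my one file. It is not pyPdf's real table and not part of pyPdf's API.

```python
# Hypothetical table: only the two pairs observed in the one file examined.
# Key: the character the extracted text contained.
# Value: the raw byte it apparently came from.
OBSERVED_PDFDOC_TO_BYTE = {
    "\ufb01": 0x93,  # seen where a left double quotation mark was expected
    "\ufb02": 0x94,  # seen where a right double quotation mark was expected
}


def redecode_as_cp1252(text):
    """Replace each listed character with the cp1252 decoding of its byte.

    Assumption: the PDF really used cp1252 for these bytes. A genuine
    fi/fl ligature in the text would be (wrongly) replaced too.
    """
    table = {
        ord(ch): bytes([byte]).decode("cp1252")
        for ch, byte in OBSERVED_PDFDOC_TO_BYTE.items()
    }
    return text.translate(table)


sample = "\ufb01Hello,\ufb02 she said."
fixed = redecode_as_cp1252(sample)
print(repr(fixed))
```

The em-dashes and apostrophes would need their own entries, which I haven't worked out, so they are left out rather than guessed. Whether a given PDF really means cp1252 for those bytes presumably depends on the font encoding declared in the file, so treat this as a per-document patch rather than a general fix.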