In your original code:

  textu.replace(unichr(167),'\n')

should be, as Dennis suggested (though you may have been distracted by his 'fn' replacement, so I'll leave that out):

  textu = textu.replace(unichr(167),'\n')

Strings are immutable, so .replace does not modify the string in place. It returns a new string with the replacements made, so you have to assign the result back to textu.
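To make the difference concrete, here is a minimal sketch. The sample text is made up and merely stands in for what page.extractText() returns; u'\xa7' is the same character as unichr(167).

```python
# Stand-in for the text returned by page.extractText(); the content is made up.
textu = u'R-2101\xa7Anniston, AL\xa7Boundaries.'

# Calling replace() and discarding the result leaves textu untouched.
textu.replace(u'\xa7', u'\n')
unchanged = textu

# Binding the returned string to a name keeps the change.
fixed = textu.replace(u'\xa7', u'\n')

print(repr(unchanged))
print(repr(fixed))
```

The first repr still contains \xa7; the second has \n in its place.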

-Mark

"Kurt Peters" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
Thanks.
Clearly, though, my "for loop" shows a character with ordinal 167, and print repr(textu) shows the character \xa7 (as does Peter Otten's post). So that you can see what I see, here's the document I'm using, the Special Use Airspace document at
http://www.faa.gov/airports_airtraffic/air_traffic/publications/
which is JO 7400.8P (PDF)

If you just look at page three, it shows those unusual characters.
Once again, a "simple" replace doesn't seem to work. I can't figure out how to get it to work, despite all the great posts attempting to shed some light on the subject.

Regards,
Kurt


"John Machin" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
On Oct 12, 7:05 am, Kurt Peters <[EMAIL PROTECTED]> wrote:
I'm using the code below to read a pdf document, and it has no line feeds
or carriage returns in the imported text. I'm therefore trying to just
replace the symbol that looks like it would be an end of line (found by
examining the characters in the "for loop") unichr(167).
Unfortunately, the replace isn't working. Does anyone know what I'm
doing wrong? I tried a number of things, so I left some of the failed
attempts in place as comments.

This is the first time I've ever looked inside a PDF file, and *only*
one file, but:

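import pyPdf, sys
filename = sys.argv[1]
# Open in binary mode: a PDF is not a plain-text file.
doc = pyPdf.PdfFileReader(open(filename, "rb"))
for pageno in range(doc.getNumPages()):
    page = doc.getPage(pageno)
    textu = page.extractText()
    print "pageno", pageno
    print type(textu)
    # repr() shows non-ASCII characters as escapes such as \xa7 or \ufb01.
    print repr(textu)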

gives me <type 'unicode'> and text with lots of \n at places where
you'd expect them.
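A quick way to check which non-ASCII characters a page really contains is to tally them. This is only a sketch: non_ascii_counts is a hypothetical helper, the sample string is invented rather than taken from the FAA document, and collections.Counter needs Python 2.7 or later.

```python
from collections import Counter


def non_ascii_counts(text):
    """Return a Counter of the non-ASCII characters found in text."""
    return Counter(ch for ch in text if ord(ch) > 127)


# Invented sample; in practice this would be the result of page.extractText().
sample = u'\ufb01Restricted\ufb02 area \xa7 73.21 \xa7 73.22'

counts = non_ascii_counts(sample)
for ch, count in sorted(counts.items()):
    print('U+%04X x %d' % (ord(ch), count))
```

If U+00A7 does not appear in the tally for a page, a replace on unichr(167) has nothing to act on for that page.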

The only problem I can see is that where I see (and expect) quotation
marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
and apostrophes. I had a bit of a poke around:

1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
\x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
into \x93 and \x94).

2. Then pyPdf appears to push these through a fixed transformation
table (_pdfDocEncoding in generic.py) and they become \ufb01 and
\ufb02.

3. However:
|>>> '\x93\x94'.decode('cp1252') # as suspected
|u'\u201c\u201d' # as expected
|>>>
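The three observations above can be put side by side in a short sketch. The repair table is an assumption, not part of pyPdf: it only makes sense if the bytes in the file really were cp1252, and it would also clobber any genuine fi/fl ligatures in the text. LIGATURE_TO_QUOTE and repair_quotes are hypothetical names.

```python
# The raw bytes seen in the decoded stream.
raw = b'\x93\x94'

# Read as cp1252 they are the curly quotation marks U+201C and U+201D.
as_cp1252 = raw.decode('cp1252')

# Hypothetical repair table: map the ligature code points that came out of
# the extraction back to the quotation marks Acrobat Reader displays.
LIGATURE_TO_QUOTE = {0xFB01: u'\u201c', 0xFB02: u'\u201d'}


def repair_quotes(text):
    """Replace U+FB01/U+FB02 with U+201C/U+201D in extracted text."""
    return text.translate(LIGATURE_TO_QUOTE)


extracted = u'\ufb01quoted\ufb02'   # invented example of extractText() output
repaired = repair_quotes(extracted)

print(repr(as_cp1252))
print(repr(repaired))
```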

AFAICT there is only one reference to encoding in the pyPdf docs: "if
pyPdf was unable to decode the string's text encoding" ...

Cheers,
John

