Thanks. The "distraction" was my problem. I replaced the textu.replace
call as you suggested and it works fine.

Kurt

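For anyone reading the archive later, the fix discussed here can be sketched as follows. This is only an illustration, written in Python 3 syntax (so chr(167) stands in for the thread's unichr(167)), and the sample string is made up rather than taken from the PDF.

```python
# Hypothetical sample text; in the thread the text came from pyPdf's extractText().
SECTION_SIGN = chr(167)  # U+00A7, written unichr(167) in Python 2

textu = "first line" + SECTION_SIGN + "second line"

# Calling replace and discarding the result leaves textu unchanged,
# because strings are immutable and replace returns a new string.
textu.replace(SECTION_SIGN, "\n")
unchanged = textu

# Rebinding the name to the returned string is what applies the change.
textu = textu.replace(SECTION_SIGN, "\n")

print(repr(unchanged))
print(repr(textu))
```

The first print still shows the section sign; the second shows a newline in its place.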
On Sun, 12 Oct 2008 19:53:09 -0700, Mark Tolonen wrote:

> In your original code:
>
>     textu.replace(unichr(167),'\n')
>
> as Dennis suggested (but maybe you were distracted by his 'fn'
> replacement, so I'll leave it out):
>
>     textu = textu.replace(unichr(167),'\n')
>
> .replace does not modify the string in place. It returns the modified
> string, so you have to reassign it.
>
> -Mark
>
> "Kurt Peters" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
>> Thanks,
>> clearly though, my "For loop" shows a character using ord(167), and
>> using print repr(textu), it shows the character \xa7 (as does Peter
>> Oten's post). So you can see what I see, here's the document I'm
>> using - the Special Use Airspace document at
>> http://www.faa.gov/airports_airtraffic/air_traffic/publications/
>> which is JO 7400.8P (PDF)
>>
>> if you just look at page three, it shows those unusual characters. Once
>> again, using a "simple" replace, doesn't seem to work. I can't seem to
>> figure out how to get it to work, despite all the great posts
>> attempting to shed some light on the subject.
>>
>> Regards,
>> Kurt
>>
>>
>> "John Machin" <[EMAIL PROTECTED]> wrote in message
>> news:42f39e4c- [EMAIL PROTECTED]
>> On Oct 12, 7:05 am, Kurt Peters <[EMAIL PROTECTED]> wrote:
>>> I'm using the code below to read a pdf document, and it has no line
>>> feeds or carriage returns in the imported text. I'm therefore trying
>>> to just replace the symbol that looks like it would be an end of line
>>> (found by examining the characters in the "for loop") unichr(167).
>>> Unfortunately, the replace isn't working, does anyone know what I'm
>>> doing wrong? I tried a number of things so I left comments in place as
>>> a subset of the bunch of things I tried to no avail.
>>
>> This is the first time I've ever looked inside a PDF file, and *only*
>> one file, but:
>>
>> import pyPdf, sys
>> filename = sys.argv[1]
>> doc = pyPdf.PdfFileReader(open(filename, "rb"))
>> for pageno in range(doc.getNumPages()):
>>     page = doc.getPage(pageno)
>>     textu = page.extractText()
>>     print "pageno", pageno
>>     print type(textu)
>>     print repr(textu)
>>
>> gives me <type 'unicode'> and text with lots of \n at places where
>> you'd expect them.
>>
>> The only problem I can see is that where I see (and expect) quotation
>> marks (U+201C and U+201D) when viewing the file with Acrobat Reader,
>> the repr is showing \ufb01 and \ufb02. Similar problems with em-dashes
>> and apostrophes. I had a bit of a poke around:
>>
>> 1. repr(result of FlateDecode) includes *both* the raw bytes \x93 and
>> \x94, *and* the octal escapes \\223 and \\224 (which pyPdf translates
>> into \x93 and \x94).
>>
>> 2. Then pyPdf appears to push these through a fixed transformation
>> table (_pdfDocEncoding in generic.py) and they become \ufb01 and
>> \ufb02.
>>
>> 3. However:
>> |>>> '\x93\x94'.decode('cp1252') # as suspected
>> |u'\u201c\u201d' # as expected
>> |>>>
>>
>> AFAICT there is only one reference to encoding in the pyPdf docs: "if
>> pyPdf was unable to decode the string's text encoding" ...
>>
>> Cheers,
>> John
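
John's cp1252 check in point 3 can be reproduced in Python 3 roughly as below. This is a sketch rather than anything pyPdf does itself: it uses only the standard library, the byte values and code points come from his post, and the variable names are illustrative.

```python
import unicodedata

# The two raw bytes John found in the decoded stream.
raw = b"\x93\x94"

# Interpreting them as Windows-1252 gives the curly quotes seen in Acrobat Reader.
decoded = raw.decode("cp1252")

# The characters that showed up in the repr instead, per John's post.
reported = "\ufb01\ufb02"

for label, text in (("cp1252", decoded), ("reported", reported)):
    for ch in text:
        print(label, "U+%04X" % ord(ch), unicodedata.name(ch))
```

The printed names show that the reported characters are ligatures rather than quotation marks, which fits John's suspicion that the bytes were passed through a table that does not match how the document encoded them.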