Re: io module and pdf question

Dave Angel Tue, 25 Jun 2013 19:15:19 -0700

On 06/25/2013 12:15 PM, [email protected] wrote:

Thank you Rusi and Christian!

Something I don't think was mentioned was that reading a text file in Python 3, and specifying latin-1, will work simply because every possible 8-bit byte is a character in Latin-1. That doesn't mean that those characters you get have any connection with the real meaning of the file.
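A small sketch of that point, using only the standard codecs. The sample character is made up for illustration and has nothing to do with the PDF in question:

```python
# Every byte value 0..255 maps to exactly one Latin-1 character,
# so decoding as latin-1 can never fail, whatever the bytes really mean.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode('latin-1')
assert as_latin1.encode('latin-1') == all_bytes

# The same two bytes give different text depending on the codec used.
raw = 'é'.encode('utf-8')        # two bytes: 0xC3 0xA9
wrong = raw.decode('latin-1')    # two unrelated characters, no error raised
right = raw.decode('utf-8')      # the character that was actually stored
```

Decoding without an error only shows that the bytes were accepted, not that the resulting text is what the file's author meant.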

So it sounds like I should read the pdf data in as binary:

--------------------
import os

pdfPath = '~/Desktop/test.pdf'

colorlistData = ''

with open(os.path.expanduser(pdfPath), 'rb') as f:
     for i in f:
         if 'XYZ:colorList' in i:
             colorlistData = i.split('XYZ:colorList')[1]
             break

print(colorlistData)
--------------------

This gives me the error:
TypeError: Type str doesn't support the buffer API

That's just a tiny piece of the error. Post the full traceback, which shows the line that fails, and what called it, and so on. In this case I'd guess that the line:

    for i in f:

is failing since that mechanism is for reading lines in a text file. For reading streams of bytes, you have the read() method, where you supply your own count.
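Without the full traceback this is only a possibility, but in Python 3 that message can also come from mixing str and bytes, for example testing a str with "in" against data read in 'rb' mode. Here is a minimal sketch of reading with read() and a count, then searching with a bytes literal. The helper name read_all, the chunk size, and the sample data are made up for illustration; an io.BytesIO stands in for the file opened with 'rb':

```python
import io

def read_all(stream, chunk_size=4096):
    """Read a binary stream to the end, chunk_size bytes at a time."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:            # b'' signals end of file
            break
        chunks.append(chunk)
    return b''.join(chunks)

# Stand-in for open(path, 'rb'); the content is invented sample data.
sample = io.BytesIO(b'%PDF-1.4\n\x00\xff junk XYZ:colorList="DarkBlue,Yellow" more\n')
data = read_all(sample, chunk_size=16)

marker = b'XYZ:colorList'        # bytes literal, not str
found = marker in data
tail = data.split(marker)[1] if found else b''
```

The search is done on the joined data rather than chunk by chunk, so a marker that straddles two chunks is still found.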


I admit I know nothing about binary, except it's ones and zeroes.  Is there a 
way to read it in as binary, convert it to ascii/unicode,

That makes no sense without knowing what the binary data represents. It MIGHT be that pieces of it will actually be valid ascii, or valid unicode (encoded with some encoding). But you would have to ask the author, or look up the spec for that particular binary file format.

I'm not familiar at all with how PDFs are encoded, so I don't know what the possibilities are.
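If one piece of the data is known to be ASCII, that piece alone can be located as bytes and decoded. This is a rough sketch, assuming the attribute looks like the XYZ:colorList="..." example from the original post and that its value is plain ASCII; the surrounding bytes are invented sample data:

```python
import re

# Invented sample: binary junk around an ASCII attribute.
data = b'\x89\x00\xfe XYZ:colorList="DarkBlue,Yellow" \xff\x01'

# Search on bytes and capture only the quoted value.
match = re.search(rb'XYZ:colorList="([^"]*)"', data)

colors = []
if match:
    # Assumption: the value itself is plain ASCII.
    # decode() raises UnicodeDecodeError if that turns out to be false.
    value = match.group(1).decode('ascii')
    colors = value.split(',')
```

Only the captured value is decoded, so the non-text bytes around it never go through a codec.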

One hacky approach is to use the strings utility (standard on most versions of Unix/Linux) to basically throw out most of the file, keeping only those portions of it that happen to look like reasonable ASCII. By default it captures each consecutive sequence of at least 4 ASCII printable characters, and puts a newline to represent one or more unprintable or non-ASCII characters between them.

If you cannot find strings (or string) for your OS, you can write the filter yourself.
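Such a filter might look roughly like this. It is a sketch, not a reimplementation of the real utility: the function name ascii_strings is made up, and "printable" is taken here to mean the bytes 0x20 through 0x7E only:

```python
import re

def ascii_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes, as str."""
    # 0x20 (space) through 0x7e (~) is the printable ASCII range.
    pattern = rb'[\x20-\x7e]{%d,}' % min_len
    return [run.decode('ascii') for run in re.findall(pattern, data)]

# Invented sample: two long-enough runs and one that is too short.
sample = b'\x00\x01Hello\xffab\x02World!!\n'
runs = ascii_strings(sample)
```

Printing '\n'.join(runs) gives output in the same one-run-per-line shape described above.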

But much better would be to use some library that understood the PDF format rules.

and then somehow split it by newline characters so that I can pull the appropriate 
metadata lines out?  For example, XYZ:colorList="DarkBlue,Yellow"

Thanks!

Jay

--

Most of the PDF objects are therefore not encoded. It is, however,
possible to include a PDF into another PDF and to encode it, but that's
a rare case. Therefore the metadata can usually be read in text mode.
However, to correctly find all objects, the xref-table indexes offsets
into the PDF. It must be treated binary in any case, and that's the
funny reason for the first 3 characters of the PDF - they must include
characters with the 8th bit set, such that FTP applications treat it as
binary.

        Christian



--
DaveA
--
http://mail.python.org/mailman/listinfo/python-list
