Thank you Rusi and Christian! So it sounds like I should read the pdf data in as binary:

--------------------
import os

pdfPath = '~/Desktop/test.pdf'
colorlistData = ''

with open(os.path.expanduser(pdfPath), 'rb') as f:
    for i in f:
        if 'XYZ:colorList' in i:
            colorlistData = i.split('XYZ:colorList')[1]
            break

print(colorlistData)
--------------------

This gives me the error:

TypeError: Type str doesn't support the buffer API

I admit I know nothing about binary, except it's ones and zeroes. Is there a way to read it in as binary, convert it to ascii/unicode, and then somehow split it by newline characters so that I can pull the appropriate metadata lines out? For example,

XYZ:colorList="DarkBlue,Yellow"

Thanks!

Jay

--

> Most of the PDF objects are therefore not encoded. It is, however,
> possible to include a PDF into another PDF and to encode it, but that's
> a rare case. Therefore the metadata can usually be read in text mode.
> However, to correctly find all objects, the xref-table indexes offsets
> into the PDF. It must be treated binary in any case, and that's the
> funny reason for the first 3 characters of the PDF - they must include
> characters with the 8th bit set, such that FTP applications treat it as
> binary.
> Christian
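
For reference, the TypeError above comes from Python 3: a file opened with 'rb' yields bytes objects, and testing a str such as 'XYZ:colorList' against bytes is not allowed. A minimal sketch of one way around it, assuming Python 3 and assuming the metadata is stored uncompressed in the form XYZ:colorList="..." as in the example, is to search the raw bytes with a bytes pattern and decode only the part that matched. The names find_color_list and read_color_list are made up for illustration.

```python
import os
import re

# Bytes pattern (note the rb prefix): the file content is bytes, not str.
# Assumes the attribute appears literally as XYZ:colorList="..." in the file.
COLORLIST_RE = re.compile(rb'XYZ:colorList="([^"]*)"')


def find_color_list(data):
    """Return the colour names found in raw PDF bytes, or [] if absent."""
    match = COLORLIST_RE.search(data)
    if match is None:
        return []
    # latin-1 maps every byte to exactly one character, so decoding
    # cannot fail; the colour names themselves are assumed to be ASCII.
    value = match.group(1).decode('latin-1')
    return [name for name in value.split(',') if name]


def read_color_list(pdf_path):
    """Read the whole file in binary mode and extract the colour list."""
    with open(os.path.expanduser(pdf_path), 'rb') as f:
        return find_color_list(f.read())
```

Calling read_color_list('~/Desktop/test.pdf') would then return a list of names. Searching the bytes directly avoids splitting on newlines at all; if line-by-line processing is preferred, data.splitlines() also works on bytes, as long as the search key is written as b'XYZ:colorList'. Metadata that sits inside a compressed stream will not be found this way.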