Hi, my name is David. I need to read information from .pdf files and convert it to .txt files, and I have to do this in Python. I have been looking for Python libraries, and pdftools seems to be the solution, but I do not know how to use it well. This is the example that I found on the internet:

    from pdftools.pdffile import PDFDocument
    from pdftools.pdftext import Text

    def contents_to_text(contents):
        # Walk the (possibly nested) contents and yield each text fragment.
        for item in contents:
            if isinstance(item, list):
                for i in contents_to_text(item):
                    yield i
            elif isinstance(item, Text):
                yield item.text

    doc = PDFDocument("/home/dave/pruebas_ficheros/carlos.pdf")
    n_pages = doc.count_pages()
    text = []
    # Pages are numbered starting at 1.
    for n_page in range(1, n_pages + 1):
        print "Page", n_page
        page = doc.read_page(n_page)
        contents = page.read_contents().contents
        text.extend(contents_to_text(contents))
    print "".join(text)

The problem is that for some PDFs the output has words joined together, and in Spanish the accented characters come out as strange characters: "camión" becomes "cami/86n" and "IMPLEMENTACIÓN" becomes "IMPLEMENTACI?".

If someone knows how to use pdftools and can help me, I would be very happy.

Another thing: I can see the text read from the .pdf on the screen, but I do not know how to create a .txt file and save this text in it.

Sorry for my English. Thanks for everything.
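
To make the file-saving part of the question concrete, here is a minimal sketch of writing a list of extracted text pieces to a .txt file. It does not use pdftools at all. The helper name save_text and the output file name are hypothetical, and it assumes the pieces are already Unicode strings; how pdftools encodes item.text is not known here, so a decoding step may still be needed first. Joining with a space instead of "" is only a guess at the joined-words problem and could split a word that the PDF stores in several pieces.

```python
# -*- coding: utf-8 -*-
import io


def save_text(pieces, path, encoding="utf-8"):
    """Join text pieces with single spaces and write them to `path`.

    Hypothetical helper: `pieces` is assumed to be an iterable of
    Unicode strings, not raw bytes taken from the PDF.
    """
    joined = u" ".join(pieces)
    # io.open takes an explicit encoding, so accented characters such as
    # the "ó" in "camión" are written as UTF-8 instead of being replaced.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(joined)
    return joined
```

With the example above, this would be called as save_text(text, "salida.txt") in place of the final print, where "salida.txt" is just an example name.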