Davor wrote:
> Hi, my name is david.
> I need to read information from .pdf files and convert to .txt files,
> and I have to do this on python,
> I have been looking for libraries on python and the pdftools seems to
> be the solution, but I do not know how to use them well,
> this is the example that I found on the internet is:
[...]
> for n_page in range (1, (n_pages+1)):
>     print "Page", n_page
>     page = doc.read_page (n_page)
>     contents = page.read_contents ().contents
>     text.extend (contents_to_text (contents))
>
> print "".join (text)
>
> the problem is that on some pdf´s it generates join words and In
> spanish the "acentos"
> in words like: "camión" goes to --> cami/86n or
> "IMPLEMENTACIÓN" -----> "IMPLEMENTACI?" give strange
> characters

pdftools just extracts the textual data in the file and stores it in
Text instances. It doesn't try to interpret or decode the text. I'd
like to fix the library so that it decodes the text properly and puts
it into unicode strings, but I don't have the time right now.

Remember that text can be stored in PDF files in many different ways,
and that the text cannot always be extracted in its original form.

> if someone knows how to use the pdftools and can help me it makes me
> very happy.
>
> Another thing is that I can see the letters readden from .pdf on the
> screen, but I do not know how to create a file and save this
> information inside the file a .txt

You need to do something like this:

    with open("myfilename", "w") as f:
        f.write("".join (text))

> Sorry for my english.

Don't worry about it. It's much better than my Spanish will ever be.

Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you need
better than pdftools can at the moment.

David
-- 
http://mail.python.org/mailman/listinfo/python-list
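
As a follow-up to the file-saving advice above, here is a minimal sketch of how the extracted pieces might be decoded and written out as UTF-8. It assumes that any byte strings in the list are Latin-1 encoded, which is only a guess: as noted above, PDF files store text in many different ways, so this will not be right for every file. The helpers to_unicode and save_text, the sample list, and the file name output.txt are all hypothetical and are not part of pdftools.

```python
import io


def to_unicode(chunk, encoding="latin-1"):
    """Return chunk as a unicode string.

    Assumption: byte strings taken from the PDF are Latin-1 encoded.
    Many PDFs use other encodings, so treat this as a starting point.
    """
    if isinstance(chunk, bytes):
        return chunk.decode(encoding)
    return chunk


def save_text(chunks, path):
    """Decode every chunk, join them, and write the result as UTF-8."""
    # io.open accepts an explicit encoding on both Python 2 and Python 3.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"".join(to_unicode(c) for c in chunks))


# Hypothetical stand-in for the list built up by the page loop.
text = [b"cami\xf3n", u"\n", b"IMPLEMENTACI\xd3N", u"\n"]
save_text(text, "output.txt")
```

If the accented characters still come out wrong, the bytes are probably not Latin-1, and a different encoding argument (or a different tool) would be needed.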