Fixing PDF EOF Errors with PyPDF
Hey all,

I'm trying to read a library of my company's PDFs, but about a third of them can't be opened. PyPDF (v1.12) spits out this error:

    pyPdf.utils.PdfReadError: EOF marker not found

I searched for an answer on Google, but all I found was this link: http://lindaocta.com/?tag=pypdf. She suggests fixing the problem by appending an EOF marker like so:

    def fixPdf(pdfFile):
        try:
            fileOpen = file(pdfFile, "a")
            fileOpen.write("%%EOF")
            fileOpen.close()
            return "Fixed"
        except Exception, e:
            return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

This appears to append to every file successfully, since the exception is never triggered and "Fixed" is always returned, but subsequent attempts to open the files with PyPDF all failed. Yet all of those files can be opened successfully with Adobe Acrobat Reader.

Is this code incorrect, or is there some other way to correct this error? Or does the code depend on the system? (I'm using Windows XP, but I believe the author was using a *nix.)

Sincerely,
Brett Bowman
--
http://mail.python.org/mailman/listinfo/python-list
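One way to narrow this down is to check whether the marker is already present before appending anything, and to do the append in binary mode so that no newline translation happens on Windows. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the 1024-byte search window is an assumption about how lenient readers tend to be, and appending a marker will not help if the rest of the file's trailer is damaged.

```python
import os

EOF_MARKER = b"%%EOF"


def has_eof_marker(path, tail_size=1024):
    """Return True if the EOF marker appears in the last tail_size bytes."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        # Jump to the start of the tail (or the start of a short file).
        handle.seek(max(0, size - tail_size))
        return EOF_MARKER in handle.read()


def append_eof_marker(path):
    """Append the marker in binary mode if missing; return True if the file changed."""
    if has_eof_marker(path):
        return False
    with open(path, "ab") as handle:
        # Put the marker on its own line at the very end of the file.
        handle.write(b"\n" + EOF_MARKER + b"\n")
    return True
```

If has_eof_marker() already returns True for the files that fail, the missing marker is probably not the real problem and the files may need to be rewritten by another tool.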
Cannot Remove File: Device or resource busy
I'm spawning a subprocess to fix some formatting errors in a library of PDFs with pdftk:

    try:
        sp = subprocess.Popen('pdftk.exe "%s" output %s' % (pdfFile, outputFile))
        sp.wait()
        del sp
    except Exception, e:
        return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

And then I test the result:

    try:
        pdf_handle = open(outputFile, "rb")
        pdf_pypdf = PdfFileReader(pdf_handle)
        del pdf_pypdf
        del pdf_handle
    except Exception, e:
        return "Unable to open file: %s with error: %s" % (outputFile, str(e))

Both of these appear to work. But when I try to delete the original pdfFile, I get an error message saying that the file is still in use. If I use:

    sp = subprocess.Popen('rm "%s"' % pdfFile)
    sp.wait()

I get the standard error message from rm, and if I use:

    cwd = os.getcwd()
    os.remove(cwd + "\\" + pdfFile)

I get "WindowsError: [Error 32]", saying much the same thing.

What am I missing? Any suggestions would be appreciated.

Details:
Python 2.6
Windows XP

Sincerely,
Brett Bowman
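A possible way to structure the convert-then-delete step is sketched below. It passes the command as an argument list, waits for the tool to exit, and deletes the original only if the tool reported success and the output file exists. The helper name run_then_delete is hypothetical, and the sketch does not by itself explain which process is still holding the file.

```python
import os
import subprocess


def run_then_delete(cmd_args, original, output):
    """Run cmd_args; delete `original` only if the tool succeeded and wrote `output`."""
    # call() blocks until the child process exits and returns its exit code.
    exit_code = subprocess.call(cmd_args)
    if exit_code != 0 or not os.path.exists(output):
        return False
    os.remove(original)
    return True

# Hypothetical usage with the command from the message above:
# run_then_delete(["pdftk.exe", pdfFile, "output", outputFile], pdfFile, outputFile)
```

Checking the exit code before deleting avoids losing the original when the conversion quietly fails.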
Re: Cannot Remove File: Device or resource busy
Good ideas, but I've tried them already:

- Removing the del command, or replacing it with a set-to-null: neither solves my file access problem.
- PdfFileReader has no close() function, so calling it causes an error. Weird, but true.
- pdf_handle.close(), on the other hand, fails to solve the problem.

On Tue, Nov 16, 2010 at 11:25 PM, Dennis Lee Bieber wrote:
> On Tue, 16 Nov 2010 17:37:10 -0800, Brett Bowman
> declaimed the following in gmane.comp.python.general:
>
> > And then I test the result:
> > try:
> >     pdf_handle = open(outputFile, "rb")
> >     pdf_pypdf = PdfFileReader(pdf_handle)
> >     del pdf_pypdf
> >     del pdf_handle
> > except Exception, e:
> >     return "Unable to open file: %s with error: %s" % (outputFile, str(e))
>
> You seem enamored of "del", which is something I've only used for
> special purposes, and even then rarely -- binding a null object to the
> name is just as effective for most uses.
>
> While the common Python does garbage collect objects when the
> reference count goes to zero, there is no real guarantee of this.
>
> I'd replace that
>     del pdf_handle
> with
>     pdf_handle.close()
>
> --
> Wulfraed                 Dennis Lee Bieber         AF6VN
> wlfr...@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
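The point about not relying on garbage collection can be illustrated with a with block, which closes the handle when the block is left whether or not an exception occurred. This is a small sketch with a hypothetical helper name; since close() reportedly did not cure the lock in this thread, it only rules out the script's own handle as the culprit.

```python
import os


def read_header_then_delete(path, header_size=8):
    """Read the first bytes of a file, then delete it once the handle is closed."""
    with open(path, "rb") as handle:
        # Any reader object that needs the open handle would be used here.
        header = handle.read(header_size)
    # Leaving the with block has closed the handle, so it is closed here.
    assert handle.closed
    os.remove(path)
    return header
```

If os.remove() still fails after this, some other process or handle is likely keeping the file open.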
Subprocess Call works on Windows, but not Ubuntu
I ran into an interesting problem trying to spawn a subprocess, so I thought I'd ask if the experts could explain it to me. I'm spawning a subprocess to run "pdf2txt.py", which is a tool that is distributed with PDFminer to do moderately advanced text-dumps of PDFs. Yet when I run the same code on my two dev machines (one Win XP, the other Ubuntu 10.04 or 10.10) it only works on the former and not the latter. And it's not terribly complicated code.

    # Code Start
    sp_line = 'python pdf2txt.py -p 1 -o %s "%s"' % ('temp.out', pdf_filename)
    print sp_line
    sp = subprocess.Popen(sp_line)
    sp.wait()
    with open('temp.out', 'r') as pdf_handle:
        # Do stuff to read the file

The output from the print statement reads:

    python pdf2txt.py -p 1 -o temp.out "Aarts et al (2009).pdf"

That command works on both systems when copied directly to the command line, and the Python script it is a part of works on the Windows machine, but I can't get the script to work on Ubuntu for the life of me. What am I missing?

/b/
Re: Subprocess Call works on Windows, but not Ubuntu
Ah, that fixed it. Thank you.

On Tue, Nov 23, 2010 at 11:37 AM, Chris Rebert wrote:
> On Tue, Nov 23, 2010 at 11:28 AM, Brett Bowman wrote:
> > I ran into an interesting problem trying to spawn a subprocess, so I thought
> > I'd ask if the experts could explain it to me. I'm spawning a subprocess to
> > run "pdf2txt.py", which is a tool that is distributed with PDFminer to do
> > moderately advanced text-dumps of PDFs. Yet when I run the same code on my
> > two dev machines - one Win XP, the other Ubuntu 10.04 or 10.10 - it only
> > works on the former and not the latter. And it's not terribly complicated
> > code.
> > # Code Start
> > sp_line = 'python pdf2txt.py -p 1 -o %s "%s"' % ('temp.out', pdf_filename)
> > print sp_line
> > sp = subprocess.Popen(sp_line)
> >
> > python pdf2txt.py -p 1 -o temp.out "Aarts et al (2009).pdf"
> > That command works on both systems when copied directly to the command line,
> > and the Python script it is a part of works on the Windows machine, but I
> > can't get the script to work on Ubuntu for the life of me. What am I missing?
>
> Quoting the docs (for the Nth time; emphasis added):
> """
> On Unix, with shell=False (default): args should normally be a
> sequence. ***If a string is specified for args***, it will be used as
> the name or path of the program to execute; ***this will only work if
> the program is being given no arguments.***
> """
> http://docs.python.org/library/subprocess.html#subprocess.Popen
>
> Fixed version:
> sp_args = ['python', 'pdf2txt.py', '-p', '1', '-o', 'temp.out', pdf_filename]
> sp = subprocess.Popen(sp_args)
>
> Cheers,
> Chris
> --
> http://blog.rebertia.com
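When a command already exists as a single string, the standard library's shlex.split can turn it into the kind of argument list the fixed version uses. The sketch below is an illustration with a hypothetical helper name, not part of the original fix. shlex.split follows POSIX shell quoting rules, so it may mangle Windows paths containing backslashes, and building the list directly as shown in the reply remains the more robust choice.

```python
import shlex


def to_arg_list(command_line):
    """Split a shell-style command string into an argument list for Popen."""
    return shlex.split(command_line)


# The command string from the original message, with the example filename.
sp_line = 'python pdf2txt.py -p 1 -o %s "%s"' % ("temp.out", "Aarts et al (2009).pdf")
sp_args = to_arg_list(sp_line)
# The quoted filename stays together as one argument, spaces and all.
```

The resulting sp_args can then be passed to subprocess.Popen in place of the string.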
Copy Protected PDFs and PIL
I'm trying to parse some basic details and a thumbnail from ~12,000 PDFs for my company, but a few hundred of them are copy protected. To make matters worse, I can't seem to trap the error it causes: whenever it happens PIL throws a "FATAL PDF disallows copying" message and dies. An automated way to snap a picture of the PDFs would be ideal, but I'd settle for a way to skip over them without crashing my program.

Any tips?

Brett Bowman
Bioinformatics Associate
Cibus LLC
Re: Copy Protected PDFs and PIL
Windows currently, though I also have a Linux box running Ubuntu if need be.

On Thu, Nov 11, 2010 at 12:28 PM, Brett Bowman wrote:
> I'm trying to parse some basic details and a thumbnail from ~12,000 PDFs
> for my company, but a few hundred of them are copy protected. To make
> matters worse, I can't seem to trap the error it causes: whenever it happens
> PIL throws a "FATAL PDF disallows copying" message and dies. An automated
> way to snap a picture of the PDFs would be ideal, but I'd settle for a way
> to skip over them without crashing my program.
>
> Any tips?
>
> Brett Bowman
> Bioinformatics Associate
> Cibus LLC
Re: Copy Protected PDFs and PIL
To answer the various questions:

MRAB - I've tried worker threads, and it kills the thread only and not the program as a whole. I could use that as a work-around, but I would prefer something more direct, in case other problems arise.

Steve Holden - A traceback sounds like a great idea, but I don't know how to go about it, or know what is involved. Could you suggest a tutorial I could follow?

Emile van Sebille - A try/except block was the first thing I tried, and it still dies with a fatal error, even if I use a generic except.

Robert Kern - Ah, whoops, good catch. I meant to say gfx and swftools. I'm using PIL to modify the images once I get a PNG from swftools, and I misspoke.

The code in question is:

    import gfx
    print "1"
    doc = gfx.open("pdf", MY_FILE)
    print "2"
    a_page = doc.getPage(1)
    print "3"
    g_img = gfx.ImageList()
    print "4"
    g_img.startpage(a_page.width, a_page.height)
    print "5"
    a_page.render(g_img)
    print "6"
    g_img.endpage()
    print "7"
    g_img.save(TEMP_PNG)

which prints the following:

    1
    2
    3
    4
    5
    FATAL PDF disallows copying

Any help or suggestions would be appreciated.

/b/

On Thu, Nov 11, 2010 at 12:28 PM, Brett Bowman wrote:
> I'm trying to parse some basic details and a thumbnail from ~12,000 PDFs
> for my company, but a few hundred of them are copy protected. To make
> matters worse, I can't seem to trap the error it causes: whenever it happens
> PIL throws a "FATAL PDF disallows copying" message and dies. An automated
> way to snap a picture of the PDFs would be ideal, but I'd settle for a way
> to skip over them without crashing my program.
>
> Any tips?
>
> Brett Bowman
> Bioinformatics Associate
> Cibus LLC
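Since a try/except block does not trap the failure, one commonly used work-around is to run the rendering step for each PDF in its own child process and look at the exit code, so that a fatal exit only takes down the child. The sketch below assumes the fatal error terminates the process with a non-zero exit code, which the thread does not confirm. The helper name and the stand-in child commands are hypothetical; a real worker would be a small script containing the gfx calls shown above.

```python
import subprocess
import sys


def succeeds_in_child(cmd_args):
    """Run cmd_args in a separate process; return True only if it exits with code 0."""
    # call() waits for the child; a crash there cannot kill this process.
    return subprocess.call(cmd_args) == 0


# Stand-ins for a worker script: one that dies abruptly, one that finishes normally.
CRASHING_CHILD = [sys.executable, "-c", "import os; os._exit(1)"]
HEALTHY_CHILD = [sys.executable, "-c", "pass"]
```

The parent loop can then skip any PDF for which succeeds_in_child() returns False and carry on with the rest.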