Fixing PDF EOF Errors with PyPDF

2010-11-14 Thread Brett Bowman
Hey all, I'm trying to read a library of my company's PDFs, but about a
third of them can't be opened.  PyPDF (v1.12) spits out this error:

pyPdf.utils.PdfReadError: EOF marker not found

I searched for the answer via google, but all I found was this link:
http://lindaocta.com/?tag=pypdf.  She suggests fixing the problem by
appending an EOF marker like so:

def fixPdf(pdfFile):
    try:
        fileOpen = file(pdfFile, "a")
        fileOpen.write("%%EOF")
        fileOpen.close()
        return "Fixed"
    except Exception, e:
        return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

This appears to append the marker to every file: the exception is never
triggered and "Fixed" is always returned.  But subsequent attempts to open
the files with PyPDF all fail, even though every one of those files can be
opened successfully with Adobe Acrobat Reader.
Is this code incorrect, or is there some other way to correct this error?
Or does the code depend on the system?
(I'm using Windows XP, but I believe the author was using a *nix.)

Sincerely,
Brett Bowman
-- 
http://mail.python.org/mailman/listinfo/python-list
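
A minimal sketch of a more cautious variant of the append approach above,
written for Python 3 rather than the Python 2 used in the thread.  It looks
for the marker in the tail of the file first and appends in binary mode, so
no newline translation applies.  The names has_eof_marker and
append_eof_marker and the 1024-byte tail size are assumptions made for
illustration; they are not part of PyPDF.  Appending a marker only helps if
a missing marker is the real problem, which the thread does not establish.

```python
import os

EOF_MARKER = b"%%EOF"

def has_eof_marker(path, tail_bytes=1024):
    """Return True if the EOF marker appears in the last tail_bytes of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        handle.seek(max(0, size - tail_bytes))
        return EOF_MARKER in handle.read()

def append_eof_marker(path):
    """Append the marker only if it is missing; return True if the file changed."""
    if has_eof_marker(path):
        return False
    # Binary append mode writes the bytes exactly as given.
    with open(path, "ab") as handle:
        handle.write(b"\n" + EOF_MARKER + b"\n")
    return True
```

Running has_eof_marker over the failing files first would show whether the
marker is actually absent before anything is modified.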


Cannot Remove File: Device or resource busy

2010-11-16 Thread Brett Bowman
I'm spawning a subprocess to fix some formatting errors in a library of
PDFs with pdftk:

try:
    sp = subprocess.Popen('pdftk.exe "%s" output %s' % (pdfFile, outputFile))
    sp.wait()
    del sp
except Exception, e:
    return "Unable to open file: %s with error: %s" % (pdfFile, str(e))

And then I test the result:

try:
    pdf_handle = open(outputFile, "rb")
    pdf_pypdf = PdfFileReader(pdf_handle)
    del pdf_pypdf
    del pdf_handle
except Exception, e:
    return "Unable to open file: %s with error: %s" % (outputFile, str(e))

Both of which appear to work.  But when I try to delete the original
pdfFile, I get an error message saying that the file is still in use.

If I use:
    sp = subprocess.Popen('rm "%s"' % pdfFile)
    sp.wait()
I get the message in the subject line - the standard error message from rm

and if I use:
    cwd = os.getcwd()
    os.remove(cwd + "\\" + pdfFile)
I get "WindowsError: [Error 32]", saying much the same thing.

What am I missing?  Any suggestions would be appreciated.

Details:
Python 2.6
Windows XP

Sincerely,
Brett Bowman
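
A sketch of the spawn step that passes the command as an argument list and
checks the exit status, since the Popen/wait code above carries on whether
or not pdftk succeeded.  It is written for Python 3.7 or later (the thread
uses Python 2.6), and run_tool is a hypothetical helper name.  The pdftk
argument order in the comment is taken from the command string in the post.

```python
import subprocess

def run_tool(args):
    """Run a command given as a list of arguments and wait for it.

    Returns (exit_code, stderr_text) so the caller can tell a failed
    conversion from a successful one.
    """
    completed = subprocess.run(args, capture_output=True, text=True)
    return completed.returncode, completed.stderr

# Intended use, mirroring the command string in the post:
#   code, err = run_tool(["pdftk.exe", pdfFile, "output", outputFile])
#   if code != 0:
#       print("pdftk failed for %s: %s" % (pdfFile, err))
```

Because each argument is its own list element, file names with spaces need
no extra quoting.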


Re: Cannot Remove File: Device or resource busy

2010-11-17 Thread Brett Bowman
Good ideas, but I've tried them already:
- Dropping the del commands, or replacing them with a set-to-None, doesn't
  solve my file access problem.
- PdfFileReader has no close() method, so calling it raises an error.
  Weird, but true.
- pdf_handle.close(), on the other hand, runs but fails to solve the
  problem.
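
For comparison, a sketch of the check-then-delete sequence with the handle
scoped by a with block, which closes the file at a known point regardless
of what still refers to it.  This is Python 3; StreamHolder is a
hypothetical stand-in for a reader object that keeps a reference to its
stream, not the real PdfFileReader, and nothing here establishes what was
holding the file open in this thread.

```python
import os

class StreamHolder(object):
    """Hypothetical stand-in for a reader that stores the stream it is given."""
    def __init__(self, stream):
        self.stream = stream

def check_then_delete(path):
    """Read the first bytes through a reader-like object, then delete the file."""
    with open(path, "rb") as handle:
        reader = StreamHolder(handle)
        header = reader.stream.read(5)
    # The with block has closed the handle, even though reader still refers to it.
    if not reader.stream.closed:
        raise RuntimeError("handle unexpectedly still open")
    os.remove(path)
    return header
```

If os.remove still fails after a sequence like this, that would point away
from the script's own file object as the holder of the lock.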

On Tue, Nov 16, 2010 at 11:25 PM, Dennis Lee Bieber
wrote:

> On Tue, 16 Nov 2010 17:37:10 -0800, Brett Bowman 
> declaimed the following in gmane.comp.python.general:
>
> >
> > And then I test the result:
> > try:
> >     pdf_handle = open(outputFile, "rb")
> >     pdf_pypdf = PdfFileReader(pdf_handle)
> >     del pdf_pypdf
> >     del pdf_handle
> > except Exception, e:
> >     return "Unable to open file: %s with error: %s" % (outputFile,
> >                                                        str(e))
> >
> You seem enamored of "del", which is something I've only used for
> special purposes, and even then rarely -- binding a null object to the
> name is just as effective for most uses.
>
> While the common Python does garbage collect objects when the
> reference count goes to zero, there is no real guarantee of this.
>
> I'd replace that
>     del pdf_handle
> with
>     pdf_handle.close()
>
> --
> Wulfraed                 Dennis Lee Bieber         AF6VN
> wlfr...@ix.netcom.com    HTTP://wlfraed.home.netcom.com/


Subprocess Call works on Windows, but not Ubuntu

2010-11-23 Thread Brett Bowman
I ran into an interesting problem trying to spawn a subprocess, so I thought
I'd ask if the experts could explain it to me.  I'm spawning a subprocess to
run "pdf2txt.py", a tool distributed with PDFminer that does moderately
advanced text dumps of PDFs.  Yet when I run the same code on my two dev
machines - one Win XP, the other Ubuntu 10.04 or 10.10 - it only works on
the former and not the latter.  And it's not terribly complicated code.

# Code Start
sp_line = 'python pdf2txt.py -p 1 -o %s "%s"' % ('temp.out', pdf_filename)
print sp_line
sp = subprocess.Popen(sp_line)
sp.wait()
with open('temp.out', 'r') as pdf_handle:
    # Do stuff to read the file

The output from the print statement reads:
python pdf2txt.py -p 1 -o temp.out "Aarts et al (2009).pdf"

That command works on both systems when copied directly to the command line,
and the Python script it is a part of works on the Windows machine, but I
can't get the script to work on Ubuntu for the life of me.  What am I missing?

/b/


Re: Subprocess Call works on Windows, but not Ubuntu

2010-11-23 Thread Brett Bowman
Ah, that fixed it.  Thank you.

On Tue, Nov 23, 2010 at 11:37 AM, Chris Rebert  wrote:

> On Tue, Nov 23, 2010 at 11:28 AM, Brett Bowman  wrote:
> > I ran into an interesting problem trying to spawn a subprocess, so I
> > thought I'd ask if the experts could explain it to me.  I'm spawning a
> > subprocess to run "pdf2txt.py", which is a tool that is distributed
> > with PDFminer to do moderately advanced text-dumps of PDFs.  Yet when
> > I run the same code on my two dev machines - one Win XP, the other
> > Ubuntu 10.04 or 10.10 - it only works on the former and not the
> > latter. And it's not terribly complicated code.
> > # Code Start
> > sp_line = 'python pdf2txt.py -p 1 -o %s "%s"' % ('temp.out', pdf_filename)
> > print sp_line
> > sp = subprocess.Popen(sp_line)
> 
> > python pdf2txt.py -p 1 -o temp.out "Aarts et al (2009).pdf"
> > That command works on both systems when copied directly to the
> > command-line, and the python script it is a part of works on the
> > Windows machine, but I can't get the script to work on Ubuntu for the
> > life of me.  What am I missing?
>
> Quoting the docs (for the Nth time; emphasis added):
> """
> On Unix, with shell=False (default): args should normally be a
> sequence. ***If a string is specified for args***, it will be used as
> the name or path of the program to execute; ***this will only work if
> the program is being given no arguments.***
> """ http://docs.python.org/library/subprocess.html#subprocess.Popen
>
> Fixed version:
> sp_args = ['python', 'pdf2txt.py', '-p', '1', '-o', 'temp.out',
> pdf_filename]
> sp = subprocess.Popen(sp_args)
>
> Cheers,
> Chris
> --
> http://blog.rebertia.com
>
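
When a command already exists as a single string, the standard library's
shlex.split can turn it into the argument list that Popen expects on Unix.
A short sketch using the command line printed in the original post (Python 3
syntax; quoting rules differ on Windows, so this is aimed at the Ubuntu
case):

```python
import shlex

# The exact command line printed by the script in the original post.
sp_line = 'python pdf2txt.py -p 1 -o temp.out "Aarts et al (2009).pdf"'

# Split on whitespace while honouring the double quotes around the file name.
sp_args = shlex.split(sp_line)
print(sp_args)
# ['python', 'pdf2txt.py', '-p', '1', '-o', 'temp.out', 'Aarts et al (2009).pdf']
```

The quoted file name comes out as one list element with the quotes removed,
which is the form the fixed version above builds by hand.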


Copy Protected PDFs and PIL

2010-11-11 Thread Brett Bowman
I'm trying to parse some basic details and a thumbnail from ~12,000 PDFs for
my company, but a few hundred of them are copy protected.  To make matters
worse, I can't seem to trap the error it causes: whenever it happens PIL
throws a "FATAL PDF disallows copying" message and dies.  An automated way
to snap a picture of the PDFs would be ideal, but I'd settle for a way to
skip over them without crashing my program.

Any tips?

Brett Bowman
Bioinformatics Associate
Cibus LLC


Re: Copy Protected PDFs and PIL

2010-11-11 Thread Brett Bowman
Windows currently, though I also have a Linux box running Ubuntu if need be.

On Thu, Nov 11, 2010 at 12:28 PM, Brett Bowman  wrote:

> I'm trying to parse some basic details and a thumbnail from ~12,000 PDFs
> for my company, but a few hundred of them are copy protected.  To make
> matters worse, I can't seem to trap the error it causes: whenever it happens
> PIL throws a "FATAL PDF disallows copying" message and dies.  An automated
> way to snap a picture of the PDFs would be ideal, but I'd settle for a way
> to skip over them without crashing my program.
>
> Any tips?
>
> Brett Bowman
> Bioinformatics Associate
> Cibus LLC
>


Re: Copy Protected PDFs and PIL

2010-11-12 Thread Brett Bowman
To answer various questions:

MRAB -
I've tried worker threads, and it kills the thread only and not the program
as a whole.  I could use that as a work-around, but I would prefer something
more direct, in case other problems arise.

Steve Holden -
A traceback sounds like a great idea, but I don't know how to go about it,
or know what is involved.  Could you suggest a tutorial I could follow?
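
For reference, the standard library's traceback module can capture the full
traceback as text inside an except block.  A small Python 3 sketch, with
describe_failure as a hypothetical helper name; it only helps for errors
raised as Python exceptions, not for a process that exits outright.

```python
import traceback

def describe_failure(func, *args):
    """Call func(*args); return None on success or the traceback text on error."""
    try:
        func(*args)
        return None
    except Exception:
        # format_exc() returns the same text the interpreter would print.
        return traceback.format_exc()

report = describe_failure(int, "not a number")
if report is not None:
    print(report)
```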

Emile van Sebille -
A try/except block was the first thing I tried, and it still dies with a
fatal error, even if I use a generic except.

Robert Kern -
Ah, whoops, good catch.  I meant to say gfx and swftools.  I'm using PIL to
modify the images once I get a PNG from swftools, and I misspoke.
The code in question is:

import gfx
print "1"
doc = gfx.open("pdf", MY_FILE)
print "2"
page1 = doc.getPage(1)
print "3"
g_img = gfx.ImageList()
print "4"
g_img.startpage(page1.width, page1.height)
print "5"
page1.render(g_img)
print "6"
g_img.endpage()
print "7"
g_img.save(TEMP_PNG)

which prints the following:

1
2
3
4
5
FATAL PDF disallows copying

Any help or suggestions would be appreciated.

/b/
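
If the FATAL message comes from native code that terminates the interpreter
(an assumption; the thread does not confirm it, but it would explain why
try/except never fires), one workaround is to do each render in a child
process and inspect its exit status.  A minimal Python 3 sketch;
run_isolated is a hypothetical helper, and the snippet passed to it would
contain the gfx calls from the post.

```python
import subprocess
import sys

def run_isolated(snippet, args=(), timeout=60):
    """Run a Python snippet in a separate interpreter.

    Returns (ok, stderr_text).  A crash or fatal exit in the child only
    ends the child, so the calling loop can skip that file and go on.
    """
    command = [sys.executable, "-c", snippet] + list(args)
    completed = subprocess.run(command, capture_output=True, text=True,
                               timeout=timeout)
    return completed.returncode == 0, completed.stderr.strip()
```

A loop over the 12,000 files could call this once per PDF, passing the file
name in args and logging the ones for which ok is False.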

On Thu, Nov 11, 2010 at 12:28 PM, Brett Bowman  wrote:

> I'm trying to parse some basic details and a thumbnail from ~12,000 PDFs
> for my company, but a few hundred of them are copy protected.  To make
> matters worse, I can't seem to trap the error it causes: whenever it happens
> PIL throws a "FATAL PDF disallows copying" message and dies.  An automated
> way to snap a picture of the PDFs would be ideal, but I'd settle for a way
> to skip over them without crashing my program.
>
> Any tips?
>
> Brett Bowman
> Bioinformatics Associate
> Cibus LLC
>