On 27/09/2010 01:39, flebber wrote:
On Sep 27, 9:38 am, "w.g.sned...@gmail.com"<w.g.sned...@gmail.com>
wrote:
On Sep 26, 7:10 pm, flebber<flebber.c...@gmail.com>  wrote:

I was trying to use pyPdf following a recipe from the ActiveState
cookbooks. However, I cannot get it to work. I am unsure if it is me or if it
is because sets are deprecated.

I have placed a PDF in my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.

I was using the last script on that page, the one most recently
updated. I am using Python 2.6.

http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...

import pyPdf

def getPDFContent(path):
     content = "C:\Components-of-Dot-NET.pdf"
     # Load PDF into pyPDF
     pdf = pyPdf.PdfFileReader(file(path, "rb"))
     # Iterate pages
     for i in range(0, pdf.getNumPages()):
         # Extract text from page and add to content
         content += pdf.getPage(i).extractText() + "\n"
     # Collapse whitespace
     content = " ".join(content.replace(u"\xa0", " ").strip().split())
     return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
"ignore")

This is my error.

Warning (from warnings module):
   File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
     from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated

Traceback (most recent call last):
   File "C:/Python26/Pdfread", line 15, in <module>
     print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
   File "C:/Python26/Pdfread", line 6, in getPDFContent
     pdf = pyPdf.PdfFileReader(file(path, "rb"))

--->  IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'

Looks like an issue with finding the file.
How do you pass the path?

Okay, thanks. I thought that when I set content here

def getPDFContent(path):
     content = "C:\Components-of-Dot-NET.pdf"

that I was defining where the file is.

But yeah, I updated the script to the one below and it works. That is, the
contents are displayed in the interpreter. How do I output to a .txt file?

import pyPdf

def getPDFContent(path):
     content = "C:\Components-of-Dot-NET.pdf"

That simply binds to a local name; 'content' is a local variable in the
function 'getPDFContent'.

     # Load PDF into pyPDF
     pdf = pyPdf.PdfFileReader(file(path, "rb"))

You're opening a file whose path is in 'path'.

     # Iterate pages
     for i in range(0, pdf.getNumPages()):
         # Extract text from page and add to content
         content += pdf.getPage(i).extractText() + "\n"

That appends to 'content'.

     # Collapse whitespace

'content' now contains the text of the PDF, starting with r"C:\Components-of-Dot-NET.pdf".

     content = " ".join(content.replace(u"\xa0", " ").strip().split())
     return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
"ignore")

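The point about 'content' can be checked without pyPdf or a real PDF. The
following is only a sketch: 'fake_pages', 'get_content_buggy' and
'get_content_fixed' are hypothetical names, and the list of strings stands in
for whatever extractText() would return for each page.

```python
# Hypothetical stand-in for the per-page text pyPdf would extract, so this
# runs without the library or a real PDF.
fake_pages = ["First page text", "Second  page\ntext"]

def get_content_buggy(path, pages):
    # This only binds a local name; it does not tell the function where the
    # file is. The file to open is whatever is passed in as 'path'.
    content = r"C:\Components-of-Dot-NET.pdf"
    for page_text in pages:
        content += page_text + "\n"
    # Collapse whitespace
    return " ".join(content.split())

def get_content_fixed(path, pages):
    # Start from an empty string so only the extracted text is returned.
    # In the real script, 'path' is what gets handed to PdfFileReader.
    content = ""
    for page_text in pages:
        content += page_text + "\n"
    # Collapse whitespace
    return " ".join(content.split())

buggy = get_content_buggy("ignored.pdf", fake_pages)
fixed = get_content_fixed("ignored.pdf", fake_pages)
```

Here 'buggy' begins with the path string, while 'fixed' holds only the page
text, which suggests initialising 'content' to "" in the real function.
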
Outputting to a .txt file is simple: open the file for writing using
'open', write the string to it, and then close it.
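
A minimal sketch of those three steps follows. The helper name 'save_text',
the sample string and the output file name are made up for illustration, and
the temporary directory is used only so that the example runs anywhere.

```python
import os
import tempfile

def save_text(text, out_path):
    # Open the file for writing, write the string, then close the file.
    f = open(out_path, "w")
    try:
        f.write(text)
    finally:
        # Close the file even if the write fails.
        f.close()

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "pdf_contents.txt")
save_text("text extracted from the PDF", out_path)
```

In the script above, the string to write would presumably be the encoded
result of getPDFContent(...) rather than a literal.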
--
http://mail.python.org/mailman/listinfo/python-list