On Fri, Mar 28, 2008 at 10:15 PM,  <[EMAIL PROTECTED]> wrote:
>  our state legislature has all their reports online in PDF format, i
>  was hoping to scrape 'em and get them and use them with django to
>  create something similar to what adrian did with the w-p and others
>  have done.

There are a couple of freely available libraries that can extract
text from PDFs. pyPdf [1], for example, is BSD-licensed, seems to be
actively maintained, and can read the text out of a PDF for you. From
there you can pretty easily fiddle with the text; the Python Cookbook
has a recipe [2] for reading the text from a PDF programmatically.
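
As a rough sketch of what the pyPdf side might look like (untested
against your reports; the helper names pdf_text and open_report are
just ones I made up, and the method names are the ones I remember from
pyPdf's PdfFileReader):

```python
def pdf_text(reader):
    """Return the text of every page, joined with newlines.

    `reader` is assumed to be a pyPdf PdfFileReader, or anything else
    exposing getNumPages() and getPage(i).extractText().
    """
    pages = []
    for i in range(reader.getNumPages()):
        pages.append(reader.getPage(i).extractText())
    return "\n".join(pages)


def open_report(path):
    """Open a PDF file with pyPdf (hypothetical helper).

    The import is done lazily so pdf_text() can be tried out with a
    stand-in reader even where pyPdf is not installed.
    """
    from pyPdf import PdfFileReader
    return PdfFileReader(open(path, "rb"))
```

You'd call it as pdf_text(open_report("report.pdf")). How clean the
extracted text is depends a lot on how the PDF was produced, so expect
to do some cleanup afterwards.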

For getting data from PDF into a database, I (personally) generally
convert to an intermediate format like CSV, which has the advantage of
also working in a lot of spreadsheet tools for people to browse while
you're getting the DB import going.
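
A minimal sketch of that intermediate step, using only the standard
library. The assumption that columns in the extracted text are
separated by two or more spaces is mine, purely for illustration; your
reports will almost certainly need their own parsing rule. The
function names are made up too.

```python
import csv
import io
import re


def lines_to_rows(text):
    """Split extracted text into rows of fields.

    Assumes (hypothetically) that columns are separated by runs of
    two or more whitespace characters; blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows, out):
    """Write rows to an open file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerows(rows)


if __name__ == "__main__":
    sample = "HB 101  Education  Passed\nSB 7  Roads, Bridges  Failed\n"
    buf = io.StringIO()
    rows_to_csv(lines_to_rows(sample), buf)
    print(buf.getvalue())
```

The csv module takes care of quoting fields that contain commas, so
the result opens cleanly in a spreadsheet and can be read back with
csv.reader when you get to the DB import.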


[1] http://pybrary.net/pyPdf/
[2] http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/511465


-- 
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
