On Mar 29, 12:56 am, "James Bennett" <[EMAIL PROTECTED]> wrote:
> On Fri, Mar 28, 2008 at 11:21 PM,  <[EMAIL PROTECTED]> wrote:
> >  I'm trying to figure out the best way to link-up everything. Any
> >  suggestions?
>
> So, since I talked about it at PyCon, I'll take an example from this
> project:
>
> http://www2.ljworld.com/data/crime/ku/
>
> And I'll walk through this in a bit more detail than I did in my PyCon
> lightning talk, since I have more than five minutes to explain the
> process; of course, each data set you'll encounter will require some
> unique work to handle, but the general process is the same each time.
>
> The data was originally tables embedded in Microsoft Word documents,
> which were converted to HTML and then scraped with BeautifulSoup; the
> raw data was in the form of rows which looked like this:
>
> 04/23/2007 U077034 21-3701 14-304 1318 LOUISIANA 103 B 13 4 Building - Housing 88
> 11/26/2007 U0723675 21-3508a1 14-602 1323 OHIO   13 4 Building - Housing 88
> 08/14/2007 U0714884 21-3701a2 14-304 1515 ENGEL 307 O 12 4 Building - Housing 88
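>
> The scraping step itself was nothing fancy; roughly something like
> this (a sketch -- the exact tag structure depends on how the Word
> documents came out as HTML, so treat the details as assumptions):
>
> from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x
>
> def extract_rows(html):
>     # Yield the text of each table cell, one list per table row.
>     soup = BeautifulSoup(html)
>     for tr in soup.findAll('tr'):
>         cells = [''.join(td.findAll(text=True)).strip()
>                  for td in tr.findAll('td')]
>         if cells:
>             yield cells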
>
> Some of this is irrelevant administrative stuff, so the bits I cared
> about here were:
>
> * The first field, which is the date of the crime report.
>
> * The third and fourth fields, which have data on the relevant
>   statute.
>
> * The fifth and sixth fields, which are the street address for the
>   report.
>
> I'd set up several models, but the relevant ones here were named
> ``ResidenceHall``, ``Offense`` and ``Crime``, where ``Crime``
> represents the actual report, and has foreign keys to ``Offense``
> (representing the particular crime -- burglary, vandalism, etc.) and
> to ``ResidenceHall``.
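>
> Skeletally, the models looked something like this (the field names
> here are illustrative rather than copied from the real code):
>
> from django.db import models
>
> class ResidenceHall(models.Model):
>     name = models.CharField(max_length=250)
>
> class Offense(models.Model):
>     name = models.CharField(max_length=250)
>
> class Crime(models.Model):
>     report_date = models.DateField()
>     offense = models.ForeignKey(Offense)
>     residence_hall = models.ForeignKey(ResidenceHall)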
>
> What I ended up dumping to CSV was a set of data which looked like
> this:
>
> 2007-04-23,Theft,K.K. Amini Scholarship Hall
> 2007-11-26,Lewd and lascivious behavior,Dennis E. Rieger Scholarship Hall
> 2007-08-14,Theft,Templin Hall
>
> The path from the raw data to this CSV was a process of normalization;
> I chose to convert to a format of:
>
> report date,offense,residence hall
>
> This was largely because of a couple of ambiguities in using other
> aspects of the data:
>
> * Though a statute number is unique within a specific legal code, we
>   were dealing with two different codes: the state laws and the city
>   ordinances. The name of the offense, however, was unique across
>   both sets.
>
> * Some residences are actually complexes of multiple buildings, or
>   can otherwise be referred to using multiple street addresses; the
>   name of the residence is unique, though.
>
> So I set up two dictionaries: one mapped statute numbers to names of
> offenses, the other mapped street addresses to names of residence
> halls. From there, it was easy to loop over the raw data, look up the
> offense and the residence, and write out one row of normalized CSV for
> each row of raw data.
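>
> In code form, something like this (the dictionary entries shown are
> just the ones covering the sample rows above, and ``raw_rows`` stands
> in for whatever the scraper produced):
>
> import csv
>
> OFFENSES = {'21-3701': 'Theft',
>             '21-3701a2': 'Theft',
>             '21-3508a1': 'Lewd and lascivious behavior'}
> HALLS = {'1318 LOUISIANA': 'K.K. Amini Scholarship Hall',
>          '1323 OHIO': 'Dennis E. Rieger Scholarship Hall',
>          '1515 ENGEL': 'Templin Hall'}
>
> writer = csv.writer(open('normalized.csv', 'wb'))
> for fields in raw_rows:
>     month, day, year = fields[0].split('/')  # e.g. 04/23/2007
>     writer.writerow(['%s-%s-%s' % (year, month, day),
>                      OFFENSES[fields[2]],
>                      HALLS['%s %s' % (fields[4], fields[5])]])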
>
> This normalized CSV file was then used for a couple of different
> purposes, including some initial exploration of the data by pulling
> it into a spreadsheet; the database import itself was handled by a
> script which:
>
> 1. Read in a row of CSV.
>
> 2. Used ``strptime`` to parse the report date into a ``date`` object.
>
> 3. Looked up the offense by name from the ``Offense`` model.
>
> 4. Looked up the residence hall by name from the ``ResidenceHall``
>    model.
>
> 5. Instantiated and saved a ``Crime`` object from these three pieces
>    of data (since they corresponded to the fields on that model).
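>
> Which comes out to only a few lines -- something along these lines,
> reusing the illustrative field names from the model sketch above:
>
> import csv
> import datetime
>
> for date_str, offense, hall in csv.reader(open('normalized.csv')):
>     Crime.objects.create(
>         report_date=datetime.datetime.strptime(date_str,
>                                                '%Y-%m-%d').date(),
>         offense=Offense.objects.get(name=offense),
>         residence_hall=ResidenceHall.objects.get(name=hall))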
>
> And at that point I had nice, normalized, structured data and I could
> start building out the views of it.
>
> So the broad steps of the process are:
>
> 1. Find a way to get at the raw data so that it's easy to read from
>    Python; in this case, that meant turning Word docs into HTML.
>
> 2. Figure out which parts of the data you care about, and build models
>    to represent them in a structured way.
>
> 3. Read through the raw data and normalize it based on things you can
>    guarantee will be unique to each type of record, and write this out
>    to a standard format like CSV.
>
> 4. Run your database import from the normalized data; by this point
>    it'll be in a format where you can simply look up relations on the
>    fly and fill them in as you create new objects.
>
> Once you get used to it, and get over the initial hurdle of figuring
> out how to read the data from Python, this tends to go pretty quickly;
> as I mentioned in my PyCon talk, this crime-report project had the
> browseable database + views to drill down through the data within two
> days.
>
> --
> "Bureaucrat Conrad, you are technically correct -- the best kind of correct."

I'm working on a similar project now, but it involves an election.  I
have direct access to the data, so screen scraping won't be
necessary.  However, the data is in Excel, and I plan to either
create CSV files from Excel or use Python to read the Excel file
directly and build a CSV file from it.  The issue I see is with
importing the CSV file into the database.  Do I blow away the table
and recreate it every time I import from the CSV file (which will be
updated frequently from about 7pm until 9pm)?  And do I just have one
CSV file, or should I
have one for each race (Mayor, City Council, and School Board)?  We
have 53 voting precincts, each with all three races, and a total of
20 candidates across the three races.  So if I go with only one file,
that means a CSV file with over 1,000 rows (53 precincts x 20
candidates).
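
For the Excel side, I'm thinking of something along these lines with
the xlrd package (the filename and column layout are placeholders
until I see the real spreadsheet):

import csv
import xlrd  # third-party library for reading .xls files

book = xlrd.open_workbook('results.xls')  # placeholder filename
sheet = book.sheet_by_index(0)

writer = csv.writer(open('results.csv', 'wb'))
for rownum in range(1, sheet.nrows):  # skip the header row
    # e.g. columns: precinct, race, candidate, votes
    writer.writerow(sheet.row_values(rownum))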