On Fri, Mar 28, 2008 at 11:21 PM,  <[EMAIL PROTECTED]> wrote:
>  I'm trying to figure out the best way to link-up everything. Any
>  suggestions?

So, since I talked about it at PyCon, I'll take an example from this
project:

http://www2.ljworld.com/data/crime/ku/

And I'll walk through this in a bit more detail than I did in my PyCon
lightning talk, since I have more than five minutes to explain the
process; of course, each data set you'll encounter will require some
unique work to handle, but the general process is the same each time.

The data was originally tables embedded in Microsoft Word documents,
which were converted to HTML and then scraped with BeautifulSoup; the
raw data was in the form of rows which looked like this:

04/23/2007 U077034 21-3701 14-304 1318 LOUISIANA 103 B 13 4 Building - Housing 88
11/26/2007 U0723675 21-3508a1 14-602 1323 OHIO   13 4 Building - Housing 88
08/14/2007 U0714884 21-3701a2 14-304 1515 ENGEL 307 O 12 4 Building - Housing 88

Some of this is irrelevant administrative stuff, so the bits I cared
about here were:

* The first field, which is the date of the crime report.

* The third and fourth fields, which have data on the relevant
  statute.

* The fifth and sixth fields, which are the street address for the
  report.
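In Python terms, pulling those fields out of one whitespace-split row
looked something like this (the helper name is made up for
illustration, and the positions are just read off the sample rows
above -- the real code worked on BeautifulSoup output):

```python
def parse_row(raw_row):
    # Split one raw report row on whitespace and keep only the
    # fields of interest; the trailing administrative fields are
    # simply ignored.
    fields = raw_row.split()
    return {
        "report_date": fields[0],                     # e.g. "04/23/2007"
        "state_statute": fields[2],                   # e.g. "21-3701"
        "city_ordinance": fields[3],                  # e.g. "14-304"
        "address": "%s %s" % (fields[4], fields[5]),  # e.g. "1318 LOUISIANA"
    }
```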

I'd set up several models, but the relevant ones here were named
``ResidenceHall``, ``Offense`` and ``Crime``, where ``Crime``
represents the actual report, and has foreign keys to ``Offense``
(representing the particular crime -- burglary, vandalism, etc.) and
to ``ResidenceHall``.
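The models themselves were nothing fancy; a minimal sketch of that
arrangement (the field names and lengths here are my guesses, not the
actual code from the project) would be:

```python
from django.db import models

class ResidenceHall(models.Model):
    # One named residence hall, which may span several street addresses.
    name = models.CharField(max_length=250)

class Offense(models.Model):
    # The particular crime -- burglary, vandalism, etc.
    name = models.CharField(max_length=250)

class Crime(models.Model):
    # The actual report: a date plus the two relations above.
    report_date = models.DateField()
    offense = models.ForeignKey(Offense)
    residence_hall = models.ForeignKey(ResidenceHall)
```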

What I ended up dumping to CSV was a set of data which looked like
this:

2007-04-23,Theft,K.K. Amini Scholarship Hall
2007-11-26,Lewd and lascivious behavior,Dennis E. Rieger Scholarship Hall
2007-08-14,Theft,Templin Hall

The path from the raw data to this CSV was a process of normalization;
I chose to convert to a format of:

report date,offense,residence hall

largely because of a couple of ambiguities in using other aspects of
the data:

* Though a statute number is unique within a specific legal code, we
  were dealing with two different codes: the state laws and the city
  ordinances. The name of the offense, however, was unique for both
  sets.

* Some residences are actually complexes of multiple buildings, or can
  otherwise be referred to using multiple street addresses; the
  name of the residence is unique, though.

So I set up two dictionaries: one mapped statute numbers to names of
offenses, the other mapped street addresses to names of residence
halls. From there, it was easy to loop over the raw data, look up the
offense and the residence, and write out one row of normalized CSV for
each row of raw data.
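A stripped-down sketch of that normalization step (the two
dictionaries here only hold the sample rows -- the real ones covered
every statute number and street address in the data, and the function
name is made up):

```python
import datetime

# Tiny stand-ins for the real lookup dictionaries.
OFFENSES = {
    "21-3701": "Theft",
    "21-3508a1": "Lewd and lascivious behavior",
}
RESIDENCES = {
    "1318 LOUISIANA": "K.K. Amini Scholarship Hall",
    "1323 OHIO": "Dennis E. Rieger Scholarship Hall",
}

def normalize(report_date, statute, address):
    # Convert a "04/23/2007"-style date to ISO format and look up
    # the offense and residence-hall names.
    date = datetime.datetime.strptime(report_date, "%m/%d/%Y").date()
    return [date.isoformat(), OFFENSES[statute], RESIDENCES[address]]
```

Each list that comes back is one row to hand to ``csv.writer``.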

This normalized CSV file was then used for a couple different
purposes, including some initial exploration of the data by pulling it
into a spreadsheet, and then the database import was handled by
a script which:

1. Read in a row of CSV.

2. Used ``strptime`` to parse the report date into a ``date`` object.

3. Looked up the offense by name from the ``Offense`` model.

4. Looked up the residence hall by name from the ``ResidenceHall``
   model.

5. Instantiated and saved a ``Crime`` object from these three pieces
   of data (since they corresponded to the fields on that model).
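Roughly, that script looked like this -- except that the real thing
used the Django ORM (``Offense.objects.get(name=...)`` and so on);
here I'm using raw sqlite3 and hand-written SQL just to keep the
sketch self-contained, and the table and column names are invented:

```python
import csv
import datetime
import sqlite3

def import_crimes(db, csv_file):
    # 1. Read the normalized CSV one row at a time.
    for raw_date, offense_name, hall_name in csv.reader(csv_file):
        # 2. Parse the ISO-format report date into a date object.
        report_date = datetime.datetime.strptime(raw_date, "%Y-%m-%d").date()
        # 3. Look up the offense by name.
        offense_id = db.execute("SELECT id FROM offense WHERE name = ?",
                                (offense_name,)).fetchone()[0]
        # 4. Look up the residence hall by name.
        hall_id = db.execute("SELECT id FROM residencehall WHERE name = ?",
                             (hall_name,)).fetchone()[0]
        # 5. Create the crime record from those three pieces of data.
        db.execute("INSERT INTO crime (report_date, offense_id, hall_id) "
                   "VALUES (?, ?, ?)",
                   (report_date.isoformat(), offense_id, hall_id))
```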

And at that point I had nice, normalized, structured data and I could
start building out the views of it.


So the broad steps of the process are:

1. Find a way to get at the raw data so that it's easy to read from
   Python; in this case, that meant turning Word docs into HTML.

2. Figure out which parts of the data you care about, and build models
   to represent them in a structured way.

3. Read through the raw data and normalize it based on things you can
   guarantee will be unique to each type of record, and write this out
   to a standard format like CSV.

4. Run your database import from the normalized data; by this point
   it'll be in a format where you can simply look up relations on the
   fly and fill them in as you create new objects.

Once you get used to it, and get over the initial hurdle of figuring
out how to read the data from Python, this tends to go pretty quickly;
as I mentioned in my PyCon talk, this crime-report project had the
browseable database + views to drill down through the data within two
days.





-- 
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
