On Fri, Mar 28, 2008 at 11:21 PM, <[EMAIL PROTECTED]> wrote:
> I'm trying to figure out the best way to link-up everything. Any
> suggestions?
So, since I talked about it at PyCon, I'll take an example from this project: http://www2.ljworld.com/data/crime/ku/

And I'll walk through this in a bit more detail than I did in my PyCon lightning talk, since I have more than five minutes to explain the process; of course, each data set you'll encounter will require some unique work to handle, but the general process is the same each time.

The data was originally tables embedded in Microsoft Word documents, which were converted to HTML and then scraped with BeautifulSoup; the raw data was in the form of rows which looked like this:

    04/23/2007 U077034 21-3701 14-304 1318 LOUISIANA 103 B 13 4 Building - Housing 88
    11/26/2007 U0723675 21-3508a1 14-602 1323 OHIO 13 4 Building - Housing 88
    08/14/2007 U0714884 21-3701a2 14-304 1515 ENGEL 307 O 12 4 Building - Housing 88

Some of this is irrelevant administrative stuff, so the bits I cared about here were:

* The first field, which is the date of the crime report.
* The third and fourth fields, which have data on the relevant statute.
* The fifth and sixth fields, which are the street address for the report.

I'd set up several models, but the relevant ones here were named ``ResidenceHall``, ``Offense`` and ``Crime``, where ``Crime`` represents the actual report, and has foreign keys to ``Offense`` (representing the particular crime -- burglary, vandalism, etc.) and to ``ResidenceHall``. What I ended up dumping to CSV was a set of data which looked like this:

    2007-04-23,Theft,K.K. Amini Scholarship Hall
    2007-11-26,Lewd and lascivious behavior,Dennis E. Rieger Scholarship Hall
    2007-08-14,Theft,Templin Hall

The path from the raw data to this CSV was a process of normalization; I chose to convert to a format of:

    report date,offense,residence hall

largely because of a couple of ambiguities in using other aspects of the data:

* Though a statute number is unique within a specific legal code, we were dealing with two different codes: the state laws and the city ordinances.
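To make the field positions concrete, here's a minimal sketch of pulling out the interesting bits of one raw row (this assumes the row's cells come out as whitespace-separated tokens, which holds for the fields we care about in the sample rows above; the variable names are mine, not from the project):

```python
# One raw row, as scraped (sample taken from the rows above).
raw_row = ("04/23/2007 U077034 21-3701 14-304 1318 LOUISIANA "
           "103 B 13 4 Building - Housing 88")

fields = raw_row.split()
report_date = fields[0]            # "04/23/2007" -- date of the report
statutes = (fields[2], fields[3])  # statute numbers from the two legal codes
address = " ".join(fields[4:6])    # "1318 LOUISIANA" -- street address
```

The trailing administrative fields are simply ignored; only positions 1, 3-4, and 5-6 feed into the normalization step.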
The name of the offense, however, was unique for both sets.

* Some residences are actually complexes of multiple buildings, or can otherwise be referred to using multiple street addresses; the name of the residence is unique, though.

So I set up two dictionaries: one mapped statute numbers to names of offenses, the other mapped street addresses to names of residence halls. From there, it was easy to loop over the raw data, look up the offense and the residence, and write out one row of normalized CSV for each row of raw data.

This normalized CSV file was then used for a couple of different purposes, including some initial exploration of the data by pulling it into a spreadsheet, and then the database import was handled by a script which:

1. Read in a row of CSV.
2. Used ``strptime`` to get a ``date`` object from the report date.
3. Looked up the offense by name from the ``Offense`` model.
4. Looked up the residence hall by name from the ``ResidenceHall`` model.
5. Instantiated and saved a ``Crime`` object from these three pieces of data (since they corresponded to the fields on that model).

And at that point I had nice, normalized, structured data and I could start building out the views of it. So the broad steps of the process are:

1. Find a way to get at the raw data so that it's easy to read from Python; in this case, that meant turning Word docs into HTML.
2. Figure out which parts of the data you care about, and build models to represent them in a structured way.
3. Read through the raw data and normalize it based on things you can guarantee will be unique to each type of record, and write this out to a standard format like CSV.
4. Run your database import from the normalized data; by this point it'll be in a format where you can simply look up relations on the fly and fill them in as you create new objects.
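The normalization loop described above can be sketched like so (the dictionary entries are illustrative samples reconstructed from the rows shown earlier, not the real full mappings, and the ORM field names in the closing comments are assumptions -- only the model names come from the post):

```python
import csv
import sys
from datetime import datetime

# Statute number -> offense name (illustrative entries only).
OFFENSES = {
    "21-3701": "Theft",
    "21-3508a1": "Lewd and lascivious behavior",
}

# Street address -> residence hall name (illustrative entries only).
RESIDENCES = {
    "1318 LOUISIANA": "K.K. Amini Scholarship Hall",
    "1323 OHIO": "Dennis E. Rieger Scholarship Hall",
}

def normalize(raw_rows):
    """Turn raw field lists into (report date, offense, residence) rows."""
    for fields in raw_rows:
        date = datetime.strptime(fields[0], "%m/%d/%Y").date()
        offense = OFFENSES[fields[2]]                  # lookup by statute number
        residence = RESIDENCES[" ".join(fields[4:6])]  # lookup by street address
        yield date.isoformat(), offense, residence

raw = [
    ["04/23/2007", "U077034", "21-3701", "14-304", "1318", "LOUISIANA"],
    ["11/26/2007", "U0723675", "21-3508a1", "14-602", "1323", "OHIO"],
]
writer = csv.writer(sys.stdout)
for row in normalize(raw):
    writer.writerow(row)

# The import script then just reverses these lookups against the ORM,
# roughly (field names here are guesses, not from the post):
#   offense = Offense.objects.get(name=offense_name)
#   hall = ResidenceHall.objects.get(name=hall_name)
#   Crime.objects.create(date=report_date, offense=offense, residence_hall=hall)
```

Because each offense name and residence name is unique, the import-time lookups are unambiguous, which is the whole point of normalizing on those fields.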
Once you get used to it, and get over the initial hurdle of figuring out how to read the data from Python, this tends to go pretty quickly; as I mentioned in my PyCon talk, this crime-report project had the browseable database + views to drill down through the data within two days.

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."