My web app hits the database up to 3,000 times per page because it has to 
retrieve data from lots of related models. It has a fairly normalised 
database with a structure similar to:

- one project has many contracts
- one project has many tenders (a tender is put forward by a contractor)
- one contract has many jobs

(there are a few extra relations not shown for simplicity). One of the 
pages, for example, lists details for every contract a contractor has 
tendered for. I've naively implemented this by treating the Django models 
as I would ordinary OO models, resulting in about 3,000 database hits for 
that one page. This is because of code like:

qs = Tender.objects.filter(filter_criteria)  # filter_criteria stands in for the real lookup
for tender in qs:
    print(tender.getProject())

where getProject() looks something like:

def getProject(self):
    return self.project

I think the extra level of indirection means that Django can't tell that 
the 'project' object for every tender in the queryset should be fetched up 
front, so it ends up hitting the database once per loop iteration (the 
classic N+1 query problem). At the moment I do this in quite a few places, 
which is what adds up to the 3,000 database hits.
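
An easy way to confirm this pattern is to count queries via 
connection.queries, which Django only records when DEBUG=True. A minimal 
sketch (reusing the placeholder filter_criteria from above) that shows 
roughly one extra query per tender:

from django.db import connection, reset_queries

reset_queries()
qs = Tender.objects.filter(filter_criteria)  # same placeholder filter as above
for tender in qs:
    tender.getProject()  # each access lazily issues its own SELECT for the project
print(len(connection.queries))  # roughly 1 + number of tenders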

One way of solving this is for the query to fetch all of the related data 
up front, something like:

qs = (Tender.objects
      .select_related('project')              # FK relations fetched via a join
      .prefetch_related('project__job_set'))  # reverse relations fetched in follow-up queries

(the actual example has many more relations in the prefetch_related). This 
fixes the problem of hitting the database in loops such as the one above, 
and the complexity can be hidden in a data retrieval layer that the client 
calls into:

returnedTenders = getTendersForContractor(contractor)  # the actual query, with select_related/prefetch_related, is done in this function
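
For concreteness, getTendersForContractor might look roughly like the 
sketch below; the 'contractor' field name on Tender is an assumption of 
mine rather than the real model:

def getTendersForContractor(contractor):
    # NB: the 'contractor' FK name is assumed for illustration
    return (Tender.objects
            .filter(contractor=contractor)
            .select_related('project')
            .prefetch_related('project__job_set'))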

However, there are a few drawbacks to this:

* The major one is that data may be fetched that the caller doesn't 
require. E.g. if there was another page that only listed the ids of the 
tenders, a lot of data would have been fetched that is never used.
** You could mitigate this by having the client specify which data needs 
prefetching (see the sketch after this list), but
*** this exposes implementation detail to the client, and
*** the function calling into the data retrieval layer may not know what 
data needs prefetching; it may be a function higher up the call chain (the 
one that actually uses the data) that has that information.
* Each prefetch_related hits the database once - a lot better than 3,000 
hits, but not as minimal as a single SQL join across all of the tables.
* prefetch_related selects the related objects with a 'WHERE id IN (list 
of ids)' clause, which arguably can be slower.
* It seems brittle - if you change the data model, you have to change all 
of the queries in the data retrieval layer. Denormalising the data would 
reduce this, but we'd still have the problem of retrieving more data than 
the client may want.
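
To make that mitigation concrete, here is a minimal sketch of a retrieval 
function that accepts eager-loading hints from the caller; the parameter 
names ('related', 'prefetches') and the 'contractor' field are 
hypothetical:

def getTendersForContractor(contractor, related=(), prefetches=()):
    # 'related' and 'prefetches' are hypothetical hints passed down by the caller;
    # with no hints you get a plain, lazily loaded queryset
    qs = Tender.objects.filter(contractor=contractor)
    if related:
        qs = qs.select_related(*related)
    if prefetches:
        qs = qs.prefetch_related(*prefetches)
    return qs

# A page that needs project and job details asks for them explicitly:
tenders = getTendersForContractor(contractor,
                                  related=('project',),
                                  prefetches=('project__job_set',))

# A page that only needs ids skips the hints entirely:
tender_ids = getTendersForContractor(contractor).values_list('id', flat=True)

Of course, this is exactly the 'exposes implementation detail' trade-off 
noted in the list.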

What are the group's suggestions/advice?

