WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

john . parton Wed, 03 Feb 2016 08:04:21 -0800

Greetings!

I'm trying to refactor a query to avoid using QuerySet.extra (as per the 
recommendation here: 
https://docs.djangoproject.com/en/1.9/ref/models/querysets/#extra)


Here's a simplified version of my code:

# testapp.models

class Parent(models.Model):
    pass

class Child(models.Model):
    parent = models.ForeignKey('testapp.Parent')

This is the query I am attempting to replace:

Parent.objects.extra(where=["""
    NOT EXISTS (SELECT 1 FROM testapp_child WHERE testapp_child.parent_id = 
testapp_parent.id)
"""])

This is extremely similar to, but not the same as

Parent.objects.exclude(id__in=Child.objects.all().values('parent_id'))

Both queries should return the same rows, but the biggest difference is 
that PostgreSQL's query planner produces completely different plans. The 
performance difference can be huge, even for a relatively modest data set. 
(I believe my data set has something like 270k "parent" instances and 80k 
"child" instances.)

I tried searching this group as well as the ticket system, and couldn't 
find a solution to this exact problem (other than using the alternative 
query).

Thanks for taking the time to read through all of this! I am more than 
happy to open a ticket, but I thought I should post here first. 


More Technical Stuff

Here's the output of EXPLAIN ANALYZE on my actual models/data. I had to 
apply a LIMIT 100 because if I try to return the entire result, the second 
one completely hangs my computer. I bolded some relevant bits

Limit  (cost=0.71..9.70 rows=100 width=111) (actual time=0.074..0.483 
rows=100 loops=1)
  ->  *Merge Anti Join*  (cost=0.71..22435.69 rows=249613 width=111) 
(actual time=0.072..0.465 rows=100 loops=1)
        Merge Cond: (catalogue_product.id = 
catalogue_productcategory.product_id)"
        ->  *Index Scan* using catalogue_product_pkey on catalogue_product  
(cost=0.42..16986.70 rows=292339 width=111) (actual time=0.020..0.285 
rows=150 loops=1)
        ->  *Index Only Scan* using catalogue_productcategory_9bea82de on 
catalogue_productcategory  (cost=0.29..3861.27 rows=81671 width=4) (actual 
time=0.037..0.070 rows=101 loops=1)
              Heap Fetches: 0
Total runtime: 

*0.568 ms*Here's the plan for the second query:
Limit  (cost=0.00..234229.95 rows=100 width=111) (actual 
time=31.087..1165.022 rows=100 loops=1)
  ->  *Seq Scan* on catalogue_product  (cost=0.00..342373919.34 rows=146170 
width=111) (actual time=31.087..1164.992 rows=100 loops=1)
        Filter: (NOT (SubPlan 1))
        Rows Removed by Filter: 5
        SubPlan 1
          ->  *Materialize*  (cost=0.00..2138.07 rows=81671 width=4) 
(actual time=0.001..5.539 rows=80818 loops=105)
                ->  *Seq Scan* on catalogue_productcategory u0  
(cost=0.00..1409.71 rows=81671 width=4) (actual time=0.005..10.574 
rows=81671 loops=1)
Total runtime: 
*1165.255 ms*
As you can see the former is about ~2000 times faster than the latter.

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/efc7a911-1ec7-4dc6-9b5a-c31832c071f4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

Reply via email to