I've been working on a bulk_update method for a project to allow saving
many instances of a Model in a single call/query, an analog to the existing
bulk_create method.
There seems to be some interest in this as evidenced by a library
django-bulk-update <https://github.com/aykut/django-bulk-update> (which
isn't working for me in tests with python3 and SQLite) and a recent ticket
<https://code.djangoproject.com/ticket/23646>.
The ticket was closed as wontfix with an invitation for discussion,
although the author doesn't seem to have pursued it. However, Shai Berger
did comment positively on the ticket and proposed a different API, similar
to the one in django-bulk-update and my own prototype:
Book.objects.update_many(books, 'price')
My own attempt was on Django 1.7 using a subclass of ExpressionNode passed
as a value into UpdateQuery.add_update_values which allows Django to
generate most of the UPDATE SQL, while the expression subclass was only
responsible for generating an SQL CASE expression. This quickly generalized
into a Case expression class which could be passed as a value into
QuerySet.update.
But now that Query Expressions have been merged, I've done a quick port and
the Case expression now works with updates, annotations and aggregates, and
is used to power my bulk_update method.
A simple performance test shows that this works quite well. I tested the
following methods of updating many database rows, using an in memory SQLite
database :
# loop_save
for o in objects:
o.save(update_fields=['field'])
# bulk_update
MyModel.objects.bulk_update(objects, update_fields=['field'])
# many_updates
q = Q(condition__lt=test_value)
MyModel.objects.filter(q).update(field=value1)
MyModel.objects.filter(~q).update(field=value2)
# case_update
q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(q, value1)], default=value2,
output_field=MyModel._meta.get_field('field')))
With 10000 objects in the database, and updating all of them I got these
run times using clock():
loop_save: 3.022845626653664
bulk_update: 0.1785595393377175
many_updates: 0.009768530320993563
case_update: 0.009343744353718986
We can see that bulk_update outperforms looping by an order of magnitude,
despite working by generating a CASE expression with a WHEN clause
for every object conditioned on its pk.
Using Case in an update has no significant performance improvement, over
running multiple queries. It may arguably allow for more readable code when
using many conditions since the semantics of case are that conditions are
evaluated in order, like a Python if ... elif ... else. Compare this:
MyModel.objects.filter(Q(condition__lt=test_value1)).update(field=value1)
MyModel.objects.filter(Q(condition__gte=test_value1,
condition__lt=test_value2)).update(field=value2)
MyModel.objects.filter(Q(condition__gte=test_value2,
condition__lt=test_value3)).update(field=value3)
MyModel.objects.filter(Q(condition__gte=test_value3,
condition__lt=test_value4)).update(field=value4)
MyModel.objects.filter(Q(condition__gte=test_value4)).update(field=value5)
as opposed to:
q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(Q(condition__lt=test_value1), value1),
(Q(condition__lt=test_value2), value2),
(Q(condition__lt=test_value3), value3),
(Q(condition__lt=test_value4), value4)],
default=value5,
output_field=MyModel._meta.get_field('field')))
With only 10 objects in the database, and updating all of them the times
are:
loop_save: 0.004131221428270882
bulk_update: 0.0006249582329893796
many_updates: 0.0006562061446388481
case_update: 0.00036259952923938313
So the bulk_update performance gains are still there, and update using a
Case expression if faster because the SQL generation time dominates in this
example, and only one query needs to be generated and executed.
Case expressions can also be used in other ways (although these examples
use a subclass SimpleCase, which is described in a little more detail
below, for simpler syntax):
# annotation
MyModel.objects.annotate(status_text=SimpleCase('status', [('S',
'Started'), ('R', 'Running'), ('F', 'Finished')],
default='Unknown',
output_field=CharField()))
# aggregation
MyModel.objects.aggregate(started=Sum(SimpleCase('status', [('S', 1)],
default=0, output_field=IntegerField())),
running=Sum(SimpleCase('status', [('R', 1)],
default=0, output_field=IntegerField())),
finished=Sum(SimpleCase('status', [('F', 1)],
default=0, output_field=IntegerField())))
In these examples SimpleCase is a subclass of Case that generates the SQL
simple CASE expression which is an equality test against a field.
So is this something that would be a worthwhile addition to core?
It could work now as a separate module, but the update example requires
setting output_field. My original Django 1.7 code abused the
prepare_database_save method that SQLUpdateCompiler called on an
ExpressionNode to get the field that the expression would be assigned to
automatically. After the merge of Query Expression SQLUpdateCompiler calls
either resolve_expression or prepare_database_save, so I can no longer use
this. A change in SQLUpdateCompiler would be required with a new API for
settings output_field automatically on ExpressionNodes used as update
values.
If this were considered for core, I would appreciate input on the API. I
currently have:
Case(list_of_case_tuples, default, output_field) - A general CASE
expression called a searched case in the SQL spec. list_of_case_tuples is
an iterable of tuples of the form (Q_object, value), and default is the
value for the ELSE clause. This generates SQL like:
CASE WHEN n > 0 THEN 'positive' WHEN n < 0 THEN 'negative' ELSE 'zero' END
SimpleCase(fieldname, list_of_simple_case_tuples, default, output_field) -
A simple case expression in the SQL spec. fieldname is a field identifier
like in an F expression, list_of_simple_case_tuples is an iterable of
tuples of the form (condition_value, result_value). This generates SQL like:
CASE n WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END
Model.objects.bulk_update(list_of_instances, update_fields) - An analog to
the bulk_create method that save changes in many model instances. list of
instances is an iterable of model instances, update_fields is the same as
the argument to Model.save of the same name.
Of course I will put up a branch on github if there is interest in this
proposal. Thanks for your time and sorry for the wall of text.
- Michael
--
You received this message because you are subscribed to the Google Groups
"Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit
https://groups.google.com/d/msgid/django-developers/000de073-e26e-4b79-ae1b-67ba2f3d7180%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.