I have long thought that bulk_save(objs) QuerySet method would be an
useful addition to Django. The method would be defined as:

        Save the objects to the database using fastest available method.
        The objs can contain both objects that already exist in the
        database and thus need to be updated, and objects that need to
        be inserted to the database. The method is allowed to fail due
        to concurrent modifications to the database.

The basic implementation would be:
  1. Check which objects have primary key set
  2. For those objects which have primary key set, check which objects
exists in the database
  3. Update the objects which exists in the database in bulk (this step
would be similar to the proposal in this thread)
  4. Insert the rest of the objects in bulk

On backends that support merge or upsert, Django could instead just do a
single statement upsert/merge to the table.

Use cases would be fixture loading, synchronizing data from external
data sources and other bulk loading cases.

If this method were available there would be no need for the multi-value
update statement.

I did a quick benchmark comparing multi-rows update of 1000 rows to
single row updates. There is almost an order of magnitude difference in
speed: around 17-18 milliseconds compared to 115 milliseconds. The test
script was:

from datetime import datetime
from django.db import transaction
from django.db import connection
cursor = connection.cursor()
cursor.execute("create table foo1 (id integer primary key, val1 text, val2 
date)")
with transaction.atomic():
    for i in range(0, 10000):
        cursor.execute("insert into foo1 values(%s, %s, now())", (i, str(i)))
for j in range(0, 10):
    cursor.execute("vacuum full foo1")
    with transaction.atomic():
        start = datetime.now()
        values_str = ','.join('(%s, %s)' for i in range(0, 1000))
        params = []
        for i in range(0, 1000):
            params.extend([i + i, i])
        cursor.execute("update foo1 set val1 = vals.iplusi "
                       "from (values %s) as vals(iplusi, i) where id = vals.i" 
% values_str,
                       params)
        print("update %s: %s" % (j, datetime.now() - start))
for j in range(0, 10):
    cursor.execute("vacuum full foo1")
    with transaction.atomic():
        start = datetime.now()
        for i in range(0, 1000):
            cursor.execute("update foo1 set val1 = %s, val2 = now() where id = 
%s", (str(i + i), i))
        print("update %s: %s" % (j, datetime.now() - start))

The results were:
[Multi-update]
update 0: 0:00:00.022665
update 1: 0:00:00.018951
update 2: 0:00:00.016230
update 3: 0:00:00.017642
update 4: 0:00:00.015061
update 5: 0:00:00.016559
update 6: 0:00:00.018298
update 7: 0:00:00.017187
update 8: 0:00:00.014943
update 9: 0:00:00.014960
[One row per query update]
update 0: 0:00:00.114907
update 1: 0:00:00.115965
update 2: 0:00:00.109886
update 3: 0:00:00.118936
update 4: 0:00:00.114295
update 5: 0:00:00.119926
update 6: 0:00:00.117856
update 7: 0:00:00.115348
update 8: 0:00:00.115445
update 9: 0:00:00.122262

The more complex the table, the less difference I expect there to be as
the overhead of single row update processing is less compared to all the
other work required by the query. On the other hand if ran against a
database on remote host (as opposed to localhost PostgreSQL in the
test), the larger the network overhead per query would be, and thus the
results could look a lot worse for single row updates. Using localhost
TCP connection instead of socket connection changes the single row
update execution time to 200ms, while the multi-row update case's
execution time doesn't change noticeably.

 - Anssi

On Mon, 2014-10-20 at 01:22 -0500, Javier Guerra Giraldez wrote:
> On Mon, Oct 20, 2014 at 1:03 AM, Anshuman Aggarwal
> <[email protected]> wrote:
> > The idea of having a .update() ORM construct is to be able to do this
> > without having to fall down to a manual transaction every time, otherwise
> > why have a DB level <qs>.update()...I am sure the performance of above
> > pseudo code would be about the same (or sufficiently small as to be
> > ignorable)...
> 
> 
> i'm sure it's far from ignorable.  <qs>.update() generates an UPDATE
> ... WHERE ... statement, not specifying each record to be updated,
> just the criteria to let the DB engine choose them.  Also, the new
> field content is either a constant or specified as a function of other
> fields (when using F() objects); again, not specified individually,
> but let to the DB engine to work it out.
> 
> in sum, it's a _lot_ less data to prepare, send, and interpret and
> uses a heavily optimized, totally SQL statement.  Not having it would
> be a big missing feature.
> 
> -- 
> Javier
> 


-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/1413793246.21107.175.camel%40TTY32.
For more options, visit https://groups.google.com/d/optout.

Reply via email to