Hi,
On 04/14/2018 02:38 AM, Mike Dewhirst wrote:
Does it actually stop users reading? If the entire migration happens
in a single transaction, the database (Postgres anyway) should remain
accessible until the moment it is committed.
Maybe you could announce a maintenance operation which will only
interrupt certain actions for a few minutes?
I am not sure I fully understand how that works. If the migration is
atomic, is it true that:
- users can read normally between the beginning of the migration and
the beginning of the transaction commit, and they will get the old data
- users trying to read during the transaction commit will have to
wait for it to finish, and they will get the migrated data
- users trying to write at any time between the beginning of the
migration and the end of the transaction commit will have to wait for it
to finish, and they might overwrite the migrated data
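For reference, what I mean by an atomic data migration is roughly the
following (app, model and field names are just placeholders); on PostgreSQL
the atomic attribute defaults to True, so the RunPython step commits or
rolls back together with the rest of the migration:

from django.db import migrations


def forwards(apps, schema_editor):
    # Historical model lookup; "myapp" and "Thing" are placeholder names.
    Thing = apps.get_model("myapp", "Thing")
    Thing.objects.filter(status="draft").update(status="pending")


class Migration(migrations.Migration):

    # Default on PostgreSQL: the whole migration, including the RunPython
    # step below, runs inside a single transaction.
    atomic = True

    dependencies = [("myapp", "0001_initial")]

    operations = [
        migrations.RunPython(forwards, migrations.RunPython.noop),
    ]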
If it works like that, I think the solution I had in mind is good enough
for my needs:
- process the queryset in chunks (a rough sketch follows this list). Note
that if you want to use prefetch_related, you can't use a queryset
iterator, so I fetch the successive chunks by filtering on PK ranges. I
have benchmarked a bit and a good value for the chunk size seems to be
500: it's not slower than any other value and it keeps the memory usage
down. If I had a lot of RAM, I could also raise this value to 5000 or
15000 without really slowing the process down.
- I actually have two separate atomic migrations (in case something goes
wrong, that is better than one non-atomic migration with two atomic
operations): one to process the objects that are not modifiable by the
users (because they are in a certain status etc.) and a second to
process the objects that could be modified by the users. The second
migration concerns only 4-5% of the total objects, so it should be much
faster. As I use PK ranges to fetch the objects, I have to scale the
nominal chunk size of 500 so that the chunks end up roughly the same
size: 500 * total_objects_count / filtered_queryset_count.
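Here is a rough sketch of the chunking idea (app, model and field names
are made up); fetching by PK ranges keeps the memory usage bounded while
still allowing prefetch_related, which a queryset iterator does not:

from django.db import migrations

CHUNK_SIZE = 500


def migrate_in_chunks(apps, schema_editor):
    OldThing = apps.get_model("myapp", "OldThing")
    NewPartA = apps.get_model("myapp", "NewPartA")

    last_pk = 0
    while True:
        # Fetch the next window of at most CHUNK_SIZE objects by PK.
        chunk = list(
            OldThing.objects.filter(pk__gt=last_pk)
            .order_by("pk")
            .prefetch_related("related_items")[:CHUNK_SIZE]
        )
        if not chunk:
            break
        last_pk = chunk[-1].pk

        # Build the new objects in memory and insert them in one query.
        NewPartA.objects.bulk_create(
            [NewPartA(legacy_pk=obj.pk, title=obj.title) for obj in chunk]
        )


class Migration(migrations.Migration):

    dependencies = [("myapp", "0002_add_new_models")]

    operations = [
        migrations.RunPython(migrate_in_chunks, migrations.RunPython.noop),
    ]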
There remains the problem of users who try to write during the second
migration: their changes will indeed be written to the old model, but not
taken into account by the new models (remember I want to split one model
into two smaller ones). So maybe I should append to the second migration
all the operations responsible for deleting the old model? This way,
people trying to write would get an error, which is the best we can do
here. Am I right?
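Roughly, the second migration would then look like this (again with
placeholder names), with the delete operations appended after the data
copy:

from django.db import migrations


def migrate_editable_objects(apps, schema_editor):
    OldThing = apps.get_model("myapp", "OldThing")
    NewPartA = apps.get_model("myapp", "NewPartA")
    # "status" and its value stand in for whatever marks an object as
    # still editable by users.
    editable = OldThing.objects.filter(status="open")
    NewPartA.objects.bulk_create(
        [NewPartA(legacy_pk=obj.pk, title=obj.title) for obj in editable]
    )


class Migration(migrations.Migration):

    dependencies = [("myapp", "0003_migrate_readonly_objects")]

    operations = [
        migrations.RunPython(migrate_editable_objects, migrations.RunPython.noop),
        # With the old model dropped in the same atomic migration, a write
        # arriving after the commit fails with a database error instead of
        # being silently lost.
        migrations.DeleteModel("OldThing"),
    ]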
Thanks for your help!
Cheers,
Adrien
Adrien Cossa <co...@init.at> wrote:
Hi everybody!
I would like to know what options exist when you have a huge migration
that will obviously not run on your production server.
I have split a model into two smaller ones and then wrote a migration
to populate these new models. The number of original objects is around
250,000 and I also have a few references to update. In the end, the
migration took more than 30 minutes on my machine (16 GB RAM, and it was
swapping a lot) and it failed on another machine because it ran out of
RAM (the process was using about 13 GB at that point). On the production
server we have even less RAM, so running the migration as it is is really
out of the question.
I have tried to use all the Django mechanisms I know to optimize
the queries: select_related, prefetch_related, bulk_create,
QuerySet.update... Right now, the migration I am talking about uses
bulk_create(batch_size=None) and processes the whole queryset at once.
Before that, when the migration was not so long-running because I had two
fewer references to update, I tried other values for batch_size and also
processed the queryset as pages of a few hundred or thousand objects. The
results were no better than batch_size=None and "all at once", which is
why I finally went back to the basic settings (and the migration then
took about 5 minutes). I will have to reintroduce some tweaks because the
extra updates of the two relations I mentioned make a big difference here.
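To give an idea, the current "all at once" variant looks roughly like
this (names are simplified placeholders); batch_size only changes how the
INSERTs are grouped, not how much memory the list of new objects takes:

from django.db import migrations


def split_model(apps, schema_editor):
    Original = apps.get_model("myapp", "Original")
    PartOne = apps.get_model("myapp", "PartOne")

    # Building the full list keeps every new object in RAM at once.
    new_objs = [
        PartOne(legacy_pk=obj.pk, title=obj.title)
        for obj in Original.objects.select_related("owner")
    ]
    # batch_size=None sends everything as one batch (a single multi-row
    # INSERT) on PostgreSQL; batch_size=500 would split it into one
    # INSERT per 500 rows.
    PartOne.objects.bulk_create(new_objs, batch_size=None)


class Migration(migrations.Migration):

    dependencies = [("myapp", "0001_initial")]

    operations = [
        migrations.RunPython(split_model, migrations.RunPython.noop),
    ]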
I am wondering whether someone has already been in a similar situation,
and what solution you finally came up with.
If the migration takes very long, that is not a problem in itself, but I
don't want to lock the database for 15 minutes. The fact is that I don't
know what happens during the migration process: what is locked by what?
I will split the migration into "pages" to use less RAM anyway, but I was
also thinking of migrating in two different steps *or* files, in order to
process separately the objects that are not editable (basically most of
them, which we keep for history but which are read-only) and the others
(which should be much faster, so people who are working will not be
blocked for long). Does that make sense? Any other ideas?
Thanks a lot!
Adrien