On Fri, Oct 21, 2011 at 8:08 PM, Shawn Milochik <sh...@milochik.com> wrote:
> Real-life situation:
>
> I deployed a week's worth of code changes to production. Upon restart,
> the site wouldn't load anymore. Nothing but a 404 error.
>
> Long story short:
>
> By using Django's logging, I discovered that a query was being run tens
> of thousands of times -- obviously in a loop. I inserted
> traceback.format_stack() logging into Django's code temporarily to
> pinpoint the culprit. I discovered that when the URLs were imported and
> admin.autodiscover() was called, it imported an admin.py file, which
> imported a models.py file, which imported a forms.py file, which
> contained a Form subclass that was doing something woefully inefficient
> with a "choices" kwarg. Optimized the query, moved it into the form's
> __init__, problem solved.
>
> But in the meantime:
>
> The site was down for about half an hour, which is the time it took me
> to work around the problem by realizing that the gunicorn workers were
> timing out and increasing the timeout in my supervisord config. And that
> was just a quick & dirty "fix" to give me time to find the real problem,
> which took considerably longer.
>
> What I need:
>
> What's the best way to start looking for a problem like this? I started
> with pdb, but it's a ridiculously inefficient tool for this kind of
> problem, and I went off in the wrong direction a couple of times trying
> to zero in on the issue. It's just a pleasant coincidence that when I
> checked the log there was a huge number of queries, which set me off in
> the right direction.
>
> Maybe part of the problem is that I'm not familiar with Django's
> bootstrapping process, so I don't know where to sprinkle logging
> statements to isolate the issue.
>
> I'm looking for a general solution. Don't assume the issue is
> necessarily ORM-related or anything in particular. Just assume that
> something is slow and I'm trying to find out what it is.
>
> Thanks in advance for any wisdom and hard-earned experience you can share.
>
> Shawn
Hi Shawn,

30 minutes to go from "wtf? now nothing is working" to "ok, that was
silly, fixed" doesn't seem too bad. The important thing is to not expose
that to end users.

We do this by having an insulated integration environment, which
duplicates the conditions on the frontend and allows us to run a test
load through the site. Anything which looks hinky on integration doesn't
make it to production.

Finally, we run two app servers, backed by MySQL master-master
replication - this is for reliability and recovery, not performance.
When we do an upgrade, we temporarily send all http traffic to one of
the app servers and upgrade the other one. At that point, we switch over
and push all traffic through the updated app server and let it run for a
good few hours. If anything goes wrong at this point, we can go back to
the un-updated version and revert the other app server; otherwise we
complete the update on the other app server and push traffic equally to
both servers again.

Our integration infrastructure uses Squid and Apache to 'fake up' our
DC - we change the proxy to the one in the integration infra, which then
redirects public names into the integration infra. Our app servers run
apache/mod_fcgi, and our frontends run apache and HAproxy.

Cheers
Tom
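
P.S. On the "where do I sprinkle logging statements" question: Django
(1.3 and later) will send every SQL statement it runs to the
'django.db.backends' logger at DEBUG level while settings.DEBUG is True,
so an entry along these lines in settings.py is often enough to spot a
query repeating tens of thousands of times. The handler name and log
path here are made up - adjust to taste:

    # settings.py - sketch only; assumes Django 1.3+ LOGGING support.
    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'handlers': {
            'sql_log': {
                'level': 'DEBUG',
                'class': 'logging.FileHandler',
                'filename': '/tmp/django_sql.log',  # hypothetical path
            },
        },
        'loggers': {
            # Only emits records when settings.DEBUG is True; every SQL
            # statement (with its duration) lands in the file, so a query
            # stuck in a loop stands out immediately.
            'django.db.backends': {
                'handlers': ['sql_log'],
                'level': 'DEBUG',
                'propagate': False,
            },
        },
    }

From there, a temporary logging.debug(''.join(traceback.format_stack()))
dropped next to the suspect call - essentially what Shawn did inside
Django's code - shows exactly which import or view is triggering it.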
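
As for the import-time "choices" query itself, it usually boils down to
something like the sketch below. The Category model and its 'name' field
are made-up placeholders rather than anything from Shawn's code, but the
shape of the problem, and of the move-it-into-__init__ fix he describes,
is the same:

    # forms.py - sketch only; Category is a hypothetical model.
    from django import forms
    from myapp.models import Category  # placeholder import

    class WidgetForm(forms.Form):
        # BAD: the list comprehension runs as soon as forms.py is imported
        # (admin.autodiscover() -> admin.py -> models.py -> forms.py), so
        # its queries fire at import time, before any request is served.
        category = forms.ChoiceField(
            choices=[(c.pk, c.name) for c in Category.objects.all()]
        )

    class BetterWidgetForm(forms.Form):
        category = forms.ChoiceField(choices=())

        def __init__(self, *args, **kwargs):
            super(BetterWidgetForm, self).__init__(*args, **kwargs)
            # GOOD: a single query, and only when a form instance is
            # actually created, not when the module is imported.
            self.fields['category'].choices = \
                Category.objects.values_list('pk', 'name')

Either way, keeping queries out of module level means admin.autodiscover()
can import everything cheaply at startup.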