On Fri, Oct 21, 2011 at 8:08 PM, Shawn Milochik <sh...@milochik.com> wrote:
> Real-life situation:
>
>    I deployed a week's worth of code changes to production. Upon
> restart, site wouldn't load anymore. Nothing but a 404 error.
>
> Long story short:
>
>    By using Django's logging, I discovered that a query was being run
> tens of thousands of times -- obviously in a loop. I inserted
> traceback.format_stack() logging into Django's code temporarily to
> pinpoint the culprit. I discovered that when the URLs were imported
> and admin.autodiscover() was called it imported an admin.py file,
> which imported a models.py file, which imported a forms.py file, which
> contained a Form subclass that was doing something woefully
> inefficient with a "choices" kwarg. Optimized query, moved it into the
> form's __init__, problem solved.
>
> But in the meantime:
>
>    The site was down for about half an hour before I worked around
> the problem: I realized the gunicorn workers were timing out and
> increased the timeout in my supervisord config. And that was
> just a quick & dirty "fix" to give me time to find the real problem,
> which took considerably longer.
>
> What I need:
>
>    What's the best way to start looking for a problem like this? I
> started with pdb, but it's a ridiculously inefficient tool for this
> kind of problem, and I went off in the wrong direction a couple of
> times trying to zero in on the issue. It was just a pleasant
> coincidence that, when I checked the log, the huge number of queries
> pointed me in the right direction.
>
>    Maybe part of the problem is that I'm not familiar with Django's
> bootstrapping process, so I don't know where to sprinkle logging
> statements to isolate the issue.
>
>    I'm looking for a general solution. Don't assume the issue is
> necessarily ORM-related or anything in particular. Just that something
> is slow and I'm trying to find out what it is.
>
> Thanks in advance for any wisdom and hard-earned experience you can share.
>
> Shawn

Hi Shawn

30 minutes to go from "wtf? now nothing is working" to "ok, that was
silly, fixed" doesn't seem too bad. The important thing is to not
expose that to end users.
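
On the bug itself: a query hidden in a form field's "choices" kwarg is
a classic trap, because it runs whenever the module is imported (here,
via admin.autodiscover()). A rough before/after sketch - I'm guessing
at a plain ChoiceField with made-up model and field names, so the
details will differ from your code:

    # forms.py - hypothetical reconstruction of the pattern
    from django import forms
    from myapp.models import Category  # assumed model


    def category_choices():
        # One (possibly expensive) query every time this is called.
        return [(c.pk, c.name) for c in Category.objects.all()]


    class BadFilterForm(forms.Form):
        # Evaluated at import time, so the query runs whenever the
        # module is imported, long before any request needs it.
        category = forms.ChoiceField(choices=category_choices())


    class GoodFilterForm(forms.Form):
        # Leave the choices empty at class-definition time...
        category = forms.ChoiceField(choices=[])

        def __init__(self, *args, **kwargs):
            super(GoodFilterForm, self).__init__(*args, **kwargs)
            # ...and fill them in when a form instance is built, so
            # the query runs per request instead of at import time.
            self.fields['category'].choices = category_choices()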

We keep that sort of thing away from end users by having an insulated
integration environment, which duplicates the conditions on the
frontend and allows us to run a test load through the site. Anything
that looks hinky on integration doesn't make it to production.
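
One cheap tripwire for "hinky" during that test load is logging the
query count per request. A sketch of such a middleware (not something
from our actual stack) - it relies on DEBUG being True, so it belongs
on the integration box only, added to MIDDLEWARE_CLASSES:

    # querycount.py - hypothetical example, integration use only
    import logging

    from django.db import connection

    logger = logging.getLogger('querycount')


    class QueryCountMiddleware(object):

        def process_response(self, request, response):
            # connection.queries is only populated when DEBUG = True.
            n = len(connection.queries)
            if n > 200:  # arbitrary threshold, tune to taste
                logger.warning("%s made %d SQL queries", request.path, n)
            return response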

Finally, we run two app servers, backed by MySQL master-master
replication - this is for reliability and recovery, not performance.
When we do an upgrade, we temporarily send all HTTP traffic to one of
the app servers and upgrade the other one. Then we switch over and
push all traffic through the updated app server and let it run for a
good few hours.

If anything goes wrong at this point, we can send traffic back to the
un-updated app server and revert the updated one; otherwise we
complete the update on the second app server and push traffic equally
to both servers.

Our integration infrastructure uses Squid and Apache to 'fake up' our
DC - we switch our proxy to the one in the integration infra, which
then redirects public names into the integration infra. Our app
servers run Apache/mod_fcgi, and our frontends run Apache and HAProxy.
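
As for where to start looking next time: with DEBUG = True on a
non-production box, Django sends every SQL statement, with timings, to
the django.db.backends logger, so a query storm shows up in a log file
without patching Django's source. A minimal LOGGING sketch (the
handler name and file path are just placeholders):

    # settings.py on a staging/integration box - a sketch, not our
    # production config. Django only emits these messages when
    # DEBUG = True.
    DEBUG = True

    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'handlers': {
            'sql_log': {
                'level': 'DEBUG',
                'class': 'logging.FileHandler',
                'filename': '/tmp/django-sql.log',
            },
        },
        'loggers': {
            # Every SQL statement ends up in the file above, so a
            # query running tens of thousands of times is obvious.
            'django.db.backends': {
                'handlers': ['sql_log'],
                'level': 'DEBUG',
                'propagate': False,
            },
        },
    }

If you also need to know where each query comes from, a custom
logging.Filter that attaches traceback.format_stack() output to each
record gives you roughly what you got by editing Django's code, without
touching its source.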

Cheers

Tom
