> My team has built a service comprising 3 main parts, a web application
> and 2 long-running worker processes that process events from a message
> exchange. All of these components interact with the same database and use
> the same underlying Django "app" for ORM models (i.e. the 2 worker
> processes call django.setup() on initialization).

Are the two long-running worker instances Django processes, or some other
kind of process? If they are Django workers, you're probably not handling
jobs correctly.

> We've had some issues with the worker processes failing to recover in the
> face of DB connectivity issues. For example at one point Amazon restarted
> our DB (it's an RDS instance) and the workers started flailing, repeatedly
> raising the same exceptions despite the DB coming back online. Later on we
> discovered that we could fix this particular issue by calling
> django.db.connection.close() when this exception occurred (it happened to
> be InterfaceError); on the next attempt to interact w/ the DB Django would
> establish a new connection to the DB and everything would continue to work.
> More recently a new error occurred that caused a similar problem, leading
> us to speculate that we should do the same thing in this case with this new
> type of exception (I think now it's OperationalError because the DB went
> into "recovery mode" or something).

You're right. Django is not really designed to have its database
connections held open by a long-running loop like this.

> We are now planning on refactoring this service a bit so that instead of
> attempting to recover from exceptions, we'll just terminate the process and
> configure an external agent to automatically restart in the face of
> unexpected errors. This feels like a safer design than trying to figure out
> every exception type we should be handling. However I wanted to reach out
> to the Django group as a sanity check to see if we're missing something
> more basic. From browsing various tickets in Django's issue tracker I've
> gotten the impression that we may be swimming upstream a little bit as
> Django is designed as a web framework and relies on DB connections being
> closed or returned to a pool or something automatically at the end of the
> request cycle, not held open by a single loop in a long-running process. Is
> there something special we should be doing in these worker processes? A
> special Django setting perhaps? Should we just be calling
> connection.close() after processing each event? Should we not be using
> Django at all in this case?

The answer is yes, you can and probably should use Django, but not to the
extent of your current implementation. Your jobs should be collected by
Django and handed off immediately to a batch processor designed for
long-running work (although your jobs may not actually be long-running; it
sounds like the process is simply waiting for incoming job requests).

Celery is a popular choice for batch processing with Django. It has hooks
built specifically for Django and is well documented. It does require a
message broker such as Redis or RabbitMQ to keep track of the jobs, but it
is designed to work directly with your Django project, including ORM
access to your existing database.

http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
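
To give a sense of the shape of the integration, it comes down to a small
celery.py in your project plus task functions in your apps. This is only a
sketch -- the project name "myproject", the Redis broker URL, and the
Event model are placeholders, not anything from your setup:

# myproject/celery.py -- standard Celery bootstrap for a Django project
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject", broker="redis://localhost:6379/0")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()  # picks up tasks.py modules from installed apps


# events/tasks.py -- a task that uses your existing ORM models
from celery import shared_task

from .models import Event  # hypothetical model

@shared_task
def process_event(payload):
    # Runs inside a Celery worker; Celery's Django fixup recycles DB
    # connections between tasks, much as Django does at the end of a request.
    Event.objects.create(data=payload)

Your web application (or a thin consumer reading from your message
exchange) then just calls process_event.delay(payload) and the Celery
worker pool does the rest.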

> I think the pessimistic kill-and-restart strategy we've decided upon for
> now will work, but any guidance here to ensure we aren't fighting against
> our own framework would be much appreciated.

My recommendation would be to investigate a batch processor such as
Celery. If you run individual jobs rather than one long-lived process, a
DB restart affects only the few jobs that happen to be running at that
moment, not the whole service. You also get granular control over the
failure behavior of individual jobs: some may be one-shot jobs that simply
fail and report, while others may retry.
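
For instance (again just a sketch -- the task name and the choice of
exceptions are placeholders), Celery lets you declare the retry policy per
task:

from celery import shared_task
from django.db import InterfaceError, OperationalError

@shared_task(
    autoretry_for=(InterfaceError, OperationalError),  # transient DB errors
    retry_backoff=True,               # exponential backoff between attempts
    retry_kwargs={"max_retries": 5},  # then give up and surface the failure
)
def process_one_event(payload):
    ...  # ORM work goes here

A one-shot job would simply omit the retry options and let the failure
show up in your monitoring.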

Also, I would recommend at least coding recovery behavior for the known
failure cases. This list may obviously grow over time, but that's what
keeps developers employed, right? ;-)
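
For the two cases you have hit so far, that is essentially the pattern you
already discovered, applied to both exception types. A sketch only
(handle_event and event stand in for your own code):

from django.db import connection
from django.db.utils import InterfaceError, OperationalError

def process_with_recovery(handle_event, event):
    try:
        handle_event(event)
    except (InterfaceError, OperationalError):
        connection.close()  # the next ORM query opens a fresh connection
        raise               # or requeue/retry, depending on the job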

If you do keep a long-running process going, I would recommend keeping a
tight grip on the connection state in your loop, and perhaps even closing
the connection from time to time as a sanity check that the DB is really
alive.
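
The closest equivalent to Django's end-of-request cleanup in a worker loop
is django.db.close_old_connections(), which discards connections that have
errored out or outlived CONN_MAX_AGE. A rough sketch, assuming
django.setup() has already run (as your workers do); consume_next_event
and handle_event are placeholders for your own code:

from django.db import close_old_connections

def worker_loop(consume_next_event, handle_event):
    while True:
        event = consume_next_event()  # blocks until a message arrives
        close_old_connections()       # drop dead or expired connections
        handle_event(event)           # ORM work happens here
        close_old_connections()       # mimic Django's end-of-request cleanup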

You should also have some sort of external monitoring set up if the
application carries any real value or service expectation. That might
include ongoing automated functional tests that submit test jobs, and so
on.

Catching DB failures preemptively, with no production impact, is a great
way to impress your employer, and the trend data gives you a case to take
to Amazon.

-James
