First, this thread already talks about the issue (I found it after
doing my own profiling with guppy/heapy):
http://groups.google.com/group/web2py/browse_thread/thread/534961af8e972300/212d7a59f2a7f239

The general idea is that all SQL statements are stored as-is in
db._timings (see the sketch below). This isn't an issue, except if:
A) You are doing very, very large SQL statements. I'm uploading lots
of files and storing them in the DB, which triggers this to a lesser
extent.
B) You are running a "background job" as detailed in the book. This
includes queues, etc. Basically, if you keep using the same DB
object for any reason over a considerable period of time, or if you're
doing large inserts in a batch/background job, this issue will
probably bite you.
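
Here's a minimal sketch of the behaviour, assuming a hypothetical
db.recording table with a blob field named audio (not my real schema);
on my DAL version each db._timings entry looks like a (sql, seconds)
tuple, so adjust if yours differs:

    # every execute leaves the full SQL text (blob payload included)
    # behind in db._timings until something clears it
    wav_bytes = open('sample.wav', 'rb').read()
    db.recording.insert(audio=wav_bytes)
    db.commit()
    print(len(db._timings))         # grows by one entry per statement
    print(len(db._timings[-1][0]))  # last SQL string is roughly blob-sized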

Here's my setup:
I'm putting through roughly 60k transactions in just under 3 minutes,
across 4 servers running 3 instances of web2py each. The performance
is staggeringly slow (for the traffic I'm pushing through) if I use
the DAL, so I'm using redis as a "transactional" storage layer. I've
got batch jobs set up to then pull out of redis and sync with the
historical database. In addition to the pure transactions, I'm also
pushing roughly 25k recordings, WAV format, about 3-8 seconds each,
into the database. (Using a shared drive isn't an option for several
reasons, and believe it or not, storing the recordings in the DB as a
blob field actually works pretty well.)
My batch jobs run for 6 hours before restarting themselves. There's
one additional machine for the batch jobs, running 1 job per type, 3
jobs total (one per core, with one core free). The jobs are configured
as cron jobs in web2py. I've created a redis-based locking system to
prevent multiple job spawns, and to restart within a minute if a job
stops processing.
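
For what it's worth, the locking is nothing fancy. A minimal sketch of
the idea, with illustrative names (job_lock:* keys, localhost redis)
rather than my actual config, using redis-py's atomic SET NX EX:

    import os
    import socket
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    def acquire_job_lock(job_name, ttl_seconds=60):
        # True if this process now owns the lock; the TTL means a hung
        # or crashed worker releases it on its own, so the cron-driven
        # restart can take over within a minute.
        worker_id = '%s:%s' % (socket.gethostname(), os.getpid())
        return bool(r.set('job_lock:%s' % job_name, worker_id,
                          nx=True, ex=ttl_seconds))

    def refresh_job_lock(job_name, ttl_seconds=60):
        # extend the TTL while the job is still actively processing
        r.expire('job_lock:%s' % job_name, ttl_seconds)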

The problem:
Over the batch job's run time, db._timings stores the full text of
every SQL statement passed through. Unfortunately, that includes the
content of blob fields as well as some very large (>5k) logging/
archive fields, etc. This means that within a few minutes, I have
hundreds of MB of already-inserted data sitting in memory.
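
A rough way to confirm where the memory is going (again assuming the
(sql, seconds) entry format on my version):

    approx_mb = sum(len(str(t[0])) for t in db._timings) / (1024.0 * 1024.0)
    print('%d statements, ~%.1f MB of SQL text retained'
          % (len(db._timings), approx_mb))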

The solution (interim and suggestion for the code):
I have a thread in each batch job that maintains lock states, manages
job queues, and syncs various minor systems. It also makes sure that
db.commit gets called at least every 5 seconds (to ensure the data is
available within a reasonable time frame, and to limit data loss in
case of a crash). Adding "db._timings = []" to the db.commit portion
reduces the amount of memory used for db access long term. My
suggestion for the code: why wouldn't db.commit clear db._timings
within the DAL? I'm not 100% sure why _timings is there in the first
place, other than for some debugging.
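
A stripped-down sketch of that maintenance thread's loop (the lock and
queue handling omitted); the key line is resetting db._timings right
after each periodic commit:

    import time

    def maintenance_loop(db, interval=5):
        while True:
            db.commit()       # flush pending work so data is visible promptly
            db._timings = []  # drop the stored SQL text, incl. blob inserts
            time.sleep(interval)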
