Hi all,

First of all, many thanks to everyone involved in web2py; it's been working out very well for me :)
I've been using the scheduler a lot, and I think I've found a source of database deadlocks. As far as I can tell, it happens when the system is deleting what it thinks are dead workers (because their heartbeats have timed out) and electing a new ticker, and two or more worker processes try to do this at the same time. Each worker updates its own heartbeat and then runs several UPDATE/DELETE operations over a number of the scheduler_worker rows, without intervening db.commits, and those multi-row operations collide with each other.
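To make that concrete, here is roughly the sequence I think each worker runs inside a single transaction, paraphrased at the DAL level from the SQL that shows up in the logs below. The names, thresholds and exact queries here are my reconstruction, not the actual gluon/scheduler.py code:

    # Rough paraphrase of the heartbeat/maintenance phase -- NOT the real
    # scheduler code.  Assumes 'db' is the DAL instance on which Scheduler(db)
    # has defined the scheduler_* tables.
    from datetime import datetime, timedelta

    heartbeat = 3                    # seconds; illustrative value
    me = 'UKVMAAS#6316'              # this worker's name (taken from my logs)
    now = datetime.utcnow()

    # 0. update my own heartbeat row
    db(db.scheduler_worker.worker_name == me).update(status='ACTIVE',
                                                     last_heartbeat=now)

    # "dead" workers: heartbeat too old (the thresholds here are made up)
    dead = (((db.scheduler_worker.last_heartbeat < now - timedelta(seconds=3 * heartbeat)) &
             (db.scheduler_worker.status == 'ACTIVE')) |
            ((db.scheduler_worker.last_heartbeat < now - timedelta(seconds=15 * heartbeat)) &
             (db.scheduler_worker.status != 'ACTIVE')))

    # 1. requeue RUNNING tasks assigned to dead workers (multi-row UPDATE on scheduler_task)
    db((db.scheduler_task.assigned_worker_name.belongs(
            db(dead)._select(db.scheduler_worker.worker_name))) &
       (db.scheduler_task.status == 'RUNNING')).update(status='QUEUED',
                                                       assigned_worker_name='')

    # 2. delete the dead workers (multi-row DELETE on scheduler_worker)
    db(dead).delete()

    # 3. elect myself ticker (more multi-row UPDATEs on scheduler_worker)
    db(db.scheduler_worker.worker_name != me).update(is_ticker=False)
    db(db.scheduler_worker.worker_name == me).update(is_ticker=True)

    db.commit()   # single commit at the end, so all the row locks are held throughout

Because two workers can reach these statements at almost the same moment and touch the same rows in a different order, the locks end up crossed, which is what the PostgreSQL deadlock reports below show.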
The resulting deadlocks range from simple:

2015-11-16 11:02:33 GMT PID=5244 trans=72287062 ERROR: 40P01: deadlock detected
2015-11-16 11:02:33 GMT PID=5244 trans=72287062 DETAIL: Process 5244 waits for ShareLock on transaction 72287313; blocked by process 6236.
Process 6236 waits for ShareLock on transaction 72287062; blocked by process 5244.
Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));

to spectacular:

2015-11-16 11:02:16 GMT PID=6772 trans=72287377 ERROR: 40P01: deadlock detected
2015-11-16 11:02:16 GMT PID=6772 trans=72287377 DETAIL: Process 6772 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 564.
Process 564 waits for ShareLock on transaction 72287313; blocked by process 6236.
Process 6236 waits for AccessExclusiveLock on tuple (0,19) of relation 16908 of database 16384; blocked by process 6804.
Process 6804 waits for ShareLock on transaction 72287062; blocked by process 5244.
Process 5244 waits for ShareLock on transaction 72287388; blocked by process 728.
Process 728 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 6772.
Process 6772: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:01') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:01') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
Process 564: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:45') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:45') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
Process 6804: UPDATE scheduler_worker SET status='ACTIVE',last_heartbeat='2015-11-16 11:00:36',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0, "queue": 0, "empty_runs": 11683, "sleep": 1.0, "distribution": null, "total": 0}' WHERE (scheduler_worker.worker_name = 'UKVMAAS#4280');
Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
Process 728: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:03') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:03') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));

(from PostgreSQL 9.4.4 logs set up for debug; web2py 2.12.3, Windows 10)

This seems to happen more often than you'd hope, because the earlier database operations tend to synchronise multiple workers in time if they're already waiting on a lock. The worst case I've found is setting the deadlock timeout in PostgreSQL longer than the heartbeat timeout: a number of workers are released all at once when the DB breaks the deadlocked transaction, and they can get stuck in a loop that immediately recreates the same problem.

If this makes sense, is it possible to split up send_heartbeat into more transactions (roughly as sketched in the PPS below) without introducing other problems?

Many thanks,
Andy S.

PS My heartbeats time out because this is a VM that occasionally gets starved of resources, so not web2py's fault :)
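PPS In case it helps the discussion, this is the kind of split I have in mind, written against the same rough DAL-level paraphrase as the sketch above (it reuses my made-up db, dead and me names, and is not a patch against gluon/scheduler.py): commit after each multi-row statement, so the row locks taken by one step are released before the next step starts instead of being held for the whole maintenance pass.

    # Same assumptions as the earlier sketch; 'db', 'dead' and 'me' as defined there.
    from datetime import datetime

    # 0. my own heartbeat, committed on its own
    db(db.scheduler_worker.worker_name == me).update(status='ACTIVE',
                                                     last_heartbeat=datetime.utcnow())
    db.commit()

    # 1. requeue tasks owned by dead workers, then release those row locks
    db((db.scheduler_task.assigned_worker_name.belongs(
            db(dead)._select(db.scheduler_worker.worker_name))) &
       (db.scheduler_task.status == 'RUNNING')).update(status='QUEUED',
                                                       assigned_worker_name='')
    db.commit()

    # 2. delete the dead workers in their own transaction
    db(dead).delete()
    db.commit()

    # 3. ticker election in its own transaction
    db(db.scheduler_worker.worker_name != me).update(is_ticker=False)
    db(db.scheduler_worker.worker_name == me).update(is_ticker=True)
    db.commit()

I realise the smaller transactions mean another worker can see intermediate state between the steps, which is why I'm asking whether this would introduce other problems.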