Unfortunately that kind of fix isn't available for all adapters. 15 workers with a 1-second heartbeat is clearly a workload that needs something other than a database to do IPC.
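To make the redis idea concrete, here is a minimal sketch of what heartbeat IPC could look like outside the database (this is NOT the scheduler's code; the key names, timings and redis-py usage are just an illustration of the approach):

    # Hypothetical sketch only: heartbeats as expiring redis keys instead of
    # scheduler_worker rows. Names and numbers are illustrative, not web2py code.
    import time
    import redis

    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    HEARTBEAT = 1   # seconds between beats
    EXPIRY = 3      # a worker counts as dead once its key has expired

    def send_heartbeat(worker_name):
        # an atomic SET with a TTL: no row locks, nothing to deadlock on
        r.set('worker:%s' % worker_name, int(time.time()), ex=EXPIRY)

    def alive_workers():
        # dead workers simply vanish when their keys expire, so there is
        # no multi-row DELETE competing with other workers
        return sorted(key.decode().split(':', 1)[1]
                      for key in r.scan_iter('worker:*'))

    send_heartbeat('UKVMAAS#6316')
    time.sleep(HEARTBEAT)
    print(alive_workers())

The point is that an expiring key replaces both the per-worker heartbeat UPDATE and the multi-row dead-worker DELETE, which are exactly the statements colliding in the logs quoted below.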
On Friday, November 20, 2015 at 7:52:05 AM UTC+1, Andy Southgate wrote:
>
> Thanks for the reply. I'm using 15 workers with a one-second heartbeat. I don't think this is fundamentally a load problem though - it looks like a case of two things taking the same locks in a different order. Maybe the situation mentioned in section 13.3.4 of http://www.postgresql.org/docs/9.4/static/explicit-locking.html, "deadlocks can also occur as the result of row-level locks".
>
> I've been running an experimental workaround overnight, adding this:
>
>     if counter % 5 == 0 or mybackedstatus == PICK:
>         try:
>             db.commit()
>             db.executesql("LOCK TABLE scheduler_worker;")
>             # delete dead workers
>
> and I haven't seen a recurrence yet. If this is the problem, maybe having the first update touch *all* of the rows would achieve the same thing in a more compatible way.
>
> Andy
>
> On Tuesday, 17 November 2015 11:02:07 UTC, Niphlod wrote:
>>
>> How many workers do you have, and what heartbeat are you using?
>>
>> It's no news that the scheduler_worker table is the most "congested": it's where most of the IPC happens. That's why I should release a version that does IPC on redis, but alas, I haven't found the time yet to polish the code.
>>
>> BTW: as it is, the code can't be made "less pushy": every transaction you see is necessary to avoid bugs.
>>
>> On Tuesday, November 17, 2015 at 1:47:00 AM UTC+1, Andy Southgate wrote:
>>>
>>> Hi all,
>>>
>>> First of all, many thanks to all involved in web2py, it's been working out very well for me :)
>>>
>>> I've been using the scheduler a lot, and I think I've found a source of database deadlocks. As far as I can tell, it happens when the system is deleting what it thinks are dead workers (because heartbeats have timed out) and electing a new ticker, and two or more worker processes are trying to do this at the same time. They each update their own heartbeats and then do several update/delete operations on a number of the scheduler_worker rows, without intervening db.commits, and the 'many rows' operations collide with each other. These can range from simple:
>>>
>>> 2015-11-16 11:02:33 GMT PID=5244 trans=72287062 ERROR: 40P01: deadlock detected
>>> 2015-11-16 11:02:33 GMT PID=5244 trans=72287062 DETAIL: Process 5244 waits for ShareLock on transaction 72287313; blocked by process 6236.
>>> Process 6236 waits for ShareLock on transaction 72287062; blocked by process 5244.
>>> Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
>>> Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
>>>
>>> to spectacular:
>>>
>>> 2015-11-16 11:02:16 GMT PID=6772 trans=72287377 ERROR: 40P01: deadlock detected
>>> 2015-11-16 11:02:16 GMT PID=6772 trans=72287377 DETAIL: Process 6772 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 564.
>>> Process 564 waits for ShareLock on transaction 72287313; blocked by process 6236.
>>> Process 6236 waits for AccessExclusiveLock on tuple (0,19) of relation 16908 of database 16384; blocked by process 6804.
>>> Process 6804 waits for ShareLock on transaction 72287062; blocked by process 5244.
>>> Process 5244 waits for ShareLock on transaction 72287388; blocked by process 728.
>>> Process 728 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 6772.
>>> Process 6772: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:01') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:01') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
>>> Process 564: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:45') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:45') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
>>> Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
>>> Process 6804: UPDATE scheduler_worker SET status='ACTIVE',last_heartbeat='2015-11-16 11:00:36',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0, "queue": 0, "empty_runs": 11683, "sleep": 1.0, "distribution": null, "total": 0}' WHERE (scheduler_worker.worker_name = 'UKVMAAS#4280');
>>> Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
>>> Process 728: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:03') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:03') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
>>>
>>> (from PostgreSQL 9.4.4 logs set up for debug; web2py 2.12.3, Windows 10)
>>>
>>> This seems to happen more often than you'd hope because the earlier database operations tend to synchronise multiple workers in time if they're already waiting on a lock. The worst case I've found is to set the deadlock timeout in PostgreSQL longer than the heartbeat timeout, so a number of workers are released together when the DB times out the deadlocked transaction. This can get stuck in a loop where it immediately recreates the same problem.
>>>
>>> If this makes sense, is it possible to split up send_heartbeat into more transactions without introducing other problems?
>>>
>>> Many thanks,
>>>
>>> Andy S.
>>>
>>> PS My heartbeats time out because this is a VM that occasionally gets starved of resource, so not web2py's fault :)
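For anyone hitting this before a redis-based IPC exists: the idea behind Andy's LOCK TABLE workaround quoted above is to make every worker queue on one table-level lock before touching the shared rows, so they can no longer acquire row locks in opposite orders. A standalone sketch of the same idea in plain psycopg2, with a retry for any deadlock PostgreSQL still reports as 40P01 (the connection string, cutoffs and helper function are made up, and this is not a patch to gluon/scheduler.py):

    # Sketch: serialize the dead-worker cleanup behind an explicit table lock and
    # retry if PostgreSQL still detects a deadlock (SQLSTATE 40P01).
    # Connection string, cutoffs and the function itself are hypothetical.
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=web2py user=web2py")

    def cleanup_dead_workers(active_cutoff, inactive_cutoff, retries=3):
        for attempt in range(retries):
            try:
                with conn:                       # one transaction per attempt
                    with conn.cursor() as cur:
                        # default lock mode is ACCESS EXCLUSIVE, so competing
                        # workers queue here instead of deadlocking on row locks
                        cur.execute("LOCK TABLE scheduler_worker;")
                        cur.execute("""
                            DELETE FROM scheduler_worker
                            WHERE (last_heartbeat < %s AND status = 'ACTIVE')
                               OR (last_heartbeat < %s AND status <> 'ACTIVE');
                        """, (active_cutoff, inactive_cutoff))
                return True
            except psycopg2.extensions.TransactionRollbackError:
                # covers deadlock_detected; the `with conn` block has already
                # rolled back, so it is safe to simply try again
                continue
        return False

The trade-off is that LOCK TABLE without a mode blocks even the per-worker heartbeat UPDATEs for the duration of the cleanup, but that brief serialization may well be cheaper than the deadlock/timeout loop described above.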