Unfortunately that kind of fix isn't available for all adapters. 15 workers with a 1-second heartbeat is clearly a workload that needs something other than a database to do IPC.
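To make the redis idea concrete, here is a minimal sketch of what heartbeat IPC could look like outside the database (this is NOT the scheduler's code; the key names, timings and redis-py usage are just an illustration of the approach):

    # Hypothetical sketch only: heartbeats as expiring redis keys instead of
    # scheduler_worker rows. Names and numbers are illustrative, not web2py code.
    import time
    import redis

    r = redis.StrictRedis(host='localhost', port=6379, db=0)

    HEARTBEAT = 1   # seconds between beats
    EXPIRY = 3      # a worker counts as dead once its key has expired

    def send_heartbeat(worker_name):
        # an atomic SET with a TTL: no row locks, nothing to deadlock on
        r.set('worker:%s' % worker_name, int(time.time()), ex=EXPIRY)

    def alive_workers():
        # dead workers simply vanish when their keys expire, so there is
        # no multi-row DELETE competing with other workers
        return sorted(key.decode().split(':', 1)[1]
                      for key in r.scan_iter('worker:*'))

    send_heartbeat('UKVMAAS#6316')
    time.sleep(HEARTBEAT)
    print(alive_workers())

The point is that an expiring key replaces both the per-worker heartbeat UPDATE and the multi-row dead-worker DELETE, which are exactly the statements colliding in the logs quoted below.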
On Friday, November 20, 2015 at 7:52:05 AM UTC+1, Andy Southgate wrote:
>
> Thanks for the reply. I'm using 15 workers with a one-second heartbeat. I don't think this is fundamentally a load problem though - it looks like a case of two things taking the same locks in a different order. Maybe the situation mentioned in section 13.3.4 of http://www.postgresql.org/docs/9.4/static/explicit-locking.html, "deadlocks can also occur as the result of row-level locks".
>
> I've been running an experimental workaround overnight, adding this:
>
>     if counter % 5 == 0 or mybackedstatus == PICK:
>         try:
>             db.commit()
>             db.executesql("LOCK TABLE scheduler_worker;")
>             # delete dead workers
>
> and I haven't seen a recurrence yet. If this is the problem, maybe having the first update touch *all* of the rows would achieve the same thing in a more compatible way.
>
> Andy
>
> On Tuesday, 17 November 2015 11:02:07 UTC, Niphlod wrote:
>>
>> How many workers do you have, and what heartbeat are you using?
>>
>> It's no news that the scheduler_worker table is the most "congested": it's where most of the IPC happens. That's why I should release a version that does IPC on redis, but alas, I haven't found the time yet to polish the code.
>>
>> BTW: as it is, the code can't be made "less pushy": every transaction you see is necessary to avoid bugs.
>>
>> On Tuesday, November 17, 2015 at 1:47:00 AM UTC+1, Andy Southgate wrote:
>>>
>>> Hi all,
>>>
>>> First of all, many thanks to all involved in web2py, it's been working out very well for me :)
>>>
>>> I've been using the scheduler a lot, and I think I've found a source of database deadlocks. As far as I can tell, it happens when the system is deleting what it thinks are dead workers (because heartbeats have timed out) and electing a new ticker, and two or more worker processes are trying to do this at the same time. They each update their own heartbeats and then do several update/delete operations on a number of the scheduler_worker rows, without intervening db.commits, and the 'many rows' operations collide with each other. These can range from simple:
>>>
>>> 2015-11-16 11:02:33 GMT PID=5244 trans=72287062 ERROR: 40P01: deadlock detected
>>> 2015-11-16 11:02:33 GMT PID=5244 trans=72287062 DETAIL: Process 5244 waits for ShareLock on transaction 72287313; blocked by process 6236.
>>> Process 6236 waits for ShareLock on transaction 72287062; blocked by process 5244.
>>> Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
>>> Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
>>>
>>> to spectacular:
>>>
>>> 2015-11-16 11:02:16 GMT PID=6772 trans=72287377 ERROR: 40P01: deadlock detected
>>> 2015-11-16 11:02:16 GMT PID=6772 trans=72287377 DETAIL: Process 6772 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 564.
>>> Process 564 waits for ShareLock on transaction 72287313; blocked by process 6236.
>>> Process 6236 waits for AccessExclusiveLock on tuple (0,19) of relation 16908 of database 16384; blocked by process 6804.
>>> Process 6804 waits for ShareLock on transaction 72287062; blocked by process 5244.
>>> Process 5244 waits for ShareLock on transaction 72287388; blocked by process 728.
>>> Process 728 waits for ExclusiveLock on tuple (311,9) of relation 16681 of database 16384; blocked by process 6772.
>>> Process 6772: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:01') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:01') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
>>> Process 564: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:45') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:45') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
>>> Process 6236: DELETE FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:00:44') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:46:44') AND (scheduler_worker.status <> 'ACTIVE')));
>>> Process 6804: UPDATE scheduler_worker SET status='ACTIVE',last_heartbeat='2015-11-16 11:00:36',worker_stats='{"status": "ACTIVE", "errors": 0, "workers": 0, "queue": 0, "empty_runs": 11683, "sleep": 1.0, "distribution": null, "total": 0}' WHERE (scheduler_worker.worker_name = 'UKVMAAS#4280');
>>> Process 5244: UPDATE scheduler_worker SET is_ticker='F' WHERE (scheduler_worker.worker_name <> 'UKVMAAS#6316');
>>> Process 728: UPDATE scheduler_task SET status='QUEUED',assigned_worker_name='' WHERE ((scheduler_task.assigned_worker_name IN (SELECT scheduler_worker.worker_name FROM scheduler_worker WHERE (((scheduler_worker.last_heartbeat < '2015-11-16 11:01:03') AND (scheduler_worker.status = 'ACTIVE')) OR ((scheduler_worker.last_heartbeat < '2015-11-16 10:47:03') AND (scheduler_worker.status <> 'ACTIVE'))))) AND (scheduler_task.status = 'RUNNING'));
>>>
>>> (from PostgreSQL 9.4.4 logs set up for debug; web2py 2.12.3, Windows 10)
>>>
>>> This seems to happen more often than you'd hope because the earlier database operations tend to synchronise multiple workers in time if they're already waiting on a lock. The worst case I've found is to set the deadlock timeout in PostgreSQL longer than the heartbeat timeout, so a number of workers are released together when the DB times out the deadlocked transaction. This can get stuck in a loop where it immediately recreates the same problem.
>>>
>>> If this makes sense, is it possible to split up send_heartbeat into more transactions without introducing other problems?
>>>
>>> Many thanks,
>>>
>>> Andy S.
>>>
>>> PS My heartbeats time out because this is a VM that occasionally gets starved of resource, so not web2py's fault :)
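For anyone hitting this before a redis-based IPC exists: the idea behind Andy's LOCK TABLE workaround quoted above is to make every worker queue on one table-level lock before touching the shared rows, so they can no longer acquire row locks in opposite orders. A standalone sketch of the same idea in plain psycopg2, with a retry for any deadlock PostgreSQL still reports as 40P01 (the connection string, cutoffs and helper function are made up, and this is not a patch to gluon/scheduler.py):

    # Sketch: serialize the dead-worker cleanup behind an explicit table lock and
    # retry if PostgreSQL still detects a deadlock (SQLSTATE 40P01).
    # Connection string, cutoffs and the function itself are hypothetical.
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=web2py user=web2py")

    def cleanup_dead_workers(active_cutoff, inactive_cutoff, retries=3):
        for attempt in range(retries):
            try:
                with conn:                       # one transaction per attempt
                    with conn.cursor() as cur:
                        # default lock mode is ACCESS EXCLUSIVE, so competing
                        # workers queue here instead of deadlocking on row locks
                        cur.execute("LOCK TABLE scheduler_worker;")
                        cur.execute("""
                            DELETE FROM scheduler_worker
                            WHERE (last_heartbeat < %s AND status = 'ACTIVE')
                               OR (last_heartbeat < %s AND status <> 'ACTIVE');
                        """, (active_cutoff, inactive_cutoff))
                return True
            except psycopg2.extensions.TransactionRollbackError:
                # covers deadlock_detected; the `with conn` block has already
                # rolled back, so it is safe to simply try again
                continue
        return False

The trade-off is that LOCK TABLE without a mode blocks even the per-worker heartbeat UPDATEs for the duration of the cleanup, but that brief serialization may well be cheaper than the deadlock/timeout loop described above.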