Re: [BUG] RTNL and flush_scheduled_work deadlocks

Jarek Poplawski Tue, 20 Feb 2007 00:20:35 -0800

On Fri, Feb 16, 2007 at 08:06:25AM -0800, Ben Greear wrote:
...
> Well, I had lockdep and all of the locking debugging I could find
> enabled, but it did not catch this problem..I had to use sysctl -t
> and manually dig through the backtraces to find the deadlock....
> 
> It may be that lockdep could be enhanced to catch this sort of thing....


I think you are really good at tracing very interesting
(subtle) problems.

I guess the scenario is like this:

1) some process takes some lock (e.g. RTNL),
2) a kthread runs a work function, which tries to get the
   same lock,
3) the process holding the lock calls flush_scheduled_work,
4) flush_cpu_workqueue waits for the kthread to finish.

So process #1 (holding the lock) waits for process #2
(the kthread) to finish, while process #2 waits for the
lock held by process #1.
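
To make the cycle explicit, here is a toy user-space model of the
steps above. It is plain Python, not kernel code, and the node names
are only labels I made up for illustration:

```python
# Toy wait-for graph for the scenario above; plain Python, not kernel code.
# Node names are illustrative labels, not kernel identifiers.

def find_cycle(waits_for):
    """Return a list of nodes forming a wait cycle, or None if there is none."""
    for start in waits_for:
        path = [start]
        node = waits_for.get(start)
        # Follow the chain of "X waits for Y" until it ends or repeats.
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None

# Each task waits for at most one thing here, so a dict is enough.
waits_for = {
    "process": "kthread",  # steps 3-4: the flush waits for the work function to end
    "kthread": "process",  # step 2: the work function waits for the lock holder
}

cycle = find_cycle(waits_for)
```

Only one of the two edges comes from a lock; the other comes from the
flush, which is the part a lock checker has no way to see.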

Of course it's a lockup - similar to a circular dependency,
but not the same: there is only one lock. I don't think
lockdep can be blamed here - the flush is not a lock, so
lockdep can't know why process #1 is waiting.

In my opinion the solution should be looked for in the
workqueue code. My idea is: maybe some additional lock
could be taken by the kthread before running the
workqueue and by any process calling the flush. Then
lockdep shouldn't have any problem seeing this dependency.
This lock could be under #ifdef DEBUG_LOCK... so it exists
only where it can be analyzed. Of course there may be some
simpler solution to this otherwise hard-to-track problem.
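
A rough sketch of why the extra lock would help, again as a toy
Python model rather than kernel code. The checker below only mimics
the lock-ordering idea behind lockdep, and "wq_pseudo_lock" is a
hypothetical name for the additional lock, not an existing symbol:

```python
# Toy lock-order checker in the spirit of lockdep; plain Python, not kernel code.
# "wq_pseudo_lock" is a hypothetical name for the extra lock proposed above.

class OrderGraph:
    def __init__(self):
        self.after = {}  # lock -> set of locks taken while it was held

    def acquire(self, held, new):
        """Record 'new' taken while 'held' are held; True if that closes a cycle."""
        bad = any(self._reaches(new, h) for h in held)
        for h in held:
            self.after.setdefault(h, set()).add(new)
        return bad

    def _reaches(self, src, dst):
        # Depth-first search over the recorded "taken after" edges.
        stack, seen = [src], set()
        while stack:
            cur = stack.pop()
            if cur == dst:
                return True
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.after.get(cur, ()))
        return False


def check(with_pseudo_lock):
    """Replay the scenario; return True if the checker reports a cycle."""
    g = OrderGraph()
    # kthread: (pseudo lock held) -> the work function takes rtnl
    kthread_held = ["wq_pseudo_lock"] if with_pseudo_lock else []
    r1 = g.acquire(kthread_held, "rtnl")
    # process: holds rtnl -> the flush takes the pseudo lock
    r2 = g.acquire(["rtnl"], "wq_pseudo_lock") if with_pseudo_lock else False
    return r1 or r2
```

With only the real lock in the picture, check(False) finds nothing to
complain about. Once both the kthread and the flusher take the pseudo
lock, check(True) sees the two opposite orderings and reports the
cycle, even if the actual lockup never happens in that run.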

I'm CCing this message to Ingo Molnar and hope he can find
some time to think about it.

Regards,
Jarek P.
