On Fri, Feb 16, 2007 at 08:06:25AM -0800, Ben Greear wrote:
...
> Well, I had lockdep and all of the locking debugging I could find
> enabled, but it did not catch this problem..I had to use sysctl -t
> and manually dig through the backtraces to find the deadlock....
>
> It may be that lockdep could be enhanced to catch this sort of thing....

I think you are really good at tracing very interesting (subtle)
problems.

I guess the scenario is like this:

1) some process takes some lock (e.g. RTNL),
2) kthread runs a work function, which tries to get the same lock,
3) the process with the lock calls flush_scheduled_work,
4) flush_cpu_workqueue waits for kthread to finish.

So the process from 1), which holds the lock, waits for the kthread
from 2) to finish, while the kthread waits for the lock held by that
process. Of course it's a lockup. It is similar to a circular
dependency, but not the same: there is only one lock.

I don't think lockdep can be blamed here. The wait in the flush is not
a lock, so lockdep can't know why the first process is waiting.

In my opinion the solution should be looked for in the workqueue code.
My idea is: maybe there should be some additional lock, taken by the
kthread before it runs the queued work and by a process calling the
flush. Then lockdep shouldn't have any problems with seeing this
dependency. This lock could be under #ifdef DEBUG_LOCK..., so it would
exist only where it could be analyzed.

Of course there may be some simpler solution to this otherwise hard to
track problem. I am CCing this message to Ingo Molnar and hope he can
find some time to think about it.

Regards,
Jarek P.
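
PS: To make the idea of the additional lock more concrete, here is a rough user-space model in Python. It is only a sketch, not kernel code and not how lockdep is implemented. The names LockOrderGraph, record_scenario and wq_pseudo_lock are made up for illustration. The model records "lock A was held while taking lock B" edges per task and then looks for a cycle, which is roughly the kind of ordering information a lock validator works from.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy recorder of "A was held while taking B" lock-order edges."""

    def __init__(self):
        self.edges = defaultdict(set)  # lock -> locks taken while it was held
        self.held = defaultdict(list)  # task -> locks currently held

    def acquire(self, task, lock):
        # Every lock already held by this task now "comes before" the new one.
        for outer in self.held[task]:
            self.edges[outer].add(lock)
        self.held[task].append(lock)

    def release(self, task, lock):
        self.held[task].remove(lock)

    def has_cycle(self):
        # Depth-first search for a cycle in the recorded ordering graph.
        visiting, done = set(), set()

        def visit(node):
            if node in done:
                return False
            if node in visiting:
                return True
            visiting.add(node)
            found = any(visit(nxt) for nxt in self.edges.get(node, ()))
            visiting.discard(node)
            done.add(node)
            return found

        return any(visit(node) for node in list(self.edges))


def record_scenario(annotate_flush):
    """Replay the four steps; return True if an ordering cycle is recorded.

    annotate_flush=True models the proposed extra lock (hypothetical name
    "wq_pseudo_lock"), taken by the kthread around the work function and by
    the process calling the flush.
    """
    graph = LockOrderGraph()

    # kthread: runs a work function which takes RTNL.
    if annotate_flush:
        graph.acquire("kthread", "wq_pseudo_lock")
    graph.acquire("kthread", "RTNL")
    graph.release("kthread", "RTNL")
    if annotate_flush:
        graph.release("kthread", "wq_pseudo_lock")

    # process: takes RTNL, then flushes the workqueue while holding it.
    graph.acquire("process", "RTNL")
    if annotate_flush:
        graph.acquire("process", "wq_pseudo_lock")
        graph.release("process", "wq_pseudo_lock")
    graph.release("process", "RTNL")

    return graph.has_cycle()
```

Without the extra lock the model records no ordering edges at all, so there is nothing to report. With it, the edges RTNL -> wq_pseudo_lock and wq_pseudo_lock -> RTNL form a cycle, even though the two tasks never actually deadlocked during the recording.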