Re: stop_machine lockup issue in 3.9.y.

2013-06-06 Thread Tejun Heo
On Thu, Jun 06, 2013 at 02:15:40PM -0700, Ben Greear wrote: > >First of all, kudos for tracking the issue down. While the removal of > >looping limit in softirq handling was the direct cause for making the > >problem visible, it's very bothering that we have softirq runaway. > >Finding out the per

Re: stop_machine lockup issue in 3.9.y.

2013-06-06 Thread Ben Greear
On 06/06/2013 01:55 PM, Tejun Heo wrote: Hello, Ben. On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote: On 06/05/2013 08:26 PM, Eric Dumazet wrote: On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: Ah, so, that's why it's showing up now. We probably have had the same issue all alo

Re: stop_machine lockup issue in 3.9.y.

2013-06-06 Thread Tejun Heo
Hello, Ben. On Wed, Jun 05, 2013 at 08:41:01PM -0700, Ben Greear wrote: > On 06/05/2013 08:26 PM, Eric Dumazet wrote: > >On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: > >>Ah, so, that's why it's showing up now. We probably have had the same > >>issue all along but it used to be masked by th

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Eric Dumazet
On Wed, 2013-06-05 at 20:50 -0700, Ben Greear wrote: > On 06/05/2013 08:46 PM, Eric Dumazet wrote: > > > > At Google we use a patch that triggers a warning if a thread holds the CPU > > without checking need_resched() for more than xx ms > > Well, I'm sure that patch works nicely until the clock

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Ben Greear
On 06/05/2013 08:46 PM, Eric Dumazet wrote: On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote: On 06/05/2013 08:26 PM, Eric Dumazet wrote: On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: Ah, so, that's why it's showing up now. We probably have had the same issue all along but it used

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Eric Dumazet
On Wed, 2013-06-05 at 20:41 -0700, Ben Greear wrote: > On 06/05/2013 08:26 PM, Eric Dumazet wrote: > > On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: > > > >> > >> Ah, so, that's why it's showing up now. We probably have had the same > >> issue all along but it used to be masked by the softir

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Ben Greear
On 06/05/2013 08:26 PM, Eric Dumazet wrote: On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: Ah, so, that's why it's showing up now. We probably have had the same issue all along but it used to be masked by the softirq limiting. Do you care to revive the 10 iterations limit so that it's l

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Eric Dumazet
On Wed, 2013-06-05 at 20:14 -0700, Tejun Heo wrote: > > Ah, so, that's why it's showing up now. We probably have had the same > issue all along but it used to be masked by the softirq limiting. Do > you care to revive the 10 iterations limit so that it's limited by > both the count and timing?
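
The suggestion quoted above, bounding the softirq loop by both an iteration count and a time budget, can be illustrated with a small user-space sketch. This is not the kernel's code: `run_bounded`, `struct loop_result`, and the two clock functions are hypothetical names made up for the illustration, and the tick counter is only a stand-in for jiffies. The point it tries to show is the one raised in this thread: if the clock stops advancing, a time-only limit never fires, so a count limit is the only thing that ends the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical clock source standing in for jiffies (an assumption). */
typedef unsigned long (*clock_fn)(void);

struct loop_result {
    unsigned int passes;    /* processing passes actually run */
    bool hit_count_limit;   /* stopped because max_passes was reached */
    bool hit_time_limit;    /* stopped because the time budget expired */
};

static unsigned long fake_ticks;

/* Clock that never advances, like jiffies while the tick is not serviced. */
static unsigned long frozen_clock(void) { return fake_ticks; }

/* Clock that advances by one tick on every read. */
static unsigned long advancing_clock(void) { return fake_ticks++; }

/*
 * Run up to `pending` passes of simulated work, stopping early when either
 * the time budget or the pass count is exhausted, whichever comes first.
 */
static struct loop_result run_bounded(clock_fn now, unsigned long budget_ticks,
                                      unsigned int max_passes,
                                      unsigned int pending)
{
    struct loop_result r = { 0, false, false };
    unsigned long deadline;

    assert(now != NULL && max_passes > 0);
    deadline = now() + budget_ticks;

    while (pending > 0) {
        pending--;              /* one pass of simulated work */
        r.passes++;
        if (pending == 0)
            break;              /* nothing left to do */

        /* Wrap-safe "now is at or past the deadline" comparison. */
        if ((long)(now() - deadline) >= 0) {
            r.hit_time_limit = true;
            break;
        }
        /* Count limit: still effective when the clock is frozen. */
        if (r.passes >= max_passes) {
            r.hit_count_limit = true;
            break;
        }
    }
    return r;
}
```

With `frozen_clock`, a budget of 2 ticks, and a limit of 10 passes, a large backlog is cut off by the count limit; with `advancing_clock` and the same arguments, the time limit fires first. In the real kernel the deferred work would be handed off to another context rather than dropped, which this sketch does not model.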

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Tejun Heo
Hello, Eric. On Wed, Jun 05, 2013 at 06:34:52PM -0700, Eric Dumazet wrote: > > Ingo, Thomas, we're seeing stop_machine hang because > > > > * All other CPUs entered the IRQ-disabled stage. Jiffies is not being > > updated. > > > > * The last CPU gets caught up executing softirq indefinitely.

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Eric Dumazet
On Wed, 2013-06-05 at 14:11 -0700, Tejun Heo wrote: > (cc'ing wireless crowd, tglx and Ingo. The original thread is at > http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 ) > > Hello, Ben. > > On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote: > > Hmm, wonder if I found it

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Ben Greear
On 06/05/2013 02:11 PM, Tejun Heo wrote: (cc'ing wireless crowd, tglx and Ingo. The original thread is at http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 ) Hello, Ben. On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote: Hmm, wonder if I found it. I previously saw tim

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Tejun Heo
(cc'ing wireless crowd, tglx and Ingo. The original thread is at http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 ) Hello, Ben. On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote: > Hmm, wonder if I found it. I previously saw times where it appears > jiffies does not incr

Re: stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Ben Greear
On 06/05/2013 12:31 PM, Ben Greear wrote: This is no longer really about the module unlink, so changing subject. On 06/05/2013 12:11 PM, Ben Greear wrote: On 06/05/2013 11:48 AM, Tejun Heo wrote: Hello, Ben. On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote: One pattern I notice rep

stop_machine lockup issue in 3.9.y.

2013-06-05 Thread Ben Greear
This is no longer really about the module unlink, so changing subject. On 06/05/2013 12:11 PM, Ben Greear wrote: On 06/05/2013 11:48 AM, Tejun Heo wrote: Hello, Ben. On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote: One pattern I notice repeating for at least most of the hangs is th