Re: frequent softlockups with 3.10rc6.

2013-07-04 Thread Andrew Morton
On Wed, 3 Jul 2013 14:49:01 +1000 Dave Chinner wrote: > On Tue, Jul 02, 2013 at 08:28:42PM -0700, Linus Torvalds wrote: > > On Tue, Jul 2, 2013 at 8:07 PM, Dave Chinner wrote: > > >> > > >> Then that test would become > > >> > > >> if (wbc->sync_mode == WB_SYNC_SINGLE) { > > >> > > >> in

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Dave Chinner
On Tue, Jul 02, 2013 at 08:28:42PM -0700, Linus Torvalds wrote: > On Tue, Jul 2, 2013 at 8:07 PM, Dave Chinner wrote: > >> > >> Then that test would become > >> > >> if (wbc->sync_mode == WB_SYNC_SINGLE) { > >> > >> instead, and now "sync_mode" would actually describe what mode of > >> syn

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Linus Torvalds
On Tue, Jul 2, 2013 at 8:07 PM, Dave Chinner wrote: >> >> Then that test would become >> >> if (wbc->sync_mode == WB_SYNC_SINGLE) { >> >> instead, and now "sync_mode" would actually describe what mode of >> syncing the caller wants, without that hacky special "we know what the >> caller _r
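The test being discussed replaces an inferred "this is sync(2)" special case with an explicit mode the caller sets. A minimal standalone sketch of that idea, assuming the WB_SYNC_SINGLE name proposed in this thread (it is not a mainline constant) and using a toy stand-in for the kernel's struct writeback_control:

#include <stdio.h>

/* Toy writeback sync modes. WB_SYNC_SINGLE is the name proposed in this
 * thread, not something that exists in mainline. */
enum writeback_sync_modes {
        WB_SYNC_NONE,   /* background writeback, don't wait on anything */
        WB_SYNC_ALL,    /* data integrity: wait on every mapping */
        WB_SYNC_SINGLE, /* proposed: wait only for IO submitted by this pass */
};

/* Stand-in for the kernel's struct writeback_control. */
struct writeback_control {
        enum writeback_sync_modes sync_mode;
};

/* Illustrative only: the caller's intent is carried in sync_mode rather than
 * inferred from "we know what the caller really wants" special cases. */
static void do_writeback_pass(const struct writeback_control *wbc)
{
        if (wbc->sync_mode == WB_SYNC_SINGLE)
                printf("wait only for the IO this pass submitted\n");
        else if (wbc->sync_mode == WB_SYNC_ALL)
                printf("wait for everything under writeback on the mapping\n");
        else
                printf("background writeback, do not wait\n");
}

int main(void)
{
        struct writeback_control wbc = { .sync_mode = WB_SYNC_SINGLE };

        do_writeback_pass(&wbc);
        return 0;
}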

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Dave Chinner
On Tue, Jul 02, 2013 at 10:38:20AM -0700, Linus Torvalds wrote: > On Tue, Jul 2, 2013 at 9:57 AM, Jan Kara wrote: > > > > sync(2) was always slow in presence of heavy concurrent IO so I don't > > think this is a stable material. > > It's not the "sync being slow" part I personally react to. I d

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Linus Torvalds
On Tue, Jul 2, 2013 at 9:57 AM, Jan Kara wrote: > > sync(2) was always slow in presence of heavy concurrent IO so I don't > think this is a stable material. It's not the "sync being slow" part I personally react to. I don't care that much about that. It's the "sync slows down other things" par

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Jan Kara
On Tue 02-07-13 09:13:43, Linus Torvalds wrote: > On Tue, Jul 2, 2013 at 7:05 AM, Jan Kara wrote: > > On Tue 02-07-13 22:38:35, Dave Chinner wrote: > >> > >> IOWs, sync is 7-8x faster on a busy filesystem and does not have an > >> adverse impact on ongoing async data write operations. > > The pa

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Linus Torvalds
On Tue, Jul 2, 2013 at 7:05 AM, Jan Kara wrote: > On Tue 02-07-13 22:38:35, Dave Chinner wrote: >> >> IOWs, sync is 7-8x faster on a busy filesystem and does not have an >> adverse impact on ongoing async data write operations. > The patch looks good. You can add: > Reviewed-by: Jan Kara Ok, I

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Jan Kara
On Tue 02-07-13 22:38:35, Dave Chinner wrote: > > As a bonus filesystems could also optimize their write_inode() methods when > > they know ->sync_fs() is going to happen in future. E.g. ext4 wouldn't have > > to do the stupid ext4_force_commit() after each written inode in > > WB_SYNC_ALL mode. >

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Dave Chinner
On Tue, Jul 02, 2013 at 10:19:37AM +0200, Jan Kara wrote: > On Tue 02-07-13 16:29:54, Dave Chinner wrote: > > > > We could, but we just end up in the same place with sync as we are > > > > now - with a long list of clean inodes with a few inodes hidden in > > > > it that are under IO. i.e. we still

Re: frequent softlockups with 3.10rc6.

2013-07-02 Thread Jan Kara
On Tue 02-07-13 16:29:54, Dave Chinner wrote: > > > We could, but we just end up in the same place with sync as we are > > > now - with a long list of clean inodes with a few inodes hidden in > > > it that are under IO. i.e. we still have to walk lots of clean > > > inodes to find the dirty ones th

Re: frequent softlockups with 3.10rc6.

2013-07-01 Thread Dave Chinner
On Mon, Jul 01, 2013 at 02:00:37PM +0200, Jan Kara wrote: > On Sat 29-06-13 13:39:24, Dave Chinner wrote: > > On Fri, Jun 28, 2013 at 12:28:19PM +0200, Jan Kara wrote: > > > On Fri 28-06-13 13:58:25, Dave Chinner wrote: > > > > writeback: store inodes under writeback on a separate list > > > > > >

Re: frequent softlockups with 3.10rc6.

2013-07-01 Thread Pavel Machek
On Sat 2013-06-29 19:44:49, Dave Jones wrote: > On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote: > > > > So with that patch, those two boxes have now been fuzzing away for > > > over 24hrs without seeing that specific sync related bug. > > > > Ok, so at least that confirms that

Re: frequent softlockups with 3.10rc6.

2013-07-01 Thread Jan Kara
On Sat 29-06-13 13:39:24, Dave Chinner wrote: > On Fri, Jun 28, 2013 at 12:28:19PM +0200, Jan Kara wrote: > > On Fri 28-06-13 13:58:25, Dave Chinner wrote: > > > writeback: store inodes under writeback on a separate list > > > > > > From: Dave Chinner > > > > > > When there are lots of cached in

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Dave Chinner
On Sun, Jun 30, 2013 at 12:05:31PM +1000, Dave Chinner wrote: > On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote: > > On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones wrote: > > > > > > So with that patch, those two boxes have now been fuzzing away for > > > over 24hrs without seeing that

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Dave Chinner
On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote: > On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones wrote: > > > > So with that patch, those two boxes have now been fuzzing away for > > over 24hrs without seeing that specific sync related bug. > > Ok, so at least that confirms that yes,

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Steven Rostedt
On Sat, 2013-06-29 at 19:44 -0400, Dave Jones wrote: > Yeah, this is running as a user. Those don't sound like things that should > be possible. What instrumentation could I add to figure out why > that kthread got awakened ? trace-cmd record -e sched_wakeup -f 'comm ~ "migrati*"' Add "-O sta

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Steven Rostedt
On Sat, 2013-06-29 at 15:23 -0700, Linus Torvalds wrote: > Does the machine recover? Because if it does, I'd be inclined to just > ignore it. Although it would be interesting to hear what triggers this > - normal users - and I'm assuming you're still running trinity as > non-root - generally shoul

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Dave Jones
On Sat, Jun 29, 2013 at 03:23:48PM -0700, Linus Torvalds wrote: > > So with that patch, those two boxes have now been fuzzing away for > > over 24hrs without seeing that specific sync related bug. > > Ok, so at least that confirms that yes, the problem is the excessive > contention on inode_

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Linus Torvalds
On Sat, Jun 29, 2013 at 1:13 PM, Dave Jones wrote: > > So with that patch, those two boxes have now been fuzzing away for > over 24hrs without seeing that specific sync related bug. Ok, so at least that confirms that yes, the problem is the excessive contention on inode_sb_list_lock. Ugh. There'

Re: frequent softlockups with 3.10rc6.

2013-06-29 Thread Dave Jones
On Fri, Jun 28, 2013 at 01:58:25PM +1000, Dave Chinner wrote: > > Oh, that's easy enough to fix. It's just changing the wait_sb_inodes > > loop to use a spin_trylock(&inode->i_lock), moving the inode to > > the end of the sync list, dropping all locks and starting again... > > New version be
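The loop change described here (trylock, requeue the busy inode to the tail, drop the locks and restart) can be sketched outside the kernel. The sketch below uses toy types and pthread locks in place of the kernel's spinlocks; only the shape of the retry logic follows the description above:

#include <pthread.h>
#include <stdio.h>

/* Toy inode with its own lock, kept on a circular list whose head is a
 * sentinel node. */
struct toy_inode {
        pthread_spinlock_t i_lock;
        struct toy_inode *next, *prev;
        int id;
};

/* Unlink node and re-insert it just before head, i.e. at the list tail. */
static void list_move_tail(struct toy_inode *node, struct toy_inode *head)
{
        node->prev->next = node->next;
        node->next->prev = node->prev;
        node->prev = head->prev;
        node->next = head;
        head->prev->next = node;
        head->prev = node;
}

/* If an inode's lock is contended, park it at the tail, drop the list lock
 * and restart the walk, so one busy inode cannot stall the whole scan. */
static void wait_list_inodes(struct toy_inode *head, pthread_mutex_t *list_lock)
{
restart:
        pthread_mutex_lock(list_lock);
        for (struct toy_inode *i = head->next; i != head; i = i->next) {
                if (pthread_spin_trylock(&i->i_lock) != 0) {
                        list_move_tail(i, head);
                        pthread_mutex_unlock(list_lock);
                        goto restart;
                }
                printf("inode %d: wait for its writeback here\n", i->id);
                pthread_spin_unlock(&i->i_lock);
        }
        pthread_mutex_unlock(list_lock);
}

int main(void)
{
        pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
        struct toy_inode head = { .next = &head, .prev = &head, .id = -1 };
        struct toy_inode a = { .id = 1 };

        pthread_spin_init(&a.i_lock, PTHREAD_PROCESS_PRIVATE);
        a.next = &head;
        a.prev = &head;
        head.next = &a;
        head.prev = &a;

        wait_list_inodes(&head, &list_lock);
        pthread_spin_destroy(&a.i_lock);
        return 0;
}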

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Dave Chinner
On Fri, Jun 28, 2013 at 12:28:19PM +0200, Jan Kara wrote: > On Fri 28-06-13 13:58:25, Dave Chinner wrote: > > writeback: store inodes under writeback on a separate list > > > > From: Dave Chinner > > > > When there are lots of cached inodes, a sync(2) operation walks all > > of them to try to fi
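The patch summarized here keeps inodes that are under writeback on their own list, so the sync(2)-time walk never has to visit clean inodes at all. A toy, standalone sketch of just that bookkeeping (names and types are illustrative, not the kernel's, and removal when the IO completes is omitted):

#include <stdbool.h>
#include <stdio.h>

struct toy_inode {
        int id;
        bool under_writeback;
        struct toy_inode *wb_next;      /* singly linked writeback list */
};

struct toy_sb {
        struct toy_inode *wb_list;      /* only inodes under writeback */
};

/* Add the inode to the superblock's writeback list when IO is started;
 * the real patch would take it off again when the IO completes. */
static void start_writeback(struct toy_sb *sb, struct toy_inode *inode)
{
        inode->under_writeback = true;
        inode->wb_next = sb->wb_list;
        sb->wb_list = inode;
}

/* Walk only the inodes with IO in flight instead of every cached inode. */
static void wait_sb_inodes(struct toy_sb *sb)
{
        for (struct toy_inode *i = sb->wb_list; i; i = i->wb_next)
                if (i->under_writeback)
                        printf("inode %d: wait for writeback to complete\n",
                               i->id);
}

int main(void)
{
        struct toy_sb sb = { 0 };
        struct toy_inode a = { .id = 1 };
        struct toy_inode b = { .id = 2 };

        start_writeback(&sb, &a);
        start_writeback(&sb, &b);
        wait_sb_inodes(&sb);
        return 0;
}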

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Jan Kara
On Fri 28-06-13 13:58:25, Dave Chinner wrote: > On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote: > > On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote: > > > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote: > > > > > > > > > > > Yup, that's about three ord

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Jan Kara
On Thu 27-06-13 19:59:50, Linus Torvalds wrote: > On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner wrote: > > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote: > >> > >> So what made it all start happening now? I don't recall us having had > >> these kinds of issues before.. > > > > Not

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Al Viro
On Thu, Jun 27, 2013 at 10:22:45PM -1000, Linus Torvalds wrote: > > It looks ok, but I still think it is solving the wrong problem. > > FWIW, your optimisation has much wider application than just this > > one place. I'll have a look to see how we can apply this approach > > across all the inode l

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Linus Torvalds
On Thu, Jun 27, 2013 at 9:21 PM, Dave Chinner wrote: > > Besides, making the inode_sb_list_lock per sb won't help solve this > problem, anyway. The case that I'm testing involves a filesystem > that contains 99.97% of all inodes cached by the system. This is a > pretty common situation Yeah..

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Al Viro
On Thu, Jun 27, 2013 at 07:59:50PM -1000, Linus Torvalds wrote: > Also, looking some more now at that wait_sb_inodes logic, I have to > say that if the problem is primarily the inode->i_lock, then that's > just crazy. Looks more like contention on inode_sb_list_lock, actually... > And no, I don

Re: frequent softlockups with 3.10rc6.

2013-06-28 Thread Dave Chinner
On Thu, Jun 27, 2013 at 07:59:50PM -1000, Linus Torvalds wrote: > On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner wrote: > > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote: > >> > >> So what made it all start happening now? I don't recall us having had > >> these kinds of issues befo

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Linus Torvalds
On Thu, Jun 27, 2013 at 5:54 PM, Dave Chinner wrote: > On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote: >> >> So what made it all start happening now? I don't recall us having had >> these kinds of issues before.. > > Not sure - it's a sudden surprise for me, too. Then again, I have

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Fri, Jun 28, 2013 at 11:13:01AM +1000, Dave Chinner wrote: > On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote: > > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote: > > > > > > > > Yup, that's about three orders of magnitude faster on this > > > > workload > >

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Thu, Jun 27, 2013 at 04:54:53PM -1000, Linus Torvalds wrote: > On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner wrote: > > > > Right, that will be what is happening - the entire system will go > > unresponsive when a sync call happens, so it's entirely possible > > to see the soft lockups on inode

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Linus Torvalds
On Thu, Jun 27, 2013 at 3:18 PM, Dave Chinner wrote: > > Right, that will be what is happening - the entire system will go > unresponsive when a sync call happens, so it's entirely possible > to see the soft lockups on inode_sb_list_add()/inode_sb_list_del() > trying to get the lock because of the

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Thu, Jun 27, 2013 at 10:30:55AM -0400, Dave Jones wrote: > On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote: > > > Is this just a soft lockup warning? Or is the system hung? > > I've only seen it completely lock up the box 2-3 times out of dozens > of times I've seen this, and t

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Thu, Jun 27, 2013 at 11:21:51AM -0400, Dave Jones wrote: > On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote: > > > > > Yup, that's about three orders of magnitude faster on this > > > workload > > > > > > Lightly smoke tested patch below - it passed the first round of

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Jones
On Thu, Jun 27, 2013 at 10:52:18PM +1000, Dave Chinner wrote: > > Yup, that's about three orders of magnitude faster on this > > workload > > > > Lightly smoke tested patch below - it passed the first round of > > XFS data integrity tests in xfstests, so it's not completely > > bu

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Jones
On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote: > Is this just a soft lockup warning? Or is the system hung? I've only seen it completely lock up the box 2-3 times out of dozens of times I've seen this, and tbh that could have been a different bug. > I mean, what you see here i

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Thu, Jun 27, 2013 at 08:06:12PM +1000, Dave Chinner wrote: > On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote: > > On Wed, Jun 26, 2013 at 08:22:55PM -0400, Dave Jones wrote: > > > On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote: > > > > On 06/25, Dave Jones wrote: > >

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Thu, Jun 27, 2013 at 05:55:43PM +1000, Dave Chinner wrote: > On Wed, Jun 26, 2013 at 08:22:55PM -0400, Dave Jones wrote: > > On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote: > > > On 06/25, Dave Jones wrote: > > > > > > > > Took a lot longer to trigger this time. (13 hours of ru

Re: frequent softlockups with 3.10rc6.

2013-06-27 Thread Dave Chinner
On Wed, Jun 26, 2013 at 08:22:55PM -0400, Dave Jones wrote: > On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote: > > On 06/25, Dave Jones wrote: > > > > > > Took a lot longer to trigger this time. (13 hours of runtime). > > > > And _perhaps_ this means that 3.10-rc7 without 8aac62

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Steven Rostedt
On Wed, 2013-06-26 at 16:00 -0400, Dave Jones wrote: > On Wed, Jun 26, 2013 at 03:52:15PM -0400, Steven Rostedt wrote: > Yeah, that's what I meant by "this patch". > To reduce ambiguity, I mean the one below.. There wasn't another patch > that I missed right ? > On other patch, but I've found is

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Tejun Heo
Hello, On Wed, Jun 26, 2013 at 06:06:45PM -0700, Eric W. Biederman wrote: > Just based on the last trace and your observation that it seems to be > vfs/block layer related I am going to mildly suggest that Jens and Tejun > might have a clue. Tejun made a transformation of the threads used for > w

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Eric W. Biederman
Dave Jones writes: > On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote: > > On 06/25, Dave Jones wrote: > > > > > > Took a lot longer to trigger this time. (13 hours of runtime). > > > > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more > > time to hit the same

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Dave Jones
On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote: > On 06/25, Dave Jones wrote: > > > > Took a lot longer to trigger this time. (13 hours of runtime). > > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more > time to hit the same bug ;) Ok, that didn't take long.

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Dave Jones
On Wed, Jun 26, 2013 at 03:52:15PM -0400, Steven Rostedt wrote: > > > Hmm, no it needs a fix to make this work. I applied a patch below that > > > should do this correctly (and will put this into my 3.11 queue). > > > > > > If you run the test again with this change and with the above fil

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Steven Rostedt
On Wed, 2013-06-26 at 01:23 -0400, Dave Jones wrote: > On Tue, Jun 25, 2013 at 12:23:34PM -0400, Steven Rostedt wrote: > > > Now, what we can try to do as well, is to add a trigger to disable > > tracing, which should (I need to check the code) stop tracing on printk. > > To do so: > > > > #

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Dave Jones
On Wed, Jun 26, 2013 at 09:18:53PM +0200, Oleg Nesterov wrote: > On 06/25, Dave Jones wrote: > > > > Took a lot longer to trigger this time. (13 hours of runtime). > > And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more > time to hit the same bug ;) > > Dave, I am not going

Re: frequent softlockups with 3.10rc6.

2013-06-26 Thread Oleg Nesterov
On 06/25, Dave Jones wrote: > > Took a lot longer to trigger this time. (13 hours of runtime). And _perhaps_ this means that 3.10-rc7 without 8aac6270 needs more time to hit the same bug ;) Dave, I am not going to "deny the problem". We should investigate it anyway. And yes, 8aac6270 is not as tr

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Dave Jones
On Tue, Jun 25, 2013 at 12:23:34PM -0400, Steven Rostedt wrote: > On Tue, 2013-06-25 at 11:35 -0400, Dave Jones wrote: > > Took a lot longer to trigger this time. (13 hours of runtime). > > > > This trace may still not be from the first lockup, as a flood of > > them happened at the same time

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Dave Jones
On Tue, Jun 25, 2013 at 12:23:34PM -0400, Steven Rostedt wrote: > Now, what we can try to do as well, is to add a trigger to disable > tracing, which should (I need to check the code) stop tracing on printk. > To do so: > > # echo printk:traceoff > /sys/kernel/debug/tracing/set_ftrace_filter

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Dave Jones
On Tue, Jun 25, 2013 at 01:29:54PM -0400, Steven Rostedt wrote: > On Tue, 2013-06-25 at 13:21 -0400, Steven Rostedt wrote: > > On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote: > > > > > While I've been spinning wheels trying to reproduce that softlockup bug, > > > On another machine I've

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Steven Rostedt
On Tue, 2013-06-25 at 13:26 -0400, Dave Jones wrote: > > > What's the above saying? ffff880243288000->prev == 00ffff88023c6cdd but > it should have been ffff88023c6cdd18? That is: ffff88023c6cdd18->next == > ffff880243288001? > > It's saying something has done >>8 on a pointer, and stuck

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Steven Rostedt
On Tue, 2013-06-25 at 13:26 -0400, Dave Jones wrote: > > Not sure how that would mess up. The ring-buffer code has lots of > > integrity checks to make sure nothing like this breaks. > > My integrity checks can beat up your integrity checks. I don't know. It looks like my code is beating up yo

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Steven Rostedt
On Tue, 2013-06-25 at 13:21 -0400, Steven Rostedt wrote: > On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote: > > > While I've been spinning wheels trying to reproduce that softlockup bug, > > On another machine I've been refining my list-walk debug patch. > > I added an ugly "ok, the ringbuffer

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Dave Jones
On Tue, Jun 25, 2013 at 01:21:30PM -0400, Steven Rostedt wrote: > On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote: > > > While I've been spinning wheels trying to reproduce that softlockup bug, > > On another machine I've been refining my list-walk debug patch. > > I added an ugly "ok, th

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Steven Rostedt
On Tue, 2013-06-25 at 13:21 -0400, Steven Rostedt wrote: > Not sure how that would mess up. The ring-buffer code has lots of > integrity checks to make sure nothing like this breaks. See rb_check_pages() and rb_check_list(). -- Steve

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Steven Rostedt
On Tue, 2013-06-25 at 12:55 -0400, Dave Jones wrote: > While I've been spinning wheels trying to reproduce that softlockup bug, > On another machine I've been refining my list-walk debug patch. > I added an ugly "ok, the ringbuffer is playing games with lower two bits" > special case. > > But wh
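The special case mentioned here exists because the ring buffer keeps state flags in the low two bits of a buffer page's next pointer (presumably why values such as ...288001 show up in the reports above), so a generic prev/next consistency check has to mask those bits off before comparing. A standalone toy illustration, with assumed names rather than the ring buffer's real ones:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PTR_FLAG_MASK 3UL       /* assume the low two bits carry flags */

struct node {
        struct node *next, *prev;
};

/* Drop any flag bits before using or comparing a list pointer. */
static struct node *strip_flags(struct node *p)
{
        return (struct node *)((uintptr_t)p & ~PTR_FLAG_MASK);
}

/* Verify that prev->next and next->prev point back at the node once any
 * flag bits have been masked off. */
static bool check_node(struct node *n)
{
        struct node *prev = strip_flags(n->prev);
        struct node *next = strip_flags(n->next);

        return strip_flags(prev->next) == n && strip_flags(next->prev) == n;
}

int main(void)
{
        struct node head = { &head, &head };

        /* simulate a flag stuffed into the low bit of the next pointer */
        head.next = (struct node *)((uintptr_t)head.next | 1UL);
        printf("sane with flag bits masked: %d\n", check_node(&head));
        return 0;
}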

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Dave Jones
On Mon, Jun 24, 2013 at 01:04:36PM -0400, Steven Rostedt wrote: > On Mon, 2013-06-24 at 12:51 -0400, Dave Jones wrote: > > On Mon, Jun 24, 2013 at 12:24:39PM -0400, Steven Rostedt wrote: > > > > > > Ah, this is the first victim of my new 'check sanity of nodes during > > list walks' patch.

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Steven Rostedt
On Tue, 2013-06-25 at 11:35 -0400, Dave Jones wrote: > Took a lot longer to trigger this time. (13 hours of runtime). > > This trace may still not be from the first lockup, as a flood of > them happened at the same time. > > > # tracer: preemptirqsoff > # > # preemptirqsoff latency trace v1.1.5

Re: frequent softlockups with 3.10rc6.

2013-06-25 Thread Dave Jones
Took a lot longer to trigger this time. (13 hours of runtime). This trace may still not be from the first lockup, as a flood of them happened at the same time. # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 3.10.0-rc7+ # -

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Dave Jones
On Mon, Jun 24, 2013 at 01:53:11PM -0400, Steven Rostedt wrote: > > Also. watchdog_timer_fn() calls printk() only if it detects the > > lockup, so I assume you hit another one? > > Probably. Yeah, unfortunately it happened while I was travelling home to the box, so I couldn't stop it after t

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Steven Rostedt
On Mon, 2013-06-24 at 19:35 +0200, Oleg Nesterov wrote: > On 06/24, Dave Jones wrote: > > > > On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > > > > > > Could you please do the following: > > > > > > 1. # cd /sys/kernel/debug/tracing > > > # echo 0 >> options/function-trac

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Dave Jones
On Mon, Jun 24, 2013 at 07:35:10PM +0200, Oleg Nesterov wrote: > > Not sure this is helpful, but.. > > This makes me think that something is seriously broken. > > Or I do not understand this stuff at all. Quite possible too. > Steven, could you please help? > > But this is already call

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Oleg Nesterov
On 06/24, Dave Jones wrote: > > On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > > > > Could you please do the following: > > > >1. # cd /sys/kernel/debug/tracing > > # echo 0 >> options/function-trace > > # echo preemptirqsoff >> current_tracer > > > >2.

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Steven Rostedt
On Mon, 2013-06-24 at 12:51 -0400, Dave Jones wrote: > On Mon, Jun 24, 2013 at 12:24:39PM -0400, Steven Rostedt wrote: > > > > Ah, this is the first victim of my new 'check sanity of nodes during > list walks' patch. > > > It's doing the same prev->next next->prev checking as list_add and > fr

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Dave Jones
On Mon, Jun 24, 2013 at 12:24:39PM -0400, Steven Rostedt wrote: > > Ah, this is the first victim of my new 'check sanity of nodes during list > > walks' patch. > > It's doing the same prev->next next->prev checking as list_add and friends. > > I'm looking at getting it into shape for a 3.12 m
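A standalone sketch of that list-walk check, doing the same prev->next / next->prev consistency test that list_add() and friends perform on insertion. check_list_nodes is the name taken from the warning quoted elsewhere in this thread; the toy node type below is not the kernel's struct list_head:

#include <stdbool.h>
#include <stdio.h>

struct list_node {
        struct list_node *next, *prev;
};

/* For each node visited during a walk, verify the doubly linked list
 * invariants: next->prev must point back at us, and prev->next must too. */
static bool check_list_nodes(const struct list_node *prev)
{
        const struct list_node *next = prev->next;

        if (next->prev != prev) {
                fprintf(stderr,
                        "check_list_nodes corruption. next->prev should be prev (%p), but was %p. (next=%p)\n",
                        (const void *)prev, (const void *)next->prev,
                        (const void *)next);
                return false;
        }
        if (prev->prev->next != prev) {
                fprintf(stderr,
                        "check_list_nodes corruption. prev->next should be %p, but was %p.\n",
                        (const void *)prev, (const void *)prev->prev->next);
                return false;
        }
        return true;
}

int main(void)
{
        struct list_node head = { &head, &head };

        printf("empty list sane: %d\n", check_list_nodes(&head));
        return 0;
}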

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Dave Jones
On Mon, Jun 24, 2013 at 06:37:08PM +0200, Oleg Nesterov wrote: > On 06/24, Dave Jones wrote: > > > > On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote: > > > > > > > check_list_nodes corruption. next->prev should be prev > > (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Oleg Nesterov
On 06/24, Dave Jones wrote: > > On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote: > > > > > check_list_nodes corruption. next->prev should be prev > (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001). > > > > > > Can't find "check_list_nodes" in lib/list_debug.

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Steven Rostedt
On Mon, 2013-06-24 at 12:00 -0400, Dave Jones wrote: > On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote: > > > > > check_list_nodes corruption. next->prev should be prev > (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001). > > > > > > Can't find "check_list

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Dave Jones
On Mon, Jun 24, 2013 at 10:52:29AM -0400, Steven Rostedt wrote: > > > check_list_nodes corruption. next->prev should be prev > > > (ffff88023b8a1a08), but was 00ffff88023b8a1a. (next=ffff880243288001). > > > > Can't find "check_list_nodes" in lib/list_debug.c or elsewhere... > > > > >

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Dave Jones
On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > > [11054.897670] BUG: soft lockup - CPU#2 stuck for 22s! > > [trinity-child2:14482] > > [11054.898503] Modules linked in: bridge stp snd_seq_dummy tun fuse hidp > > bnep rfcomm can_raw ipt_ULOG can_bcm nfnetlink af_rxrpc llc2 ro

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Steven Rostedt
On Mon, 2013-06-24 at 16:39 +0200, Oleg Nesterov wrote: > On 06/23, Dave Jones wrote: > > > > On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > > > > > Could you please do the following: > > > > > > 1. # cd /sys/kernel/debug/tracing > > > # echo 0 >> options/function-trace >

Re: frequent softlockups with 3.10rc6.

2013-06-24 Thread Oleg Nesterov
On 06/23, Dave Jones wrote: > > On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > > > Could you please do the following: > > > >1. # cd /sys/kernel/debug/tracing > > # echo 0 >> options/function-trace > > # echo preemptirqsoff >> current_tracer > > dammit. > > WA

Re: frequent softlockups with 3.10rc6.

2013-06-23 Thread Dave Jones
On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > Could you please do the following: > > 1. # cd /sys/kernel/debug/tracing > # echo 0 >> options/function-trace > # echo preemptirqsoff >> current_tracer dammit. WARNING: at include/linux/list.h:385 rb_hea

Re: frequent softlockups with 3.10rc6.

2013-06-23 Thread Dave Jones
On Sun, Jun 23, 2013 at 06:04:52PM +0200, Oleg Nesterov wrote: > > [11018.927809] [sched_delayed] sched: RT throttling activated > > [11054.897670] BUG: soft lockup - CPU#2 stuck for 22s! > > [trinity-child2:14482] > > [11054.898503] Modules linked in: bridge stp snd_seq_dummy tun fuse hidp

Re: frequent softlockups with 3.10rc6.

2013-06-23 Thread Oleg Nesterov
On 06/23, Dave Jones wrote: > > On Sun, Jun 23, 2013 at 04:36:34PM +0200, Oleg Nesterov wrote: > > > > > Dave, I am sorry but all I can do is to ask you to do more testing. > > > > Could you please reproduce the lockup again on the clean Linus's > > > > current ? (and _without_ reverting 8aac

Re: frequent softlockups with 3.10rc6.

2013-06-23 Thread Dave Jones
On Sun, Jun 23, 2013 at 04:36:34PM +0200, Oleg Nesterov wrote: > > > Dave, I am sorry but all I can do is to ask you to do more testing. > > > Could you please reproduce the lockup again on the clean Linus's > > > current ? (and _without_ reverting 8aac6270, of course). > > > > I'll give

Re: frequent softlockups with 3.10rc6.

2013-06-23 Thread Oleg Nesterov
On 06/22, Dave Jones wrote: > > On Sat, Jun 22, 2013 at 07:31:29PM +0200, Oleg Nesterov wrote: > > > > [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 > get_proc_task_net+0x1c8/0x1d0() > > > > Hmm. The test case tries to create the symlink in /proc/*/net/ ? > > hit it with symlink, but al

Re: frequent softlockups with 3.10rc6.

2013-06-22 Thread Andrew Vagin
On Sat, Jun 22, 2013 at 05:59:05PM -0400, Dave Jones wrote: > On Sat, Jun 22, 2013 at 07:31:29PM +0200, Oleg Nesterov wrote: > > > > [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 > get_proc_task_net+0x1c8/0x1d0() > > > [ 7485.262021] Modules linked in: 8021q garp stp tun fuse rfcomm bn

Re: frequent softlockups with 3.10rc6.

2013-06-22 Thread Dave Jones
On Sat, Jun 22, 2013 at 07:31:29PM +0200, Oleg Nesterov wrote: > > [ 7485.261299] WARNING: at include/linux/nsproxy.h:63 > > get_proc_task_net+0x1c8/0x1d0() > > [ 7485.262021] Modules linked in: 8021q garp stp tun fuse rfcomm bnep hidp > > snd_seq_dummy nfnetlink scsi_transport_iscsi can_bc

Re: frequent softlockups with 3.10rc6.

2013-06-22 Thread Oleg Nesterov
On 06/21, Dave Jones wrote: > > On Fri, Jun 21, 2013 at 09:59:49PM +0200, Oleg Nesterov wrote: > > > I am puzzled. And I do not really understand > > > >hardirqs last enabled at (2380318): [] > restore_args+0x0/0x30 > >hardirqs last disabled at (2380319): [] > apic_timer_interrupt+0x

Re: frequent softlockups with 3.10rc6.

2013-06-21 Thread Dave Jones
On Fri, Jun 21, 2013 at 09:59:49PM +0200, Oleg Nesterov wrote: > I am puzzled. And I do not really understand > > hardirqs last enabled at (2380318): [] > restore_args+0x0/0x30 > hardirqs last disabled at (2380319): [] > apic_timer_interrupt+0x6a/0x80 > softirqs last ena

Re: frequent softlockups with 3.10rc6.

2013-06-21 Thread Oleg Nesterov
On 06/21, Dave Jones wrote: > > On Thu, Jun 20, 2013 at 09:16:52AM -0700, Paul E. McKenney wrote: > > > > > > I've been hitting this a lot the last few days. > > > > > > This is the same machine that I was also seeing lockups during > sync() > > > > > > > > > > On a whim, I reverted 9713

Re: frequent softlockups with 3.10rc6.

2013-06-21 Thread Dave Jones
On Thu, Jun 20, 2013 at 09:16:52AM -0700, Paul E. McKenney wrote: > > > > > I've been hitting this a lot the last few days. > > > > > This is the same machine that I was also seeing lockups during > > sync() > > > > > > > > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3

Re: frequent softlockups with 3.10rc6.

2013-06-20 Thread Dave Jones
On Thu, Jun 20, 2013 at 09:16:52AM -0700, Paul E. McKenney wrote: > On Wed, Jun 19, 2013 at 08:12:12PM -0400, Dave Jones wrote: > > On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote: > > > On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote: > > > > On Wed, Jun 19, 2013

Re: frequent softlockups with 3.10rc6.

2013-06-20 Thread Paul E. McKenney
On Wed, Jun 19, 2013 at 08:12:12PM -0400, Dave Jones wrote: > On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote: > > On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote: > > > On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote: > > > > I've been hitting this a lot

Re: frequent softlockups with 3.10rc6.

2013-06-19 Thread Dave Jones
On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote: > On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote: > > On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote: > > > I've been hitting this a lot the last few days. > > > This is the same machine that I was also

Re: frequent softlockups with 3.10rc6.

2013-06-19 Thread Dave Jones
On Wed, Jun 19, 2013 at 11:13:02AM -0700, Paul E. McKenney wrote: > > On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3 > > (As I started seeing these just after that rcu merge). > > > > It's only been 30 minutes, but it seems stable again. Normally I would > > hit these within

Re: frequent softlockups with 3.10rc6.

2013-06-19 Thread Paul E. McKenney
On Wed, Jun 19, 2013 at 01:53:56PM -0400, Dave Jones wrote: > On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote: > > I've been hitting this a lot the last few days. > > This is the same machine that I was also seeing lockups during sync() > > On a whim, I reverted 971394f389992f8462c4e5

Re: frequent softlockups with 3.10rc6.

2013-06-19 Thread Dave Jones
On Wed, Jun 19, 2013 at 12:45:40PM -0400, Dave Jones wrote: > I've been hitting this a lot the last few days. > This is the same machine that I was also seeing lockups during sync() On a whim, I reverted 971394f389992f8462c4e5ae0e3b49a10a9534a3 (As I started seeing these just after that rcu merg

frequent softlockups with 3.10rc6.

2013-06-19 Thread Dave Jones
I've been hitting this a lot the last few days. This is the same machine that I was also seeing lockups during sync() Dave BUG: soft lockup - CPU#1 stuck for 22s! [trinity-child9:6902] Modules linked in: bridge snd_seq_dummy dlci bnep fuse 8021q garp stp hidp tun rfcomm can_raw ipt_ULOG