Re: frequent lockups in 3.18rc4

2015-02-12 Thread Linus Torvalds
On Thu, Feb 12, 2015 at 3:09 AM, Martin van Es wrote: > > Best I can come up with now is try the next mainline that has all the > fixes and ideas in this thread incorporated. Would that be 3.19? Yes. I'm attaching a patch (very much experimental - it might introduce new problems rather than fix o

Re: frequent lockups in 3.18rc4

2015-02-12 Thread Martin van Es
To follow up on this long standing promise to bisect. I've made two attempts at bisecting and both landed in limbo. It's hard to explain but it feels like this bug has quantum properties; I know for sure it's present in 3.17 and not in 3.16(.7). But once I start bisecting it gets less pronounced.

Re: frequent lockups in 3.18rc4

2015-01-12 Thread Thomas Gleixner
On Sun, 21 Dec 2014, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote: > > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > > > > > And finally, and stupidly, is there any chance that you have anything > > > accessing /dev/hpet? > > > > Not knowingly a

Re: frequent lockups in 3.18rc4

2015-01-05 Thread John Stultz
On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds wrote: > On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote: >> >> Anyway, It may be worth keeping the 50% margin (and dropping the 12% >> reduction to simplify things) > > Again, the 50% margin is only on the multiplication overflow. Not on the mask.

Re: frequent lockups in 3.18rc4

2015-01-05 Thread Linus Torvalds
On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote: > > Anyway, It may be worth keeping the 50% margin (and dropping the 12% > reduction to simplify things) Again, the 50% margin is only on the multiplication overflow. Not on the mask. So it won't do anything at all for the case we actually care
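[Editor's sketch, not from the thread: the snippet above distinguishes two separate limits on how long a clocksource can go unread, the 64-bit overflow of the cycles-times-mult conversion and the wrap of the hardware counter (the mask). The C below is a minimal illustration of why a margin applied only to the first limit may not change the result when the mask is the tighter bound. All function names are hypothetical and this is not the kernel's implementation.]

```c
#include <assert.h>
#include <stdint.h>

/* Largest cycle delta for which (delta * mult) still fits in 64 bits. */
static uint64_t max_cycles_before_mult_overflow(uint32_t mult)
{
    assert(mult != 0);
    return UINT64_MAX / mult;
}

/* Largest cycle delta a counter of `bits` width can hold before it wraps. */
static uint64_t max_cycles_before_wrap(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/*
 * Hypothetical policy: halve the overflow limit (a 50% margin), leave the
 * mask limit untouched, and take whichever bound is smaller.
 */
static uint64_t usable_cycles(uint32_t mult, unsigned bits)
{
    uint64_t by_mult = max_cycles_before_mult_overflow(mult) / 2;
    uint64_t by_mask = max_cycles_before_wrap(bits);

    return by_mult < by_mask ? by_mult : by_mask;
}
```

[With an assumed mult of 1 << 22 and a 32-bit counter, the halved overflow limit is still far above the mask, so usable_cycles() returns the mask value and the margin has no effect on the bound.]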

Re: frequent lockups in 3.18rc4

2015-01-05 Thread John Stultz
On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds wrote: > On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote: >> >> So I sent out a first step validation check to warn us if we end up >> with idle periods that are larger than we expect. > > .. not having tested it, this is just from reading the pat

Re: frequent lockups in 3.18rc4

2015-01-04 Thread Linus Torvalds
On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote: > > So I sent out a first step validation check to warn us if we end up > with idle periods that are larger than we expect. .. not having tested it, this is just from reading the patch, but it would *seem* that it doesn't actually validate the cl

Re: frequent lockups in 3.18rc4

2015-01-03 Thread Sasha Levin
On 01/02/2015 07:27 PM, John Stultz wrote: > On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds > wrote: >> > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones >> > wrote: >>> >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: >>> >> >>> >> > One thing I think I'll try is to try and narrow

Re: frequent lockups in 3.18rc4

2015-01-02 Thread John Stultz
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote: >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: >> >> > One thing I think I'll try is to try and narrow down which >> > syscalls are triggering those "Clocksource hpet ha

Re: frequent lockups in 3.18rc4

2014-12-28 Thread Paul E. McKenney
On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote: > On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote: > > > > * So 1/8th of the interval seems way too short, as there's > > clocksources like the ACPI PM, which wrap every 2.5 seconds or so. > > Ugh. At the same time, 1/8th of a rang

Re: frequent lockups in 3.18rc4

2014-12-27 Thread Dave Jones
On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote: > > > > > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" + > > > "switched to hpet" messages there yet? > > > > not yet. 3 hrs in. > > Ok, so then th

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote: > > > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" + > > "switched to hpet" messages there yet? > > not yet. 3 hrs in. Ok, so then the INFO: rcu_preempt detected stalls on CPUs/tasks: has nothing to do with HPET, s

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > still running though.. > > Btw, did you ever boot with "tsc=reliable" as a kernel command line option? I'll check it again in the morning, but before I turn in for th

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > still running though.. > > Btw, did you ever boot with "tsc=reliable" as a kernel command line option? I don't think so. > For the last night, can you see if you ca

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > hm. > > So with the previous patch that had the false positives, you never saw > this? You saw the false positives instead? correct. > I'm wondering if the added

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > still running though.. Btw, did you ever boot with "tsc=reliable" as a kernel command line option? For the last night, can you see if you can just run it with that, and things work? Because by now, my gut feel is that we should start deratin

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > hm. So with the previous patch that had the false positives, you never saw this? You saw the false positives instead? I'm wondering if the added debug noise just ended up helping. Doing a printk() will automatically cause some scheduler acti

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: > I have a newer version of the patch that gets rid of the false > positives with some ordering rules instead, and just for you I hacked > it up to say where the problem happens too, but it's likely too late. hm. [ 2733.047100]

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: > I have a newer version of the patch that gets rid of the false > positives with some ordering rules instead, and just for you I hacked > it up to say where the problem happens too, but it's likely too late. I'll give it a spin a

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote: > On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: > > > One thing I think I'll try is to try and narrow down which > > syscalls are triggering those "Clocksource hpet had cycles off" > > messages. I'm still unclear on exactly what

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: > One thing I think I'll try is to try and narrow down which > syscalls are triggering those "Clocksource hpet had cycles off" > messages. I'm still unclear on exactly what is doing > the stomping on the hpet. First I ran trinity wi

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote: > On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > > > But in the meantime please do keep that thing running as long as you > > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > > result -

Re: frequent lockups in 3.18rc4

2014-12-24 Thread Sasha Levin
On 12/23/2014 09:56 AM, Dave Jones wrote: > On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > > > But in the meantime please do keep that thing running as long as you > > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > > result - the original softlockup

Re: frequent lockups in 3.18rc4

2014-12-23 Thread Dave Jones
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > But in the meantime please do keep that thing running as long as you > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > result - the original softlockup bug happening *without* any bigger > hpet jumps.

Re: frequent lockups in 3.18rc4

2014-12-23 Thread Dave Jones
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > But in the meantime please do keep that thing running as long as you > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > result - the original softlockup bug happening *without* any bigger > hpet jumps.

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote: > > * So 1/8th of the interval seems way too short, as there's > clocksources like the ACPI PM, which wrap every 2.5 seconds or so. Ugh. At the same time, 1/8th of a range is actually bigger than I'd like, since if there is some timer corruption,

Re: frequent lockups in 3.18rc4

2014-12-22 Thread John Stultz
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds > wrote: >> >> This is *not* to say that this is the bug you're hitting. But it does show >> that >> >> (a) a flaky HPET can do some seriously bad stuff >> (b) the kernel is very fragile w

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones wrote: > > I tried the nohpet thing for a few hours this morning and didn't see > anything weird, but it may have been that I just didn't run long enough. > When I saw your patch, I gave that a shot instead, with hpet enabled > again. Just got back to f

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Dave Jones
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote: > And again: this is not trying to make the kernel clock not jump. There > is no way I can come up with even in theory to try to really *fix* a > fundamentally broken clock. > > So this is not meant to be a real "fix" for anythin

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds wrote: > > .. and we might still lock up under some circumstances. But at least > from my limited testing, it is infinitely much better, even if it > might not be perfect. Also note that my "testing" has been writing > zero to the HPET lock (so the

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote: > > This is *not* to say that this is the bug you're hitting. But it does show > that > > (a) a flaky HPET can do some seriously bad stuff > (b) the kernel is very fragile wrt time going backwards. > > and maybe we can use this test program

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Paul E. McKenney
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds > wrote: > > > > The second time (or third, or fourth - it might not take immediately) > > you get a lockup or similar. Bad things happen. > > I've only tested it twice now, but the f

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Dave Jones
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: > > The second time (or third, or fourth - it might not take immediately) > > you get a lockup or similar. Bad things happen. > > I've only tested it twice now, but the first time I got a weird > lockup-like thing (things *kind*

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote: > > The second time (or third, or fourth - it might not take immediately) > you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of worked, but I coul

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds wrote: > > I can do the mmap(/dev/mem) thing and access the HPET by hand, and > when I write zero to it I immediately get something like this: > > Clocksource tsc unstable (delta = -284317725450 ns) > Switched to clocksource hpet > > just to conf

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote: > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > > > And finally, and stupidly, is there any chance that you have anything > > accessing /dev/hpet? > > Not knowingly at least, but who the hell knows what systemd has its > fi

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Dave Jones
On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > So the range of 1-251 seconds is not entirely random. It's all in > > that "32-bit HPET range". > > DaveJ, I assume it's too late now, and you don't effectively have any > access to the machine any more, but "hpet=disable"

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds wrote: > > So the range of 1-251 seconds is not entirely random. It's all in > that "32-bit HPET range". DaveJ, I assume it's too late now, and you don't effectively have any access to the machine any more, but "hpet=disable" or "nohpet" on the com
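[Editor's sketch, not from the thread: the "32-bit HPET range" remark can be checked with simple arithmetic. Assuming the common HPET rate of 14.31818 MHz (an assumption; the snippet does not state the rate on the affected machine), a 32-bit counter wraps after roughly 300 seconds, so jumps of 1-251 seconds all fit inside a single wrap. The helper name is hypothetical.]

```c
#include <assert.h>
#include <stdint.h>

/* Assumed HPET tick rate in Hz; real hardware reports its own period. */
#define HPET_HZ_ASSUMED 14318180.0

/* Seconds until a free-running counter of `bits` width wraps at `hz` ticks/s. */
static double counter_wrap_seconds(unsigned bits, double hz)
{
    assert(bits >= 1 && bits <= 63 && hz > 0.0);
    /* 2^bits is a power of two, so the conversion to double is exact. */
    return (double)(UINT64_C(1) << bits) / hz;
}
```

[Under that assumption, counter_wrap_seconds(32, HPET_HZ_ASSUMED) comes out just under 300 seconds.]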

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds wrote: > > Hmm, ok, I've re-acquainted myself with it. And I have to admit that I > can't see anything wrong. The whole "update_wall_clock" and the shadow > timekeeping state is confusing as hell, but seems fine. We'd have to > avoid update_wall_cloc

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Paul E. McKenney
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote: > On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds > wrote: > > > > How/where is the HPET overflow case handled? I don't know the code enough. > > Hmm, ok, I've re-acquainted myself with it. And I have to admit that I > can't see any

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Linus Torvalds
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds wrote: > > How/where is the HPET overflow case handled? I don't know the code enough. Hmm, ok, I've re-acquainted myself with it. And I have to admit that I can't see anything wrong. The whole "update_wall_clock" and the shadow timekeeping state is

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds wrote: > > I'm claiming that the race happened *once*. And it then corrupted some > data structure or similar sufficiently that CPU0 keeps looping. > > Perhaps something keeps re-adding itself to the head of the timerqueue > due to the race. So tick

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Dave Jones
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote: > > Right now I'm doing Chris' idea of "turn debugging back on, > > and try without serial console". Shall I try your suggestion > > on top of that ? > > Might as well. I doubt it really will make any difference, but I also > d

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 5:00 PM, Thomas Gleixner wrote: > > The watchdog timer runs on a fully periodic schedule. It's self > rearming via > > hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period)); > > So if that aligns with the equally periodic tick interrupt on the > other CPU then y
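[Editor's sketch, not from the thread: the self-rearming pattern quoted above pushes a timer's expiry forward by whole periods until it lies in the future, which is what keeps the watchdog on a fixed periodic grid. The userspace C below models that arithmetic only. forward_expiry() is a hypothetical stand-in, not the kernel's hrtimer_forward_now(), and times are plain nanosecond counts.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Advance *expires by whole multiples of `interval` until it is strictly
 * later than `now`. Returns the number of periods skipped (0 if the timer
 * had not expired yet).
 */
static uint64_t forward_expiry(uint64_t *expires, uint64_t now, uint64_t interval)
{
    assert(expires != 0 && interval != 0);

    if (now < *expires)
        return 0; /* still pending, nothing to do */

    uint64_t overruns = (now - *expires) / interval + 1;
    *expires += overruns * interval;
    return overruns;
}
```

[Because the new expiry is always the old one plus a whole number of periods, two timers with the same period keep whatever phase relationship they started with, which is the alignment concern the snippet raises.]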

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Chris Mason wrote: > On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote: > > But at the very end this would be detected by the runtime check of the > > hrtimer interrupt, which does not trigger. And it would trigger at > > some point as ALL cpus including CPU0 in that trac

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Linus Torvalds wrote: > On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote: > > Now that all looks correct. So there is something else going on. After > > staring some more at it, I think we are looking at it from the wrong > > angle. > > > > The watchdog always detects CP

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason
On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote: On Fri, 19 Dec 2014, Chris Mason wrote: On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI,

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote: > > Now that all looks correct. So there is something else going on. After > staring some more at it, I think we are looking at it from the wrong > angle. > > The watchdog always detects CPU1 as stuck and we got completely > fixated on the cs

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Chris Mason wrote: > On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > > Here's another pattern. In your latest thing, every single time that > > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > > doing this: > > > > [24998.060963] NMI back

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > doing this: > > [24998.060963] NMI backtrace for cpu 0 > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 12:54 PM, Dave Jones wrote: > > Right now I'm doing Chris' idea of "turn debugging back on, > and try without serial console". Shall I try your suggestion > on top of that ? Might as well. I doubt it really will make any difference, but I also don't think it will interact

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones
On Fri, Dec 19, 2014 at 12:46:16PM -0800, Linus Torvalds wrote: > On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds > wrote: > > > > I do note that we depend on the "new mwait" semantics where we do > > mwait with interrupts disabled and a non-zero RCX value. Are there > > possibly even any k

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds wrote: > > I do note that we depend on the "new mwait" semantics where we do > mwait with interrupts disabled and a non-zero RCX value. Are there > possibly even any known CPU errata in that area? Not that it sounds > likely, but still.. Remind me

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones
On Fri, Dec 19, 2014 at 03:31:36PM -0500, Chris Mason wrote: > > So it's not stuck *inside* read_hpet(), and it's almost certainly not > > the loop over the sequence counter in ktime_get() either (it's not > > increasing *that* quickly). But some basically infinite __run_hrtimer > > thing or s

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > doing this: > > [24998.060963] NMI backtrace for cpu 0 > [24998.061989] CPU: 0 PID: 2940 Co

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 11:15 AM, Linus Torvalds wrote: > > In your earlier trace (with spinlock debugging), the softlockup > detection was in lock_acquire for copy_page_range(), but CPU2 was > always in that "generic_exec_single" due to a TLB flush from that > zap_page_range thing again. But ther

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Peter Zijlstra
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > sched: RT throttling activated > > And after RT throttling, it's random (not even always trinity), but > that's probably because the watchdog thread doesn't run reliably any > more. So if we want to shoot that RT throttling

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 6:55 AM, Dave Jones wrote: > > With DEBUG_SPINLOCK disabled, I see the same behaviour. > Lots of traces spewed, but it seems to run and run (at least so far). Ok, so it's not spinlock debugging. There are some interesting patterns here, once again. Lookie: RIP: 0010:

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason
On Fri, Dec 19, 2014 at 9:55 AM, Dave Jones wrote: On Thu, Dec 18, 2014 at 08:48:24PM -0800, Linus Torvalds wrote: > On Thu, Dec 18, 2014 at 8:03 PM, Dave Jones wrote: > > > > So the only thing that was on that could cause spinlock overhead > > was DEBUG_SPINLOCK (and LOCK_STAT, though

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones
On Fri, Dec 19, 2014 at 09:30:37AM -0500, Chris Mason wrote: > > in more recent builds. I've been running kitchen-sink debug kernels > > for my trinity runs for the last three years, and it's only this > > last few months that this has got to be enough of a problem that I'm > > not seeing the

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason
On Thu, Dec 18, 2014 at 10:58 PM, Dave Jones wrote: On Thu, Dec 18, 2014 at 07:49:41PM -0800, Linus Torvalds wrote: > And when spinlocks start getting contention, *nested* spinlocks > really really hurt. And you've got all the spinlock debugging on etc, > don't you? Yeah, though rememb

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Peter Zijlstra
On Thu, Dec 18, 2014 at 08:48:24PM -0800, Linus Torvalds wrote: > On Thu, Dec 18, 2014 at 8:03 PM, Dave Jones wrote: > > > > So the only thing that was on that could cause spinlock overhead > > was DEBUG_SPINLOCK (and LOCK_STAT, though iirc that's not huge either) > > So DEBUG_SPINLOCK does have

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Linus Torvalds
On Thu, Dec 18, 2014 at 8:03 PM, Dave Jones wrote: > > So the only thing that was on that could cause spinlock overhead > was DEBUG_SPINLOCK (and LOCK_STAT, though iirc that's not huge either) So DEBUG_SPINLOCK does have one big downside if I recall correctly - the debugging spinlocks are very mu

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Dave Jones
On Thu, Dec 18, 2014 at 10:58:59PM -0500, Dave Jones wrote: > > lock debugging and other overheads (does this still have > > DEBUG_PAGEALLOC?) you really are getting into a "real" softlockup > > because things are scaling so horribly badly. > > > > If you now disable spinlock debugging a

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Dave Jones
On Thu, Dec 18, 2014 at 07:49:41PM -0800, Linus Torvalds wrote: > And when spinlocks start getting contention, *nested* spinlocks > really really hurt. And you've got all the spinlock debugging on etc, > don't you? Yeah, though remember this seems to have for some reason gotten worse in more

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Linus Torvalds
On Thu, Dec 18, 2014 at 6:45 PM, Dave Jones wrote: > > Example of the spew-o-rama below. Hmm. Not only does it apparently stay up for you now, the traces seem to be improving in quality. There's a decided pattern of "copy_page_range()" and "zap_page_range()" here. Now, what's *also* intriguing

Re: save_xstate_sig (Re: frequent lockups in 3.18rc4)

2014-12-18 Thread Andy Lutomirski
On Thu, Dec 18, 2014 at 1:34 PM, Linus Torvalds wrote: > On Thu, Dec 18, 2014 at 1:17 PM, Andy Lutomirski wrote: >> >> I admit that my understanding of the disaster that is x86's FPU handling is >> limited, but I'm moderately confident that save_xstate_sig is broken. > > Very possible. The FPU co

Re: save_xstate_sig (Re: frequent lockups in 3.18rc4)

2014-12-18 Thread Dave Jones
On Thu, Dec 18, 2014 at 01:17:59PM -0800, Andy Lutomirski wrote: > FWIW, if xsave traps with cr2 value, then there would indeed be an > infinite loop in here. It seems to work right on my machine. Dave, > want to run the attached little test? XSAVE to offset 0 [OK]xsave offset = 0, cr2

Re: save_xstate_sig (Re: frequent lockups in 3.18rc4)

2014-12-18 Thread Linus Torvalds
On Thu, Dec 18, 2014 at 1:17 PM, Andy Lutomirski wrote: > > I admit that my understanding of the disaster that is x86's FPU handling is > limited, but I'm moderately confident that save_xstate_sig is broken. Very possible. The FPU code *is* nasty. > The code is: > > if (user_has_fpu()) {

save_xstate_sig (Re: frequent lockups in 3.18rc4)

2014-12-18 Thread Andy Lutomirski
On 12/14/2014 09:47 PM, Linus Torvalds wrote: On Sun, Dec 14, 2014 at 4:38 PM, Linus Torvalds wrote: Can anybody make sense of that backtrace, keeping in mind that we're looking for some kind of endless loop where we don't make progress? So looking at all the backtraces, which is kind of mes

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Linus Torvalds
On Thu, Dec 18, 2014 at 7:54 AM, Chris Mason wrote: > > CPU 2 seems to be the one making the least progress. I think he's calling > fork and then trying to allocate a debug object for his hrtimer, eventually > wandering into fill_pool from __debug_object_init(): Good call. I agree - fill_pool()

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Dave Jones
On Thu, Dec 18, 2014 at 10:54:19AM -0500, Chris Mason wrote: > CPU 2 seems to be the one making the least progress. I think he's > calling fork and then trying to allocate a debug object for his > hrtimer, eventually wandering into fill_pool from __debug_object_init(): > > static void

Re: frequent lockups in 3.18rc4

2014-12-18 Thread Chris Mason
On Thu, Dec 18, 2014 at 12:13 AM, Dave Jones wrote: On Mon, Dec 15, 2014 at 03:46:41PM -0800, Linus Torvalds wrote: > On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds > wrote: > > > > So let's just fix it. Here's a completely untested patch. > > So after looking at this more, I'm actual

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Dave Jones
On Mon, Dec 15, 2014 at 03:46:41PM -0800, Linus Torvalds wrote: > On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds > wrote: > > > > So let's just fix it. Here's a completely untested patch. > > So after looking at this more, I'm actually really convinced that this > was a pretty nasty bug.

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Linus Torvalds
On Wed, Dec 17, 2014 at 6:42 PM, Sasha Levin wrote: > > I guess you did "just screwed up"... See the email to Dave, pick the fix from there, or from commit cf3c0a1579ef ("x86: mm: fix VM_FAULT_RETRY handling") Linus

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Sasha Levin
On 12/15/2014 06:46 PM, Linus Torvalds wrote: > I cleaned up the patch a bit, split it up into two to clarify it, and > have committed it to my tree. I'm not marking the patches for stable, > because while I'm convinced it's a bug, I'm also not sure why even if > it triggers it doesn't eventually r

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Dave Jones
On Wed, Dec 17, 2014 at 11:51:45AM -0800, Linus Torvalds wrote: > On Wed, Dec 17, 2014 at 10:57 AM, Dave Jones wrote: > > On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote: > > > > > I'm going to try your two patches on top of .18, with the same kernel > > > config, and see where t

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Linus Torvalds
On Wed, Dec 17, 2014 at 10:57 AM, Dave Jones wrote: > On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote: > > > I'm going to try your two patches on top of .18, with the same kernel > > config, and see where that takes us. > > Hopefully to happier places. > > Not so much. Died very qui

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Linus Torvalds
On Wed, Dec 17, 2014 at 10:22 AM, Dave Jones wrote: > > Here's save_xstate_sig: Ok, that just confirmed that it was the call to __clear_user and the "xsave64" instruction like expected. And the offset in __clear_user() was just the return address after the call to "might_fault", so this all match

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Dave Jones
On Wed, Dec 17, 2014 at 01:57:55PM -0500, Dave Jones wrote: > On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote: > > > I'm going to try your two patches on top of .18, with the same kernel > > config, and see where that takes us. > > Hopefully to happier places. > > Not so muc

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Dave Jones
On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote: > I'm going to try your two patches on top of .18, with the same kernel > config, and see where that takes us. > Hopefully to happier places. Not so much. Died very quickly. [ 270.822490] BUG: unable to handle kernel paging reques

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Dave Jones
On Sun, Dec 14, 2014 at 04:38:00PM -0800, Linus Torvalds wrote: > And I could fairly easily imagine endless page faults due to the > exception table, or even endless signal handling loops due to getting > a signal while trying to handle a signal. Both things that would > actually reasonably re

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Peter Zijlstra
On Wed, Dec 17, 2014 at 12:01:39PM -0500, Konrad Rzeszutek Wilk wrote: > > Linus, do you have a pointer to whatever version of the patch you tried? > > The patch was this: > > a) http://article.gmane.org/gmane.linux.kernel/1835331 > > Then Jurgen had a patch: > https://lkml.kernel.org/g/CA+55aFx

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Konrad Rzeszutek Wilk
On Tue, Dec 16, 2014 at 04:41:16PM -0800, Andy Lutomirski wrote: > On Tue, Dec 16, 2014 at 4:00 PM, Linus Torvalds > wrote: > > On Tue, Dec 16, 2014 at 3:02 PM, Peter Zijlstra > > wrote: > >> > >> OK, should we just stick it in the x86 tree and see if anything > >> explodes? ;-) > > > > Gaah, I

Re: frequent lockups in 3.18rc4

2014-12-17 Thread Peter Zijlstra
On Tue, Dec 02, 2014 at 08:33:53AM -0800, Linus Torvalds wrote: > On Tue, Dec 2, 2014 at 6:13 AM, Mike Galbraith > wrote: > > > > The bean counting problem below can contribute. > > > > https://lkml.org/lkml/2014/3/30/7 > > Hmm. That never got applied. I didn't apply it originally because of > t

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Andy Lutomirski
On Tue, Dec 16, 2014 at 4:00 PM, Linus Torvalds wrote: > On Tue, Dec 16, 2014 at 3:02 PM, Peter Zijlstra wrote: >> >> OK, should we just stick it in the x86 tree and see if anything >> explodes? ;-) > > Gaah, I got confused about the patches. > > And something did explode, it showed some Xen nast

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Linus Torvalds
On Tue, Dec 16, 2014 at 3:02 PM, Peter Zijlstra wrote: > > OK, should we just stick it in the x86 tree and see if anything > explodes? ;-) Gaah, I got confused about the patches. And something did explode, it showed some Xen nasties. Xen has that odd "we don't share PMD entries between MM's" thi

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Peter Zijlstra
On Tue, Dec 16, 2014 at 09:19:21PM +, Mel Gorman wrote: > On Tue, Dec 16, 2014 at 12:46:57PM -0800, Linus Torvalds wrote: > > On Tue, Dec 16, 2014 at 11:28 AM, Peter Zijlstra > > wrote: > > > > > > While going through this thread I wondered whatever became of this > > > patch. It seems a sham

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Mel Gorman
On Tue, Dec 16, 2014 at 12:46:57PM -0800, Linus Torvalds wrote: > On Tue, Dec 16, 2014 at 11:28 AM, Peter Zijlstra wrote: > > > > While going through this thread I wondered whatever became of this > > patch. It seems a shame to forget about it entirely. Maybe just queued > > for later while huntin

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Linus Torvalds
On Tue, Dec 16, 2014 at 11:28 AM, Peter Zijlstra wrote: > > While going through this thread I wondered whatever became of this > patch. It seems a shame to forget about it entirely. Maybe just queued > for later while hunting wabbits? Mel Gorman took it up, cleaned up some stuff, and I think it's

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Peter Zijlstra
On Fri, Nov 21, 2014 at 02:55:27PM -0800, Linus Torvalds wrote: > On Fri, Nov 21, 2014 at 1:11 PM, Thomas Gleixner wrote: > > > > I'm fine with that. I just think it's not horrid enough, but that can > > be fixed easily :) > > Oh, I think it's plenty horrid. > > Anyway, here's an actual patch. A

Re: frequent lockups in 3.18rc4

2014-12-16 Thread Paul E. McKenney
On Mon, Dec 15, 2014 at 08:16:04AM -0500, Sasha Levin wrote: > On 12/15/2014 07:56 AM, Paul E. McKenney wrote: > > And maybe it would help if I did the CONFIG_TASKS_RCU=n case as well as > > the CONFIG_TASKS_RCU=y case. Please see below for an updated patch. > > I do have CONFIG_TASKS_RCU=y OK,

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Hillf Danton
> > But me not seeing any other bug clearly doesn't mean it doesn't exist. > Perhaps we can ease the zap loop if it is busy. thanks Hillf --- a/mm/memory.c Tue Dec 16 10:38:03 2014 +++ b/mm/memory.c Tue Dec 16 10:42:07 2014 @@ -1212,8 +1212,10 @@ again: force_flush

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Linus Torvalds
On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds wrote: > > So let's just fix it. Here's a completely untested patch. So after looking at this more, I'm actually really convinced that this was a pretty nasty bug. I'm *not* convinced that it's necessarily *your* bug, but I still think it could be

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Linus Torvalds
On Sun, Dec 14, 2014 at 9:57 PM, Dave Jones wrote: > > We had a flashback to that old bug last month too. > See this mail & your followup. : https://lkml.org/lkml/2014/11/25/1171 > That was during a bisect though, so may have been something > entirely different, but it is a spooky coincidence. Ye

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Borislav Petkov
On Sun, Dec 14, 2014 at 09:47:26PM -0800, Linus Torvalds wrote: > and "save_xstate_sig+0x81" shows up on all stacks, although only on > CPU1 does it show up as a "guaranteed" part of the stack chain (ie it > matches frame pointer data too). CPU1 also has that __clear_user show > up (which is called

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Sasha Levin
On 12/15/2014 07:56 AM, Paul E. McKenney wrote: > And maybe it would help if I did the CONFIG_TASKS_RCU=n case as well as > the CONFIG_TASKS_RCU=y case. Please see below for an updated patch. I do have CONFIG_TASKS_RCU=y Thanks, Sasha

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Paul E. McKenney
On Sun, Dec 14, 2014 at 10:33:31PM -0800, Paul E. McKenney wrote: > On Sun, Dec 14, 2014 at 08:20:13PM -0500, Sasha Levin wrote: > > On 12/14/2014 07:11 PM, Paul E. McKenney wrote: > > >> Does it depend on anything not currently in -next? My build fails with > > >> > > > >> > kernel/rcu/tree.c: In

Re: frequent lockups in 3.18rc4

2014-12-15 Thread Martin van Es
On Fri, Dec 12, 2014 at 1:58 PM, Martin van Es wrote: > On Sat, Dec 6, 2014 at 9:09 PM, Linus Torvalds > I will give 3.18 a try on production J1900. Knowing I can go back to > safety in 3.16.7 won't hurt too much of my reputation I hope. 3.18 froze twice (just to be sure) as well. Will commence t

Re: frequent lockups in 3.18rc4

2014-12-14 Thread Paul E. McKenney
On Sun, Dec 14, 2014 at 08:20:13PM -0500, Sasha Levin wrote: > On 12/14/2014 07:11 PM, Paul E. McKenney wrote: > >> Does it depend on anything not currently in -next? My build fails with > >> > > >> > kernel/rcu/tree.c: In function ‘rcu_report_qs_rdp’: > >> > kernel/rcu/tree.c:2099:6: error: ‘stru
