On Thu, Feb 12, 2015 at 3:09 AM, Martin van Es wrote:
>
> Best I can come up with now is try the next mainline that has all the
> fixes and ideas in this thread incorporated. Would that be 3.19?
Yes. I'm attaching a patch (very much experimental - it might
introduce new problems rather than fix o
To follow up on this long-standing promise to bisect:
I've made two attempts at bisecting and both landed in limbo. It's
hard to explain but it feels like this bug has quantum properties;
I know for sure it's present in 3.17 and not in 3.16(.7). But once I
start bisecting it gets less pronounced.
On Sun, 21 Dec 2014, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote:
> > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> > >
> > > And finally, and stupidly, is there any chance that you have anything
> > > accessing /dev/hpet?
> >
> > Not knowingly a
On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds
wrote:
> On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote:
>>
>> Anyway, it may be worth keeping the 50% margin (and dropping the 12%
>> reduction to simplify things)
>
> Again, the 50% margin is only on the multiplication overflow. Not on the mask.
On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote:
>
> Anyway, it may be worth keeping the 50% margin (and dropping the 12%
> reduction to simplify things)
Again, the 50% margin is only on the multiplication overflow. Not on the mask.
So it won't do anything at all for the case we actually care
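For illustration, a minimal stand-alone sketch of the arithmetic being argued about: converting cycles to nanoseconds is (cycles * mult) >> shift, so one bound on how long the system may idle comes from keeping that 64-bit multiplication from overflowing (which is where the 50% margin applies), and a separate bound comes from the counter mask itself, which gets no margin at all. The mult/shift/mask values below are hypothetical HPET-like numbers, not taken from the thread.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical HPET-like clocksource: ~14.3 MHz, 32-bit counter. */
	uint32_t mult = 292935555;
	uint32_t shift = 22;
	uint64_t mask = 0xffffffffULL;

	/* Bound 1: keep cycles * mult from overflowing 64 bits, with a 50% margin. */
	uint64_t max_cycles = UINT64_MAX / mult;
	max_cycles /= 2;

	/* Bound 2: the counter mask; note that no margin is applied to it here. */
	if (max_cycles > mask)
		max_cycles = mask;

	uint64_t max_ns = (max_cycles * mult) >> shift;
	printf("max idle: %llu cycles, about %.1f s\n",
	       (unsigned long long)max_cycles, max_ns / 1e9);
	return 0;
}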
On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds
wrote:
> On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote:
>>
>> So I sent out a first step validation check to warn us if we end up
>> with idle periods that are larger than we expect.
>
> .. not having tested it, this is just from reading the pat
On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote:
>
> So I sent out a first step validation check to warn us if we end up
> with idle periods that are larger than we expect.
.. not having tested it, this is just from reading the patch, but it
would *seem* that it doesn't actually validate the cl
On 01/02/2015 07:27 PM, John Stultz wrote:
> On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
> wrote:
>> > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones
>> > wrote:
>>> >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>> >>
>>> >> > One thing I think I'll try is to try and narrow
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
wrote:
> On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote:
>> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>
>> > One thing I think I'll try is to try and narrow down which
>> > syscalls are triggering those "Clocksource hpet ha
On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote:
> On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote:
> >
> > * So 1/8th of the interval seems way too short, as there are
> > clocksources like the ACPI PM, which wrap every 2.5 seconds or so.
>
> Ugh. At the same time, 1/8th of a rang
On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote:
> > >
> > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
> > > "switched to hpet" messages there yet?
> >
> > not yet. 3 hrs in.
>
> Ok, so then th
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote:
> >
> > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
> > "switched to hpet" messages there yet?
>
> not yet. 3 hrs in.
Ok, so then the
INFO: rcu_preempt detected stalls on CPUs/tasks:
has nothing to do with HPET, s
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > still running though..
>
> Btw, did you ever boot with "tsc=reliable" as a kernel command line option?
I'll check it again in the morning, but before I turn in for th
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > still running though..
>
> Btw, did you ever boot with "tsc=reliable" as a kernel command line option?
I don't think so.
> For the last night, can you see if you ca
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > hm.
>
> So with the previous patch that had the false positives, you never saw
> this? You saw the false positives instead?
correct.
> I'm wondering if the added
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
>
> still running though..
Btw, did you ever boot with "tsc=reliable" as a kernel command line option?
For the last night, can you see if you can just run it with that, and
things work? Because by now, my gut feel is that we should start
deratin
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
>
> hm.
So with the previous patch that had the false positives, you never saw
this? You saw the false positives instead?
I'm wondering if the added debug noise just ended up helping. Doing a
printk() will automatically cause some scheduler acti
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
hm.
[ 2733.047100]
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
I'll give it a spin a
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote:
> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>
> > One thing I think I'll try is to try and narrow down which
> > syscalls are triggering those "Clocksource hpet had cycles off"
> > messages. I'm still unclear on exactly what
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
> One thing I think I'll try is to try and narrow down which
> syscalls are triggering those "Clocksource hpet had cycles off"
> messages. I'm still unclear on exactly what is doing
> the stomping on the hpet.
First I ran trinity wi
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
>
> > But in the meantime please do keep that thing running as long as you
> > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> > result -
On 12/23/2014 09:56 AM, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
>
> > But in the meantime please do keep that thing running as long as you
> > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> > result - the original softlockup
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> But in the meantime please do keep that thing running as long as you
> can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> result - the original softlockup bug happening *without* any bigger
> hpet jumps.
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> But in the meantime please do keep that thing running as long as you
> can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> result - the original softlockup bug happening *without* any bigger
> hpet jumps.
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote:
>
> * So 1/8th of the interval seems way too short, as there are
> clocksources like the ACPI PM, which wrap every 2.5 seconds or so.
Ugh. At the same time, 1/8th of a range is actually bigger than I'd
like, since if there is some timer corruption,
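For context, a small stand-alone sketch of the kind of check being sized here (not the actual patch): a masked delta is correct across at most one counter wrap, and the debate is how large a fraction of the counter range a single delta may cover before it is treated as suspect. The 24-bit mask below stands in for the ACPI PM timer mentioned above; all numbers are made up for the example.

#include <stdint.h>
#include <stdio.h>

static uint64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;	/* correct across at most one wrap */
}

int main(void)
{
	uint64_t mask = 0xffffff;	/* 24-bit ACPI-PM-style counter */
	uint64_t last = 0xfffff0;
	uint64_t now  = 0x000010;	/* read just after a wrap */

	uint64_t delta = masked_delta(now, last, mask);
	if (delta > mask / 8)		/* the "1/8th of the range" threshold */
		printf("suspicious delta: %llu cycles\n", (unsigned long long)delta);
	else
		printf("delta of %llu cycles looks sane\n", (unsigned long long)delta);
	return 0;
}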
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
> wrote:
>>
>> This is *not* to say that this is the bug you're hitting. But it does show
>> that
>>
>> (a) a flaky HPET can do some seriously bad stuff
>> (b) the kernel is very fragile w
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones wrote:
>
> I tried the nohpet thing for a few hours this morning and didn't see
> anything weird, but it may have been that I just didn't run long enough.
> When I saw your patch, I gave that a shot instead, with hpet enabled
> again. Just got back to f
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote:
> And again: this is not trying to make the kernel clock not jump. There
> is no way I can come up with even in theory to try to really *fix* a
> fundamentally broken clock.
>
> So this is not meant to be a real "fix" for anythin
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
wrote:
>
> .. and we might still lock up under some circumstances. But at least
> from my limited testing, it is infinitely much better, even if it
> might not be perfect. Also note that my "testing" has been writing
> zero to the HPET lock (so the
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does show
> that
>
> (a) a flaky HPET can do some seriously bad stuff
> (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test program
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
> wrote:
> >
> > The second time (or third, or fourth - it might not take immediately)
> > you get a lockup or similar. Bad things happen.
>
> I've only tested it twice now, but the f
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
> > The second time (or third, or fourth - it might not take immediately)
> > you get a lockup or similar. Bad things happen.
>
> I've only tested it twice now, but the first time I got a weird
> lockup-like thing (things *kind*
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
wrote:
>
> The second time (or third, or fourth - it might not take immediately)
> you get a lockup or similar. Bad things happen.
I've only tested it twice now, but the first time I got a weird
lockup-like thing (things *kind* of worked, but I coul
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds
wrote:
>
> I can do the mmap(/dev/mem) thing and access the HPET by hand, and
> when I write zero to it I immediately get something like this:
>
> Clocksource tsc unstable (delta = -284317725450 ns)
> Switched to clocksource hpet
>
> just to conf
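The experiment described above looks roughly like the stand-alone sketch below. The HPET base address (0xFED00000) and main-counter offset (0xF0) are the usual PC values but are assumptions here; running anything like this needs root, an unrestricted /dev/mem, and a willingness to watch the kernel declare the TSC unstable.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	off_t hpet_base = 0xFED00000;	/* assumed: typical HPET MMIO base */
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}
	volatile uint8_t *hpet = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				      MAP_SHARED, fd, hpet_base);
	if (hpet == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	volatile uint64_t *counter = (volatile uint64_t *)(hpet + 0xF0);
	printf("HPET main counter: %llu\n", (unsigned long long)*counter);
	*counter = 0;	/* the write that makes the TSC watchdog cry foul */
	munmap((void *)hpet, 4096);
	close(fd);
	return 0;
}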
On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote:
> On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> >
> > And finally, and stupidly, is there any chance that you have anything
> > accessing /dev/hpet?
>
> Not knowingly at least, but who the hell knows what systemd has its
> fi
On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> > So the range of 1-251 seconds is not entirely random. It's all in
> > that "32-bit HPET range".
>
> DaveJ, I assume it's too late now, and you don't effectively have any
> access to the machine any more, but "hpet=disable"
On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds
wrote:
>
> So the range of 1-251 seconds is not entirely random. It's all in
> that "32-bit HPET range".
DaveJ, I assume it's too late now, and you don't effectively have any
access to the machine any more, but "hpet=disable" or "nohpet" on the
com
On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds
wrote:
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see anything wrong. The whole "update_wall_clock" and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid update_wall_cloc
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote:
> On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
> wrote:
> >
> > How/where is the HPET overflow case handled? I don't know the code enough.
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see any
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
wrote:
>
> How/where is the HPET overflow case handled? I don't know the code enough.
Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
can't see anything wrong. The whole "update_wall_clock" and the shadow
timekeeping state is
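A tiny stand-alone illustration of why there is no clean "HPET overflow case" to handle: once the CPU has been idle for more than a full wrap of the 32-bit counter, the reading is indistinguishable from a much shorter idle, so the timekeeping core cannot reconstruct the lost time after the fact. The numbers below are arbitrary.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t mask = 0xffffffffULL;	/* 32-bit HPET counter */
	uint64_t last = 1000;

	uint64_t short_idle = 5000000;			/* cycles */
	uint64_t long_idle  = short_idle + (mask + 1);	/* plus one full wrap */

	printf("counter after short idle: %llu\n",
	       (unsigned long long)((last + short_idle) & mask));
	printf("counter after long  idle: %llu  (identical)\n",
	       (unsigned long long)((last + long_idle) & mask));
	return 0;
}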
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds
wrote:
>
> I'm claiming that the race happened *once*. And it then corrupted some
> data structure or similar sufficiently that CPU0 keeps looping.
>
> Perhaps something keeps re-adding itself to the head of the timerqueue
> due to the race.
So tick
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote:
> > Right now I'm doing Chris' idea of "turn debugging back on,
> > and try without serial console". Shall I try your suggestion
> > on top of that ?
>
> Might as well. I doubt it really will make any difference, but I also
> d
On Fri, Dec 19, 2014 at 5:00 PM, Thomas Gleixner wrote:
>
> The watchdog timer runs on a fully periodic schedule. It's self
> rearming via
>
> hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
>
> So if that aligns with the equally periodic tick interrupt on the
> other CPU then y
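The self-rearming pattern Thomas describes looks roughly like the kernel-style fragment below (a simplified sketch, not the actual watchdog code): the callback pushes its own expiry forward by sample_period and asks to be restarted, so it fires on a strictly periodic schedule relative to its clock base.

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static u64 sample_period = 4ULL * NSEC_PER_SEC;	/* assumed period for the sketch */

static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
	/* ... per-period work (touch counters, check for stalls) ... */

	/* Move our own expiry forward by one period and rearm. */
	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
	return HRTIMER_RESTART;
}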
On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote:
> > But at the very end this would be detected by the runtime check of the
> > hrtimer interrupt, which does not trigger. And it would trigger at
> > some point as ALL cpus including CPU0 in that trac
On Fri, 19 Dec 2014, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote:
> > Now that all looks correct. So there is something else going on. After
> > staring some more at it, I think we are looking at it from the wrong
> > angle.
> >
> > The watchdog always detects CP
On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner
wrote:
On Fri, 19 Dec 2014, Chris Mason wrote:
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI,
On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote:
>
> Now that all looks correct. So there is something else going on. After
> staring some more at it, I think we are looking at it from the wrong
> angle.
>
> The watchdog always detects CPU1 as stuck and we got completely
> fixated on the cs
On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> > Here's another pattern. In your latest thing, every single time that
> > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> > doing this:
> >
> > [24998.060963] NMI back
On Fri, 19 Dec 2014, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
>
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not
On Fri, Dec 19, 2014 at 12:54 PM, Dave Jones wrote:
>
> Right now I'm doing Chris' idea of "turn debugging back on,
> and try without serial console". Shall I try your suggestion
> on top of that ?
Might as well. I doubt it really will make any difference, but I also
don't think it will interact
On Fri, Dec 19, 2014 at 12:46:16PM -0800, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds
> wrote:
> >
> > I do note that we depend on the "new mwait" semantics where we do
> > mwait with interrupts disabled and a non-zero RCX value. Are there
> > possibly even any k
On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds
wrote:
>
> I do note that we depend on the "new mwait" semantics where we do
> mwait with interrupts disabled and a non-zero RCX value. Are there
> possibly even any known CPU errata in that area? Not that it sounds
> likely, but still..
Remind me
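The "new mwait" semantics being referred to look roughly like the fragment below (a simplified kernel-context sketch, not the actual idle loop): the CPU arms a monitor on a memory word and then executes mwait with bit 0 of ECX set, which tells the hardware to treat a pending interrupt as a break event even though interrupts are disabled.

/* Kernel-context illustration only; not meant to be run from user space. */
static inline void monitor_mwait_sketch(volatile void *addr, unsigned long hints)
{
	/* Arm the monitor on the cache line containing *addr. */
	asm volatile("monitor" : : "a"(addr), "c"(0UL), "d"(0UL));

	/* ECX bit 0: a pending interrupt breaks the wait even while IF=0. */
	asm volatile("mwait" : : "a"(hints), "c"(1UL));
}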
On Fri, Dec 19, 2014 at 03:31:36PM -0500, Chris Mason wrote:
> > So it's not stuck *inside* read_hpet(), and it's almost certainly not
> > the loop over the sequence counter in ktime_get() either (it's not
> > increasing *that* quickly). But some basically infinite __run_hrtimer
> > thing or s
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
>
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940 Co
On Fri, Dec 19, 2014 at 11:15 AM, Linus Torvalds
wrote:
>
> In your earlier trace (with spinlock debugging), the softlockup
> detection was in lock_acquire for copy_page_range(), but CPU2 was
> always in that "generic_exec_single" due to a TLB flush from that
> zap_page_range thing again. But ther
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> sched: RT throttling activated
>
> And after RT throttling, it's random (not even always trinity), but
> that's probably because the watchdog thread doesn't run reliably any
> more.
So if we want to shoot that RT throttling
On Fri, Dec 19, 2014 at 6:55 AM, Dave Jones wrote:
>
With DEBUG_SPINLOCK disabled, I see the same behaviour.
> Lots of traces spewed, but it seems to run and run (at least so far).
Ok, so it's not spinlock debugging.
There are some interesting patters here, once again. Lookie:
RIP: 0010:
On Fri, Dec 19, 2014 at 9:55 AM, Dave Jones wrote:
On Thu, Dec 18, 2014 at 08:48:24PM -0800, Linus Torvalds wrote:
> On Thu, Dec 18, 2014 at 8:03 PM, Dave Jones
> wrote:
> >
> > So the only thing that was on that could cause spinlock overhead
> > was DEBUG_SPINLOCK (and LOCK_STAT, though
On Fri, Dec 19, 2014 at 09:30:37AM -0500, Chris Mason wrote:
> > in more recent builds. I've been running kitchen-sink debug kernels
> > for my trinity runs for the last three years, and it's only this
> > last few months that this has got to be enough of a problem that I'm
> > not seeing the
On Thu, Dec 18, 2014 at 10:58 PM, Dave Jones wrote:
On Thu, Dec 18, 2014 at 07:49:41PM -0800, Linus Torvalds wrote:
> And when spinlocks start getting contention, *nested* spinlocks
> really really hurt. And you've got all the spinlock debugging on etc,
> don't you?
Yeah, though rememb
On Thu, Dec 18, 2014 at 08:48:24PM -0800, Linus Torvalds wrote:
> On Thu, Dec 18, 2014 at 8:03 PM, Dave Jones wrote:
> >
> > So the only thing that was on that could cause spinlock overhead
> > was DEBUG_SPINLOCK (and LOCK_STAT, though iirc that's not huge either)
>
> So DEBUG_SPINLOCK does have
On Thu, Dec 18, 2014 at 8:03 PM, Dave Jones wrote:
>
> So the only thing that was on that could cause spinlock overhead
> was DEBUG_SPINLOCK (and LOCK_STAT, though iirc that's not huge either)
So DEBUG_SPINLOCK does have one big downside if I recall correctly -
the debugging spinlocks are very mu
On Thu, Dec 18, 2014 at 10:58:59PM -0500, Dave Jones wrote:
> > lock debugging and other overheads (does this still have
> > DEBUG_PAGEALLOC?) you really are getting into a "real" softlockup
> > because things are scaling so horribly badly.
> >
> > If you now disable spinlock debugging a
On Thu, Dec 18, 2014 at 07:49:41PM -0800, Linus Torvalds wrote:
> And when spinlocks start getting contention, *nested* spinlocks
> really really hurt. And you've got all the spinlock debugging on etc,
> don't you?
Yeah, though remember this seems to have for some reason gotten worse
in more
On Thu, Dec 18, 2014 at 6:45 PM, Dave Jones wrote:
>
> Example of the spew-o-rama below.
Hmm. Not only does it apparently stay up for you now, the traces seem
to be improving in quality.
There's a decided pattern of "copy_page_range()" and "zap_page_range()" here.
Now, what's *also* intriguing
On Thu, Dec 18, 2014 at 1:34 PM, Linus Torvalds
wrote:
> On Thu, Dec 18, 2014 at 1:17 PM, Andy Lutomirski wrote:
>>
>> I admit that my understanding of the disaster that is x86's FPU handling is
>> limited, but I'm moderately confident that save_xstate_sig is broken.
>
> Very possible. The FPU co
On Thu, Dec 18, 2014 at 01:17:59PM -0800, Andy Lutomirski wrote:
> FWIW, if xsave traps with cr2 value, then there would indeed be an
> infinite loop in here. It seems to work right on my machine. Dave,
> want to run the attached little test?
XSAVE to offset 0
[OK]xsave offset = 0, cr2
On Thu, Dec 18, 2014 at 1:17 PM, Andy Lutomirski wrote:
>
> I admit that my understanding of the disaster that is x86's FPU handling is
> limited, but I'm moderately confident that save_xstate_sig is broken.
Very possible. The FPU code *is* nasty.
> The code is:
>
> if (user_has_fpu()) {
On 12/14/2014 09:47 PM, Linus Torvalds wrote:
On Sun, Dec 14, 2014 at 4:38 PM, Linus Torvalds
wrote:
Can anybody make sense of that backtrace, keeping in mind that we're
looking for some kind of endless loop where we don't make progress?
So looking at all the backtraces, which is kind of mes
On Thu, Dec 18, 2014 at 7:54 AM, Chris Mason wrote:
>
> CPU 2 seems to be the one making the least progress. I think he's calling
> fork and then trying to allocate a debug object for his hrtimer, eventually
> wandering into fill_pool from __debug_object_init():
Good call.
I agree - fill_pool()
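For reference, a simplified sketch of the kind of refill loop being pointed at (modeled loosely on fill_pool() in lib/debugobjects.c; the identifiers only approximate the real ones and this is not the verbatim code): a debug-object initialization may top the object pool back up with atomic allocations, adding allocator and pool-lock traffic to every fork when object debugging is enabled.

static void fill_pool_sketch(void)
{
	gfp_t gfp = GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN;
	unsigned long flags;

	while (obj_pool_free < ODEBUG_POOL_MIN_LEVEL) {
		struct debug_obj *new = kmem_cache_zalloc(obj_cache, gfp);
		if (!new)
			return;		/* give up quietly under memory pressure */

		raw_spin_lock_irqsave(&pool_lock, flags);
		hlist_add_head(&new->node, &obj_pool);
		obj_pool_free++;
		raw_spin_unlock_irqrestore(&pool_lock, flags);
	}
}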
On Thu, Dec 18, 2014 at 10:54:19AM -0500, Chris Mason wrote:
> CPU 2 seems to be the one making the least progress. I think he's
> calling fork and then trying to allocate a debug object for his
> hrtimer, eventually wandering into fill_pool from __debug_object_init():
>
> static void
On Thu, Dec 18, 2014 at 12:13 AM, Dave Jones wrote:
On Mon, Dec 15, 2014 at 03:46:41PM -0800, Linus Torvalds wrote:
> On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds
> wrote:
> >
> > So let's just fix it. Here's a completely untested patch.
>
> So after looking at this more, I'm actual
On Mon, Dec 15, 2014 at 03:46:41PM -0800, Linus Torvalds wrote:
> On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds
> wrote:
> >
> > So let's just fix it. Here's a completely untested patch.
>
> So after looking at this more, I'm actually really convinced that this
> was a pretty nasty bug.
On Wed, Dec 17, 2014 at 6:42 PM, Sasha Levin wrote:
>
> I guess you did "just screwed up"...
See the email to Dave, pick the fix from there, or from commit
cf3c0a1579ef ("x86: mm: fix VM_FAULT_RETRY handling")
Linus
On 12/15/2014 06:46 PM, Linus Torvalds wrote:
> I cleaned up the patch a bit, split it up into two to clarify it, and
> have committed it to my tree. I'm not marking the patches for stable,
> because while I'm convinced it's a bug, I'm also not sure why even if
> it triggers it doesn't eventually r
On Wed, Dec 17, 2014 at 11:51:45AM -0800, Linus Torvalds wrote:
> On Wed, Dec 17, 2014 at 10:57 AM, Dave Jones wrote:
> > On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote:
> >
> > > I'm going to try your two patches on top of .18, with the same kernel
> > > config, and see where t
On Wed, Dec 17, 2014 at 10:57 AM, Dave Jones wrote:
> On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote:
>
> > I'm going to try your two patches on top of .18, with the same kernel
> > config, and see where that takes us.
> > Hopefully to happier places.
>
> Not so much. Died very qui
On Wed, Dec 17, 2014 at 10:22 AM, Dave Jones wrote:
>
> Here's save_xstate_sig:
Ok, that just confirmed that it was the call to __clear_user and the
"xsave64" instruction like expected. And the offset in __clear_user()
was just the return address after the call to "might_fault", so this
all match
On Wed, Dec 17, 2014 at 01:57:55PM -0500, Dave Jones wrote:
> On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote:
>
> > I'm going to try your two patches on top of .18, with the same kernel
> > config, and see where that takes us.
> > Hopefully to happier places.
>
> Not so muc
On Wed, Dec 17, 2014 at 01:22:41PM -0500, Dave Jones wrote:
> I'm going to try your two patches on top of .18, with the same kernel
> config, and see where that takes us.
> Hopefully to happier places.
Not so much. Died very quickly.
[ 270.822490] BUG: unable to handle kernel paging reques
On Sun, Dec 14, 2014 at 04:38:00PM -0800, Linus Torvalds wrote:
> And I could fairly easily imagine endless page faults due to the
> exception table, or even endless signal handling loops due to getting
> a signal while trying to handle a signal. Both things that would
> actually reasonably re
On Wed, Dec 17, 2014 at 12:01:39PM -0500, Konrad Rzeszutek Wilk wrote:
> > Linus, do you have a pointer to whatever version of the patch you tried?
>
> The patch was this:
>
> a) http://article.gmane.org/gmane.linux.kernel/1835331
>
> Then Jurgen had a patch:
> https://lkml.kernel.org/g/CA+55aFx
On Tue, Dec 16, 2014 at 04:41:16PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 16, 2014 at 4:00 PM, Linus Torvalds
> wrote:
> > On Tue, Dec 16, 2014 at 3:02 PM, Peter Zijlstra
> > wrote:
> >>
> >> OK, should we just stick it in the x86 tree and see if anything
> >> explodes? ;-)
> >
> > Gaah, I
On Tue, Dec 02, 2014 at 08:33:53AM -0800, Linus Torvalds wrote:
> On Tue, Dec 2, 2014 at 6:13 AM, Mike Galbraith
> wrote:
> >
> > The bean counting problem below can contribute.
> >
> > https://lkml.org/lkml/2014/3/30/7
>
> Hmm. That never got applied. I didn't apply it originally because of
> t
On Tue, Dec 16, 2014 at 4:00 PM, Linus Torvalds
wrote:
> On Tue, Dec 16, 2014 at 3:02 PM, Peter Zijlstra wrote:
>>
>> OK, should we just stick it in the x86 tree and see if anything
>> explodes? ;-)
>
> Gaah, I got confused about the patches.
>
> And something did explode, it showed some Xen nast
On Tue, Dec 16, 2014 at 3:02 PM, Peter Zijlstra wrote:
>
> OK, should we just stick it in the x86 tree and see if anything
> explodes? ;-)
Gaah, I got confused about the patches.
And something did explode, it showed some Xen nasties. Xen has that
odd "we don't share PMD entries between MM's" thi
On Tue, Dec 16, 2014 at 09:19:21PM +0000, Mel Gorman wrote:
> On Tue, Dec 16, 2014 at 12:46:57PM -0800, Linus Torvalds wrote:
> > On Tue, Dec 16, 2014 at 11:28 AM, Peter Zijlstra
> > wrote:
> > >
> > > While going through this thread I wondered whatever became of this
> > > patch. It seems a sham
On Tue, Dec 16, 2014 at 12:46:57PM -0800, Linus Torvalds wrote:
> On Tue, Dec 16, 2014 at 11:28 AM, Peter Zijlstra wrote:
> >
> > While going through this thread I wondered whatever became of this
> > patch. It seems a shame to forget about it entirely. Maybe just queued
> > for later while huntin
On Tue, Dec 16, 2014 at 11:28 AM, Peter Zijlstra wrote:
>
> While going through this thread I wondered whatever became of this
> patch. It seems a shame to forget about it entirely. Maybe just queued
> for later while hunting wabbits?
Mel Gorman took it up, cleaned up some stuff, and I think it's
On Fri, Nov 21, 2014 at 02:55:27PM -0800, Linus Torvalds wrote:
> On Fri, Nov 21, 2014 at 1:11 PM, Thomas Gleixner wrote:
> >
> > I'm fine with that. I just think it's not horrid enough, but that can
> > be fixed easily :)
>
> Oh, I think it's plenty horrid.
>
> Anyway, here's an actual patch. A
On Mon, Dec 15, 2014 at 08:16:04AM -0500, Sasha Levin wrote:
> On 12/15/2014 07:56 AM, Paul E. McKenney wrote:
> > And maybe it would help if I did the CONFIG_TASKS_RCU=n case as well as
> > the CONFIG_TASKS_RCU=y case. Please see below for an updated patch.
>
> I do have CONFIG_TASKS_RCU=y
OK,
>
> But me not seeing any other bug clearly doesn't mean it doesn't exist.
>
Perhaps we can ease the zap loop if it is busy.
thanks
Hillf
--- a/mm/memory.c Tue Dec 16 10:38:03 2014
+++ b/mm/memory.c Tue Dec 16 10:42:07 2014
@@ -1212,8 +1212,10 @@ again:
force_flush
On Mon, Dec 15, 2014 at 10:21 AM, Linus Torvalds
wrote:
>
> So let's just fix it. Here's a completely untested patch.
So after looking at this more, I'm actually really convinced that this
was a pretty nasty bug.
I'm *not* convinced that it's necessarily *your* bug, but I still
think it could be
On Sun, Dec 14, 2014 at 9:57 PM, Dave Jones wrote:
>
> We had a flashback to that old bug last month too.
> See this mail & your followup. : https://lkml.org/lkml/2014/11/25/1171
> That was during a bisect though, so may have been something
> entirely different, but it is a spooky coincidence.
Ye
On Sun, Dec 14, 2014 at 09:47:26PM -0800, Linus Torvalds wrote:
> and "save_xstate_sig+0x81" shows up on all stacks, although only on
> CPU1 does it show up as a "guaranteed" part of the stack chain (ie it
> matches frame pointer data too). CPU1 also has that __clear_user show
> up (which is called
On 12/15/2014 07:56 AM, Paul E. McKenney wrote:
> And maybe it would help if I did the CONFIG_TASKS_RCU=n case as well as
> the CONFIG_TASKS_RCU=y case. Please see below for an updated patch.
I do have CONFIG_TASKS_RCU=y
Thanks,
Sasha
On Sun, Dec 14, 2014 at 10:33:31PM -0800, Paul E. McKenney wrote:
> On Sun, Dec 14, 2014 at 08:20:13PM -0500, Sasha Levin wrote:
> > On 12/14/2014 07:11 PM, Paul E. McKenney wrote:
> > >> Does it depend on anything not currently in -next? My build fails with
> > >> >
> > >> > kernel/rcu/tree.c: In
On Fri, Dec 12, 2014 at 1:58 PM, Martin van Es wrote:
> On Sat, Dec 6, 2014 at 9:09 PM, Linus Torvalds
> I will give 3.18 a try on production J1900. Knowing I can go back to
> safety in 3.16.7 won't hurt too much of my reputation I hope.
3.18 froze twice (just to be sure) as well. Will commence t
On Sun, Dec 14, 2014 at 08:20:13PM -0500, Sasha Levin wrote:
> On 12/14/2014 07:11 PM, Paul E. McKenney wrote:
> >> Does it depend on anything not currently in -next? My build fails with
> >> >
> >> > kernel/rcu/tree.c: In function ‘rcu_report_qs_rdp’:
> >> > kernel/rcu/tree.c:2099:6: error: ‘stru