Hello, Mike.
On Tue, Feb 09, 2016 at 07:02:35PM +0100, Mike Galbraith wrote:
> > It doesn't do anything unless the user twiddles the mask to exclude
> > certain (think no_hz_full) CPUs, so there are no clueless victims.
>
> (a plus: testers/robots can twiddle mask to help find bugs, _and_
> nohz_
On Tue, Feb 9, 2016 at 9:51 AM, Tejun Heo wrote:
>>
>> (a) actually dequeue timers and work queues that are bound to a
>> particular CPU when a CPU goes down.
>>
> The same goes for work items and timers. If we want to do
> explicit dequeueing or flushing of cpu-bound stuff on cpu down, we'
On Tue, 2016-02-09 at 18:56 +0100, Mike Galbraith wrote:
> On Tue, 2016-02-09 at 12:54 -0500, Tejun Heo wrote:
> > Hello, Mike.
> >
> > On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote:
> > > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask
> > > CPUs
> > >
> > > WORK
On Tue, 2016-02-09 at 12:54 -0500, Tejun Heo wrote:
> Hello, Mike.
>
> On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote:
> > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask
> > CPUs
> >
> > WORK_CPU_UNBOUND work items queued to a bound workqueue always run
> > locall
Hello, Mike.
On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote:
> workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs
>
> WORK_CPU_UNBOUND work items queued to a bound workqueue always run
> locally. This is a good thing normally, but not when the user has
> asked u
Hello,
On Tue, Feb 09, 2016 at 09:04:18AM -0800, Linus Torvalds wrote:
> On Tue, Feb 9, 2016 at 8:50 AM, Tejun Heo wrote:
> > idk, not doing so is likely to cause subtle bugs which are difficult
> > to track down. The problem with -stable is 874bbfe6 being backported
> > without the matching tim
On Tue, 2016-02-09 at 11:50 -0500, Tejun Heo wrote:
> Hello,
>
> On Tue, Feb 09, 2016 at 08:39:15AM -0800, Linus Torvalds wrote:
> > > A niggling question remaining is when is it gonna be killed?
> >
> > It probably should be killed sooner rather than later.
> >
> > Just document that if you nee
On Tue, Feb 9, 2016 at 8:50 AM, Tejun Heo wrote:
>
> idk, not doing so is likely to cause subtle bugs which are difficult
> to track down. The problem with -stable is 874bbfe6 being backported
> without the matching timer fix.
Well, according to this thread, even with the timer fix the end result
Hello,
On Tue, Feb 09, 2016 at 08:39:15AM -0800, Linus Torvalds wrote:
> > A niggling question remaining is when is it gonna be killed?
>
> It probably should be killed sooner rather than later.
>
> Just document that if you need something to run on a _particular_ cpu,
> you need to use "schedul
On Tue, Feb 9, 2016 at 7:31 AM, Mike Galbraith wrote:
> On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote:
>> >
>> > That 874bbfe6 should die.
>>
>> Yeah, it's gonna be killed. The commit is there because the behavior
>> change broke things. We don't want to guarantee it but have been and
>> ca
On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote:
> On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> >
> > > What are you suggesting?
> >
> > That 874bbfe6 should die.
>
> Yeah, it's gonna be killed. The commit is there
On Sun, 2016-02-07 at 06:19 +0100, Mike Galbraith wrote:
> On Sat, 2016-02-06 at 11:07 -0200, Henrique de Moraes Holschuh wrote:
> > On Fri, 05 Feb 2016, Tejun Heo wrote:
> > > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote
On Sat, 2016-02-06 at 11:07 -0200, Henrique de Moraes Holschuh wrote:
> On Fri, 05 Feb 2016, Tejun Heo wrote:
> > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> > >
> > > > What are you suggesting?
> > >
> > > That 874bb
On Fri, 05 Feb 2016, Tejun Heo wrote:
> On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> >
> > > What are you suggesting?
> >
> > That 874bbfe6 should die.
>
> Yeah, it's gonna be killed. The commit is there because the be
On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote:
> On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
>
> > What are you suggesting?
>
> That 874bbfe6 should die.
Yeah, it's gonna be killed. The commit is there because the behavior
change broke things. We don't want to guarante
On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote:
> What are you suggesting?
That 874bbfe6 should die.
-Mike
Hello, Mike.
On Fri, Feb 05, 2016 at 09:47:11PM +0100, Mike Galbraith wrote:
> That very point is what makes it wrong for the workqueue code to ever
> target a work item. The instant it does target selection, correctness
> may be at stake, it doesn't know, thus it must assume the full onus,
> whi
On Fri, 2016-02-05 at 11:49 -0500, Tejun Heo wrote:
> Hello, Mike.
>
> On Thu, Feb 04, 2016 at 03:00:17AM +0100, Mike Galbraith wrote:
> > Isn't it the case that, currently at least, each and every spot that
> > requires execution on a specific CPU yet does not take active measures
> > to deal wit
Hello, Mike.
On Thu, Feb 04, 2016 at 03:00:17AM +0100, Mike Galbraith wrote:
> Isn't it the case that, currently at least, each and every spot that
> requires execution on a specific CPU yet does not take active measures
> to deal with hotplug events is in fact buggy? The timer code clearly
> sta
On Fri, 2016-02-05 at 09:11 +0100, Daniel Bilik wrote:
> On Fri, 05 Feb 2016 03:40:46 +0100
> Mike Galbraith wrote:
> > IMHO you should restore the CC list and re-post. (If I were the
> > maintainer of either the workqueue code or 3.18-stable, I'd be highly
> > interested in this finding).
>
>
On Fri, 05 Feb 2016 03:40:46 +0100
Mike Galbraith wrote:
> On Thu, 2016-02-04 at 17:39 +0100, Daniel Bilik wrote:
> > On Thu, 4 Feb 2016 12:20:44 +0100
> > Jan Kara wrote:
> >
> > > Thanks for backport Thomas and to Mike for persistence :). I've
> > > asked my friend seeing crashes with 3.18.25
On Wed, 2016-02-03 at 11:24 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > > if (!(wq->flags & WQ_UNBOUND))
> > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu)
On Thu, 2016-02-04 at 17:39 +0100, Daniel Bilik wrote:
> On Thu, 4 Feb 2016 12:20:44 +0100
> Jan Kara wrote:
>
> > Thanks for backport Thomas and to Mike for persistence :). I've asked my
> > friend seeing crashes with 3.18.25 to try whether this patch fixes the
> > issues. It may take some time
On Thu, 4 Feb 2016 12:20:44 +0100
Jan Kara wrote:
> Thanks for backport Thomas and to Mike for persistence :). I've asked my
> friend seeing crashes with 3.18.25 to try whether this patch fixes the
> issues. It may take some time so stay tuned...
Patch tested and it really fixes the crash we wer
On Thu 04-02-16 11:46:47, Thomas Gleixner wrote:
> On Thu, 4 Feb 2016, Mike Galbraith wrote:
> > On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> > > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > > > Hm, so it's ok to queue work to an offline CPU? What happens if it
> >
On Thu, 2016-02-04 at 11:46 +0100, Thomas Gleixner wrote:
> On Thu, 4 Feb 2016, Mike Galbraith wrote:
> > I'm also wondering why 22b886dd only applies to kernels >= 4.2.
> >
> >
> > Regardless of the previous CPU a timer was on, add_timer_on()
> > currently simply sets timer->flags to the new CP
On Thu, 4 Feb 2016, Mike Galbraith wrote:
> On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > > Hm, so it's ok to queue work to an offline CPU? What happens if it
> > > doesn't come back for an eternity or two?
> >
> > Righ
On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU? What happens if it
> > doesn't come back for an eternity or two?
>
> Right now, it just loses affinity
WRT affinity...
So
On Thu 04-02-16 07:37:23, Michal Hocko wrote:
> On Wed 03-02-16 11:59:01, Tejun Heo wrote:
> > On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
> [...]
> > > anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> > > then we should at least preserve WORK_CPU_UNBOUND i
On Wed 03-02-16 11:59:01, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
[...]
> > anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> > then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
> > __queue_work can actually move
On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU? What happens if it
> > doesn't come back for an eternity or two?
>
> Right now, it just loses affinity. A more interesting cas
On Wed, Feb 03, 2016 at 08:05:57PM +0100, Thomas Gleixner wrote:
> > Well, you're in an unnecessary escalation mode as usual. Was the
> > attitude really necessary? Chill out and read the thread again.
> > Michal is saying the dwork->cpu assignment was bogus and I was
> > refuting that.
>
> Right
On Wed, 3 Feb 2016, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 07:46:11PM +0100, Thomas Gleixner wrote:
> > > > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> > > > schedule per-cpu work on the CPU
Hello, Thomas.
On Wed, Feb 03, 2016 at 07:46:11PM +0100, Thomas Gleixner wrote:
> > > So I think 874bbfe600a6 is really bogus. It should be reverted. We
> > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> > > schedule per-cpu work on the CPU we need it to run on"). This
On Wed, 3 Feb 2016, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > > if (!(wq->flags & WQ_UNBOUND))
> > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > > else
On Wed, Feb 03, 2016 at 06:13:15PM +0100, Mike Galbraith wrote:
> Ah, and the rest (the vast majority) can then be safely deflected away
> from nohz_full cpus.
Yeap, it should be possible to bounce the majority of work items across
CPUs all we want.
--
tejun
On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
> > So, the proper fix here is keeping cpu <-> node mapping stable across
> > cpu on/offlining which has been being worked on for a long time now.
> > The patchst is pending and it fixes other issues too.
>
> What if that node was memor
On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> Hm, so it's ok to queue work to an offline CPU? What happens if it
> doesn't come back for an eternity or two?
Right now, it just loses affinity. A more interesting case is a cpu
going offline while work items bound to the cpu are
On Wed 03-02-16 11:24:41, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > > The CPU was 168, and that one was offlined in the meantime. So
> > > __queue_work fails at:
> > > if (!(wq->flags & WQ_UNBOUND))
> > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > >
On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > The CPU was 168, and that one was offlined in the meantime. So
> > __queue_work fails at:
> > if (!(wq->flags & WQ_UNBOUND))
> > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > else
> > pwq = unbound_pwq_by_node(wq, cpu_to_node
[I wasn't aware of this email thread before so I am jumping in late]
On Wed 03-02-16 10:35:32, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> > On Tue, 26 Jan 2016, Petr Mladek wrote:
[...]
> >> The commit 874bbfe600a6 ("workqueue: make sure delayed work run in
> >> local cp
On Wed, 3 Feb 2016, Jiri Slaby wrote:
> On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> What happens in later kernels, when the cpu is offlined before the
> delayed_work timer ticks? In stable 3.12, with the patch, this scenario
> results in an oops:
> #5 [8c03fdd63d80] page_fault at ff
On 01/26/2016, 02:09 PM, Thomas Gleixner wrote:
> On Tue, 26 Jan 2016, Petr Mladek wrote:
>> On Tue 2016-01-26 10:34:00, Jan Kara wrote:
>>> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
On Sat, 23 Jan 2016, Ben Hutchings wrote:
> On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
On Tue, 26 Jan 2016, Petr Mladek wrote:
> On Tue 2016-01-26 10:34:00, Jan Kara wrote:
> > On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> > > On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > > > Looks like it requires more than trivial
On Tue 2016-01-26 10:34:00, Jan Kara wrote:
> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> > On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > > Looks like it requires more than trivial backport (I think). Tejun?
> > > >
> > > > The t
On Tue, 26 Jan 2016, Jan Kara wrote:
> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> > On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > > Looks like it requires more than trivial backport (I think). Tejun?
> > > >
> > > > The timer mi
On Sat 23-01-16 17:11:54, Thomas Gleixner wrote:
> On Sat, 23 Jan 2016, Ben Hutchings wrote:
> > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > > Looks like it requires more than trivial backport (I think). Tejun?
> > >
> > > The timer migration has changed quite a bit. Given that we'v
On Sat, 23 Jan 2016, Ben Hutchings wrote:
> On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> > > Looks like it requires more than trivial backport (I think). Tejun?
> >
> > The timer migration has changed quite a bit. Given that we've never
> > seen vmstat work crashing in 3.18 era, I wonder
On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote:
> (cc'ing Thomas)
>
> On Thu, Jan 21, 2016 at 08:10:20PM -0500, Sasha Levin wrote:
> > On 01/21/2016 04:52 AM, Jan Kara wrote:
> > > On Wed 20-01-16 13:39:01, Shaohua Li wrote:
> > > > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> >
(cc'ing Thomas)
On Thu, Jan 21, 2016 at 08:10:20PM -0500, Sasha Levin wrote:
> On 01/21/2016 04:52 AM, Jan Kara wrote:
> > On Wed 20-01-16 13:39:01, Shaohua Li wrote:
> >> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> >>> Hello,
> >>>
> >>> a friend of mine started seeing crashes wit
On 01/21/2016 04:52 AM, Jan Kara wrote:
> On Wed 20-01-16 13:39:01, Shaohua Li wrote:
>> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
>>> Hello,
>>>
>>> a friend of mine started seeing crashes with 3.18.25 kernel - once
>>> appropriate load is put on the machine it crashes within minut
On 01/21/2016 04:52 AM, Jan Kara wrote:
> On Wed 20-01-16 13:39:01, Shaohua Li wrote:
>> > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
>>> > > Hello,
>>> > >
>>> > > a friend of mine started seeing crashes with 3.18.25 kernel - once
>>> > > appropriate load is put on the machine it c
On Wed 20-01-16 13:39:01, Shaohua Li wrote:
> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> > Hello,
> >
> > a friend of mine started seeing crashes with 3.18.25 kernel - once
> > appropriate load is put on the machine it crashes within minutes. He
> > tracked down that reverting com
On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote:
> Hello,
>
> a friend of mine started seeing crashes with 3.18.25 kernel - once
> appropriate load is put on the machine it crashes within minutes. He
> tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
> Linus' tr
Hello,
a friend of mine started seeing crashes with 3.18.25 kernel - once
appropriate load is put on the machine it crashes within minutes. He
tracked down that reverting commit 874bbfe600a6 (this is the commit ID from
Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make
sur