On Thu, Sep 14, 2017 at 9:50 AM, Tim Chen wrote:
>
> Kan tested this before, so it should still be good.
> I checked that it applied cleanly on latest master.
Thanks, applied.
I really hope we end up fixing the migration thing too, but at least
4.14 will have the mitigation for the long wait queues.
On 09/13/2017 07:27 PM, Linus Torvalds wrote:
> On Wed, Sep 13, 2017 at 7:12 PM, Tim Chen wrote:
>>
>> BTW, will you be merging these 2 patches in 4.14?
>
> Yes, and thanks for reminding me.
>
> In fact, would you mind sending me the latest versions, rather than me
> digging them out of the disaster area that is my mailbox and possibly
> picking an older version?
On Wed, 13 Sep 2017, Tim Chen wrote:
> Here's what the customer thinks happened and is willing to tell us.
> They have a parent process that spawns off 10 children per core and
> kicks them off to run. The child processes all access a common library.
> We have 384 cores, so 3840 child processes are running
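A rough, hypothetical approximation of the workload described above might look like the sketch below: fork ten children per online CPU and have every child repeatedly touch pages of one shared file-backed mapping standing in for the common library. The file name, mapping size, and loop structure are illustrative guesses, not details from the customer's actual test.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILDREN_PER_CPU 10
#define MAP_SIZE (64UL << 20)		/* 64 MB shared region */

/* Each child loops forever, touching every page of the shared mapping. */
static void child_loop(volatile char *mem, size_t len)
{
	for (;;)
		for (size_t off = 0; off < len; off += 4096)
			(void)mem[off];
}

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);
	long nchildren = cpus * CHILDREN_PER_CPU;	/* e.g. 384 * 10 = 3840 */
	int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);

	if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0) {
		perror("shared.dat");
		return 1;
	}

	char *mem = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Spawn 10 children per core; all of them hammer the same file pages. */
	for (long i = 0; i < nchildren; i++)
		if (fork() == 0)
			child_loop(mem, MAP_SIZE);	/* never returns */

	/* Parent just waits; kill the process group to stop the test. */
	while (wait(NULL) > 0)
		;
	return 0;
}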
On Wed, Sep 13, 2017 at 7:12 PM, Tim Chen wrote:
>
> BTW, will you be merging these 2 patches in 4.14?
Yes, and thanks for reminding me.
In fact, would you mind sending me the latest versions, rather than me
digging them out of the disaster area that is my mailbox and possibly
> picking an older version?
On 08/29/2017 09:24 AM, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 9:13 AM, Tim Chen wrote:
>>
>> It is not affecting a production use, but the customer's acceptance
>> test for their systems. So I suspect it is a stress test.
>
> Can you gently poke them and ask if they might make their stress test
> code available?
On Tue, Aug 29, 2017 at 9:13 AM, Tim Chen wrote:
>
> It is not affecting a production use, but the customer's acceptance
> test for their systems. So I suspect it is a stress test.
Can you gently poke them and ask if they might make their stress test
code available?
Tell them that we have a fix
On Tue, Aug 29, 2017 at 9:17 AM, Tim Chen wrote:
>
> BTW, are you going to add the chunk below separately as part of your
> wait queue cleanup patch?
I did.
Commit 9c3a815f471a ("page waitqueue: always add new entries at the end")
Linus
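As a minimal illustration of what "always add new entries at the end" buys the wake-up walk, the user-space sketch below appends every new waiter at the tail of a doubly linked list, so a scan from the head sees waiters in arrival order and late arrivals land beyond any bookmark taken mid-walk. The structures and names are simplified stand-ins, not the kernel's wait-queue code.

#include <stdio.h>

struct waiter {
	int id;
	struct waiter *next, *prev;
};

/* Sentinel head of the (fake) per-page wait list. */
static struct waiter wait_list = { .next = &wait_list, .prev = &wait_list };

/* Always append at the tail -- the ordering the commit title describes. */
static void add_waiter_tail(struct waiter *w)
{
	w->prev = wait_list.prev;
	w->next = &wait_list;
	wait_list.prev->next = w;
	wait_list.prev = w;
}

int main(void)
{
	struct waiter w[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };

	for (int i = 0; i < 3; i++)
		add_waiter_tail(&w[i]);

	/* A wake-up walk from the head sees waiters in arrival order (1, 2, 3);
	 * anyone arriving later lands at the tail, beyond any bookmark, so a
	 * resumed walk still reaches them. */
	for (struct waiter *p = wait_list.next; p != &wait_list; p = p->next)
		printf("wake waiter %d\n", p->id);
	return 0;
}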
On 08/28/2017 09:48 AM, Linus Torvalds wrote:
> On Mon, Aug 28, 2017 at 7:51 AM, Liang, Kan wrote:
>>
>> I tried this patch and https://lkml.org/lkml/2017/8/27/222 together.
>> But they don't fix the issue. I can still get a similar call stack.
>
> So the main issue was that I *really* hated Tim's patch #2, and the
> patch to clean up the page wait queue
On 08/29/2017 09:01 AM, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 5:57 AM, Liang, Kan wrote:
>>>
>>> Attached is an ALMOST COMPLETELY UNTESTED forward-port of those two
>>> patches, now without that nasty WQ_FLAG_ARRIVALS logic, because we now
>>> always put the new entries at the end of the waitqueue.
On Tue, Aug 29, 2017 at 5:57 AM, Liang, Kan wrote:
>>
>> Attached is an ALMOST COMPLETELY UNTESTED forward-port of those two
>> patches, now without that nasty WQ_FLAG_ARRIVALS logic, because we now
>> always put the new entries at the end of the waitqueue.
>
> The patches fix the long wait issue.
> On Mon, Aug 28, 2017 at 7:51 AM, Liang, Kan wrote:
> >
> > I tried this patch and https://lkml.org/lkml/2017/8/27/222 together.
> > But they don't fix the issue. I can still get a similar call stack.
>
> So the main issue was that I *really* hated Tim's patch #2, and the patch to
> clean up the page wait queue
On Mon, Aug 28, 2017 at 7:51 AM, Liang, Kan wrote:
>
> I tried this patch and https://lkml.org/lkml/2017/8/27/222 together.
> But they don't fix the issue. I can still get a similar call stack.
So the main issue was that I *really* hated Tim's patch #2, and the
patch to clean up the page wait queue
> On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds wrote:
> >
> > Simplify, simplify, simplify.
>
> I've now tried three different approaches (I stopped sending broken patches
> after deciding the first one was unfixable), and they all suck.
>
> I was hoping for something lock
On Sun, 27 Aug 2017 22:17:55 -0700
Linus Torvalds wrote:
> On Sun, Aug 27, 2017 at 6:29 PM, Nicholas Piggin wrote:
> >
> > BTW. since you are looking at this stuff, one other small problem I remember
> > with exclusive waiters is that losing to a concurrent locker puts them to
> > the back of the queue.
On Sun, Aug 27, 2017 at 6:29 PM, Nicholas Piggin wrote:
>
> BTW. since you are looking at this stuff, one other small problem I remember
> with exclusive waiters is that losing to a concurrent locker puts them to
> the back of the queue. I think that could be fixed with some small change to
> the
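A simplified, single-threaded user-space sketch of the behavior described above follows. The structures and the trylock/requeue steps are illustrative assumptions rather than the actual mm/filemap.c code; the point is just that an exclusive waiter which loses the page lock to a concurrent locker re-queues itself at the tail, ending up behind waiters that arrived after it.

#include <stdatomic.h>
#include <stdio.h>

struct waiter {
	int id;
	struct waiter *next, *prev;
};

static struct waiter queue = { .next = &queue, .prev = &queue };
static atomic_flag page_locked = ATOMIC_FLAG_INIT;	/* stand-in for the page lock bit */

static void enqueue_tail(struct waiter *w)
{
	w->prev = queue.prev;
	w->next = &queue;
	queue.prev->next = w;
	queue.prev = w;
}

static void dequeue(struct waiter *w)
{
	w->prev->next = w->next;
	w->next->prev = w->prev;
}

int main(void)
{
	struct waiter a = { .id = 1 }, b = { .id = 2 };

	enqueue_tail(&a);		/* waiter 1 arrives first */
	enqueue_tail(&b);		/* waiter 2 arrives later */

	/* unlock_page(): the first exclusive waiter is removed and woken. */
	struct waiter *woken = queue.next;
	dequeue(woken);

	/* A concurrent locker slips in and grabs the lock first ... */
	atomic_flag_test_and_set(&page_locked);

	/* ... so the woken waiter loses the trylock and re-queues itself,
	 * and because new entries go at the tail it is now behind waiter 2. */
	if (atomic_flag_test_and_set(&page_locked))
		enqueue_tail(woken);

	for (struct waiter *p = queue.next; p != &queue; p = p->next)
		printf("queued waiter %d\n", p->id);	/* prints 2, then 1 */
	return 0;
}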
On Mon, 28 Aug 2017 11:16:48 +1000
Nicholas Piggin wrote:
> On Sun, 27 Aug 2017 16:12:19 -0700
> Linus Torvalds wrote:
>
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index baba290c276b..0b41c8cbeabc 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -986,10 +98
On Sun, 27 Aug 2017 16:12:19 -0700
Linus Torvalds wrote:
> On Sun, Aug 27, 2017 at 2:40 PM, Linus Torvalds wrote:
> >
> > The race goes like this:
> >
> >   thread1          thread2          thread3
> >
> >   .. CPU1 ...
> >   __lock_page_killable
> >     w
On Sun, Aug 27, 2017 at 2:40 PM, Linus Torvalds wrote:
>
> The race goes like this:
>
>   thread1          thread2          thread3
>
>   .. CPU1 ...
>   __lock_page_killable
>     wait_on_page_bit_common()
>       get wq lock
>       __add_wait_queue_entry_tail_
On Sun, Aug 27, 2017 at 2:40 PM, Linus Torvalds wrote:
>
> End result: page is unlocked, CPU3 is waiting, nothing will wake CPU3 up.
Not CPU3. CPU3 was the waker. It's thread 2 that is waiting and never
got woken up, of course.
Other than that, the scenario still looks real to me.
Ideas?
On Sat, Aug 26, 2017 at 11:15 AM, Linus Torvalds wrote:
>
> So how about just this fairly trivial patch?
So I just committed that trivial patch, because I think it's right,
but more importantly because I think I found a real and non-trivial
fundamental problem.
The reason I found it is actually
On Fri, Aug 25, 2017 at 7:54 PM, Linus Torvalds wrote:
>
> Simplify, simplify, simplify.
I've now tried three different approaches (I stopped sending broken
patches after deciding the first one was unfixable), and they all
suck.
I was hoping for something lockless to avoid the whole issue with
l
On Fri, Aug 25, 2017 at 5:31 PM, Linus Torvalds wrote:
>
> It made it way more fragile and complicated, having to rewrite things
> so carefully. A simple slab cache would likely be a lot cleaner and
> simpler.
It also turns out that despite all the interfaces, we only really ever
wait on two diff
On Fri, Aug 25, 2017 at 4:03 PM, Linus Torvalds wrote:
>
> Let this be a lesson in just *how* little tested, and *how* crap that
> patch probably still is.
I haven't had time to look at it any more (trying to merge the pull
requests that came in today instead), but the more I think about it,
the
On Fri, Aug 25, 2017 at 3:51 PM, Linus Torvalds wrote:
>
> So take it as that: example pseudo-code that happens to pass a
> compiler, but is meant as a RFD rather than actually working.
Oh, and after I sent it out, I wanted to look once again, and realized
that the "remove_myself_from()" function
On Fri, Aug 25, 2017 at 3:19 PM, Tim Chen wrote:
>
> Also I think patch 1 is still a good idea as a fail-safe mechanism
> in case there are other long wait lists.
Yeah, I don't hate patch #1.
But that other patch is just nasty.
> That said, I do think your suggested approach is cleaner. However
On 08/25/2017 12:58 PM, Linus Torvalds wrote:
> On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen wrote:
>> Now that we have added breaks in the wait queue scan and allow a bookmark
>> on the scan position, we put this logic in the wake_up_page_bit function.
>
> Oh, _this_ is the other patch you were talking about.
On Fri, Aug 25, 2017 at 9:13 AM, Tim Chen wrote:
> Now that we have added breaks in the wait queue scan and allow a bookmark
> on the scan position, we put this logic in the wake_up_page_bit function.
Oh, _this_ is the other patch you were talking about. I thought it was
the NUMA counter threshold that
Now that we have added breaks in the wait queue scan and allow a bookmark
on the scan position, we put this logic in the wake_up_page_bit function.
We can have a very long page wait list in a large system where multiple
pages share the same wait list. We break the wake-up walk here to allow
other CPUs a chance
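A simplified user-space sketch of the break-and-bookmark idea follows. A pthread mutex stands in for the wait-queue spinlock, the list structures are made up for the example, and the batch size of 64 is an assumed value; the real logic lives in the kernel's wake-up path rather than in anything like this.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define WALK_BREAK_CNT 64	/* assumed batch size, for illustration only */

struct waiter {
	bool is_bookmark;
	struct waiter *next, *prev;
	void (*wake)(struct waiter *w);
};

static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;	/* stands in for the wq spinlock */
static struct waiter wq_head = { .next = &wq_head, .prev = &wq_head };

static void insert_after(struct waiter *pos, struct waiter *w)
{
	w->next = pos->next;
	w->prev = pos;
	pos->next->prev = w;
	pos->next = w;
}

static void remove_entry(struct waiter *w)
{
	w->prev->next = w->next;
	w->next->prev = w->prev;
}

/* Wake every waiter, but drop the lock after each batch of WALK_BREAK_CNT
 * entries, leaving a bookmark so the walk can resume where it stopped. */
static void wake_up_all_with_bookmark(void)
{
	struct waiter bookmark = { .is_bookmark = true };
	struct waiter *pos;

	pthread_mutex_lock(&wq_lock);
	pos = wq_head.next;
	for (;;) {
		int batch = 0;

		while (pos != &wq_head && batch < WALK_BREAK_CNT) {
			struct waiter *next = pos->next;

			if (!pos->is_bookmark) {	/* skip bookmark entries */
				pos->wake(pos);
				batch++;
			}
			pos = next;
		}
		if (pos == &wq_head)
			break;

		/* Park the bookmark just before the next unprocessed entry,
		 * release the lock so other CPUs get a chance, then resume. */
		insert_after(pos->prev, &bookmark);
		pthread_mutex_unlock(&wq_lock);
		pthread_mutex_lock(&wq_lock);
		pos = bookmark.next;
		remove_entry(&bookmark);
	}
	pthread_mutex_unlock(&wq_lock);
}

static void print_wake(struct waiter *w)
{
	printf("woke waiter %p\n", (void *)w);
}

int main(void)
{
	static struct waiter w[200];	/* long wait list: more than 3 batches of 64 */

	for (int i = 0; i < 200; i++) {
		w[i].wake = print_wake;
		insert_after(wq_head.prev, &w[i]);	/* new waiters go at the tail */
	}
	wake_up_all_with_bookmark();
	return 0;
}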