On Wed, 21 Nov 2018, Michal Hocko wrote:
> On Mon 19-11-18 21:44:41, Hugh Dickins wrote:
> [...]
> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> >
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that
On Mon 19-11-18 21:44:41, Hugh Dickins wrote:
[...]
> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE
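For context: the patch's core change is to drop the caller's page reference before sleeping rather than after waking. A simplified sketch of the reworked migration wait path (based on the description above, not the patch text itself):

        page = migration_entry_to_page(entry);
        if (!get_page_unless_zero(page))
                goto out;
        pte_unmap_unlock(ptep, ptl);
        /*
         * put_and_wait_on_page_locked() drops the reference taken above
         * before going to sleep, so the sleeping waiter no longer holds
         * the extra reference that makes migration fail with -EAGAIN.
         */
        put_and_wait_on_page_locked(page);
        return;
out:
        pte_unmap_unlock(ptep, ptl);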
On Tue, 20 Nov 2018, Hugh Dickins wrote:
> On Tue, 20 Nov 2018, Vlastimil Babka wrote:
> > >
> > > finish_wait(q, wait);
> >
> > ... the code continues by:
> >
> >         if (thrashing) {
> >                 if (!PageSwapBacked(page))
> >
> > So maybe we should not set 'thrashing' true when
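The worry here: once the reference is dropped before sleeping, the page may be freed while we wait, so re-testing page flags such as PageSwapBacked() after finish_wait() is unsafe. One way out, sketched below, is to latch the flag before sleeping (an assumption about the shape of the fix, not necessarily the final patch):

        bool thrashing = false;
        bool delayacct = false;

        if (bit_nr == PG_locked && !PageUptodate(page) && PageWorkingset(page)) {
                if (!PageSwapBacked(page)) {
                        delayacct_thrashing_start();
                        delayacct = true;       /* latch the flag now */
                }
                psi_memstall_enter(&pflags);
                thrashing = true;
        }

        /* ... the wait loop, which may outlive our page reference ... */

        finish_wait(q, wait);

        if (thrashing) {
                if (delayacct)  /* do not re-test PageSwapBacked(page) here */
                        delayacct_thrashing_end();
                psi_memstall_leave(&pflags);
        }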
On Tue, 20 Nov 2018, Baoquan He wrote:
> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> > On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > >
> > > We have all assumed that it is essential to hold a page reference while
> > > wait
On Tue, 20 Nov 2018, Vlastimil Babka wrote:
> On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> >
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that there is
On 11/20/18 at 03:05pm, Michal Hocko wrote:
> > Yes, I applied Hugh's patch 8 hours ago, then our QE Ping operated on
> > that machine, after many times of hot removing/adding, the endless
> > looping during migration is not seen any more. The test result for
> > Hugh's patch is positive. I even s
On Tue 20-11-18 21:58:03, Baoquan He wrote:
> Hi,
>
> On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> > On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> > >
> > > We have all assumed that it is essential to hold a page reference wh
Hi,
On 11/20/18 at 02:38pm, Vlastimil Babka wrote:
> On 11/20/18 6:44 AM, Hugh Dickins wrote:
> > [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
> >
> > We have all assumed that it is essential to hold a page reference while
> > waiting on a page lock: partly to guarantee that t
On 11/20/18 6:44 AM, Hugh Dickins wrote:
> [PATCH] mm: put_and_wait_on_page_locked() while page is migrated
>
> We have all assumed that it is essential to hold a page reference while
> waiting on a page lock: partly to guarantee that there is still a struct
> page when MEMORY_HOTREMOVE is configu
On Tue, 20 Nov 2018, Baoquan He wrote:
> On 11/19/18 at 09:59pm, Michal Hocko wrote:
> > On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> > > I'm glad that I delayed: what I had then (migration_waitqueue instead
> > > of using page_waitqueue) was not wrong, but what I've been using the
> > > last co
On 11/19/18 at 09:59pm, Michal Hocko wrote:
> On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> > I'm glad that I delayed: what I had then (migration_waitqueue instead
> > of using page_waitqueue) was not wrong, but what I've been using the
> > last couple of months is rather better (and can be put t
On Mon 19-11-18 12:34:09, Hugh Dickins wrote:
> On Mon, 19 Nov 2018, Michal Hocko wrote:
> > On Mon 19-11-18 15:10:16, Michal Hocko wrote:
> > [...]
> > > In other words, why can't we do the following?
> >
> > Baoquan, this is certainly not the right fix but I would be really
> > curious whether
On Mon, 19 Nov 2018, Michal Hocko wrote:
> On Mon 19-11-18 15:10:16, Michal Hocko wrote:
> [...]
> > In other words, why can't we do the following?
>
> Baoquan, this is certainly not the right fix but I would be really
> curious whether it makes the problem go away.
>
> > diff --git a/mm/migrate
On Mon 19-11-18 15:10:16, Michal Hocko wrote:
[...]
> In other words, why can't we do the following?
Baoquan, this is certainly not the right fix but I would be really
curious whether it makes the problem go away.
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..7ccab29bcf9a 1006
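For orientation, the wait path being experimented on looked roughly like this before the change (simplified from __migration_entry_wait() of that era):

        page = migration_entry_to_page(entry);
        /*
         * Take a reference so the struct page cannot be freed under us,
         * then sleep until migration unlocks the page.  This reference
         * is exactly what the migration side trips over.
         */
        if (!get_page_unless_zero(page))
                goto out;
        pte_unmap_unlock(ptep, ptl);
        wait_on_page_locked(page);
        put_page(page);
        return;
out:
        pte_unmap_unlock(ptep, ptl);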
On Mon 19-11-18 17:48:35, Vlastimil Babka wrote:
> On 11/19/18 5:46 PM, Vlastimil Babka wrote:
> > On 11/19/18 5:46 PM, Michal Hocko wrote:
> >> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
> >>>
> >>> So what protects us from locking a page whose refcount dropped to zero
> >>> and is being fr
On 11/19/18 5:46 PM, Vlastimil Babka wrote:
> On 11/19/18 5:46 PM, Michal Hocko wrote:
>> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
>>>
>>> So what protects us from locking a page whose refcount dropped to zero
>>> and is being freed? The checks in the freeing path won't be happy about a
>>> st
On 11/19/18 5:46 PM, Michal Hocko wrote:
> On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
>>
>> So what protects us from locking a page whose refcount dropped to zero
>> and is being freed? The checks in the freeing path won't be happy about a
>> stray lock.
>
> Nothing really prevents that. But do
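Background for the "stray lock" concern: PG_locked is among the flags the page allocator refuses to see at free time, so locking a page that has meanwhile been freed shows up as a bad-page report. Roughly (the flag mask follows include/linux/page-flags.h; free_page_is_bad() is an illustrative stand-in for the real check in mm/page_alloc.c):

        #define PAGE_FLAGS_CHECK_AT_FREE                                \
                (1UL << PG_lru          | 1UL << PG_locked      |       \
                 1UL << PG_private      | 1UL << PG_private_2   |       \
                 1UL << PG_writeback    | 1UL << PG_reserved    |       \
                 1UL << PG_slab         | 1UL << PG_active      |       \
                 1UL << PG_unevictable  | __PG_MLOCKED)

        /* illustrative stand-in: a freed page must have none of these set */
        static inline bool free_page_is_bad(struct page *page)
        {
                return unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE);
        }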
On Mon 19-11-18 17:36:21, Vlastimil Babka wrote:
> On 11/19/18 3:10 PM, Michal Hocko wrote:
> > On Mon 19-11-18 13:51:21, Michal Hocko wrote:
> >> On Mon 19-11-18 13:40:33, Michal Hocko wrote:
> >>> How are
> >>> we supposed to converge when the swapin code waits for the migration to
> >>> finish w
On 11/19/18 3:10 PM, Michal Hocko wrote:
> On Mon 19-11-18 13:51:21, Michal Hocko wrote:
>> On Mon 19-11-18 13:40:33, Michal Hocko wrote:
>>> How are
>>> we supposed to converge when the swapin code waits for the migration to
>>> finish with the reference count elevated?
Indeed this looks wrong. H
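The non-convergence comes from the reference count check on the migration side; roughly (simplified from migrate_page_move_mapping()):

        /*
         * Migration only proceeds when the count matches what the
         * mapping itself accounts for.  A swapin waiter sleeping with
         * an extra reference makes this fail with -EAGAIN on every
         * pass, so neither side can make progress.
         */
        expected_count += hpage_nr_pages(page) + page_has_private(page);
        if (page_count(page) != expected_count)
                return -EAGAIN;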
On Mon 19-11-18 13:51:21, Michal Hocko wrote:
> On Mon 19-11-18 13:40:33, Michal Hocko wrote:
> > On Mon 19-11-18 18:52:02, Baoquan He wrote:
> > [...]
> >
> > There are a few stacks directly in the offline path but those should be
> > OK.
> > The real culprit seems to be the swap-in code
> >
> > >
On Mon 19-11-18 13:40:33, Michal Hocko wrote:
> On Mon 19-11-18 18:52:02, Baoquan He wrote:
> [...]
>
> There are a few stacks directly in the offline path but those should be
> OK.
> The real culprit seems to be the swap-in code
>
> > [ +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G
On Mon 19-11-18 18:52:02, Baoquan He wrote:
[...]
There are a few stacks directly in the offline path but those should be
OK.
The real culprit seems to be the swap-in code
> [ +1.734416] CPU: 255 PID: 5558 Comm: stress Tainted: G L
> 4.20.0-rc2+ #7
> [ +0.007927] Hardware name: 9
On 11/16/18 at 10:14am, Michal Hocko wrote:
> Could you try to apply this debugging patch on top please? It will dump
> a stack trace for each reference count elevation for the one page that fails
> to migrate after multiple passes.
Thanks, applied and fixed two code issues. The dmesg has been sent to
y
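The debugging patch itself is not shown in this thread excerpt; a minimal sketch of the idea (hypothetical names, not Michal's actual patch) would be a hook in the reference count path that dumps a stack for one page under observation:

        /* hypothetical debug aid, not the actual patch */
        static struct page *debug_page; /* the page failing to migrate */

        static inline void debug_page_ref(struct page *page)
        {
                if (unlikely(page == READ_ONCE(debug_page)))
                        dump_stack();   /* who is bumping the count? */
        }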
On Fri 16-11-18 09:24:33, Baoquan He wrote:
> On 11/15/18 at 03:32pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > >
On 11/15/18 at 03:32pm, Michal Hocko wrote:
> On Thu 15-11-18 21:38:40, Baoquan He wrote:
> > On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > > [...]
> > > > > It would also be good to find out whe
On Thu 15-11-18 21:38:40, Baoquan He wrote:
> On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > [...]
> > > > It would also be good to find out whether this is fs-specific. E.g. does
> > > > it make any
On Thu 15-11-18 21:23:42, Baoquan He wrote:
> On 11/15/18 at 02:19pm, Michal Hocko wrote:
> > On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > > On 11/15/18 at 09:30am, Michal Hocko wrote:
> > [...]
> > > > It would also be good to find out whether this is fs-specific. E.g. does
> > > > it make any
On 11/15/18 at 02:19pm, Michal Hocko wrote:
> On Thu 15-11-18 21:12:11, Baoquan He wrote:
> > On 11/15/18 at 09:30am, Michal Hocko wrote:
> [...]
> > > It would also be good to find out whether this is fs-specific. E.g. does
> > > it make any difference if you use a different one for your stress
>
On Thu 15-11-18 21:12:11, Baoquan He wrote:
> On 11/15/18 at 09:30am, Michal Hocko wrote:
[...]
> > It would also be good to find out whether this is fs-specific. E.g. does
> > it make any difference if you use a different one for your stress
> > testing?
>
> Created a ramdisk and put stress bin t
On 11/15/18 at 09:30am, Michal Hocko wrote:
> On Thu 15-11-18 15:53:56, Baoquan He wrote:
> > On 11/15/18 at 08:30am, Michal Hocko wrote:
> > > On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > > > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > >
On 15.11.18 10:52, Baoquan He wrote:
> On 11/15/18 at 10:42am, David Hildenbrand wrote:
>> I am wondering why it is always the last memory block of that device
>> (and even that node). Coincidence?
>
> I remember one or two times it was the last 6G or 4G which stalled there,
> the size of a memory block
On 11/15/18 at 10:42am, David Hildenbrand wrote:
> I am wondering why it is always the last memory block of that device
> (and even that node). Coincidence?
I remember one or two times it was the last 6G or 4G which stalled there,
the size of a memory block is 2G. But most of the time it's the last memory
b
On 15.11.18 09:30, Michal Hocko wrote:
> On Thu 15-11-18 15:53:56, Baoquan He wrote:
>> On 11/15/18 at 08:30am, Michal Hocko wrote:
>>> On Thu 15-11-18 13:10:34, Baoquan He wrote:
>>>> On 11/14/18 at 04:00pm, Michal Hocko wrote:
>>>>> On Wed 14-11-18 22:52:50, Baoquan He wrote:
>>>>>> On 11/14/18 a
On Thu 15-11-18 15:53:56, Baoquan He wrote:
> On 11/15/18 at 08:30am, Michal Hocko wrote:
> > On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > >
On 11/15/18 at 08:30am, Michal Hocko wrote:
> On Thu 15-11-18 13:10:34, Baoquan He wrote:
> > On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > > I have seen an issue when the migration cannot ma
On Thu 15-11-18 13:10:34, Baoquan He wrote:
> On 11/14/18 at 04:00pm, Michal Hocko wrote:
> > On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > > I have seen an issue where the migration cannot make forward progress
> > > > because of a glibc page
On 11/14/18 at 04:00pm, Michal Hocko wrote:
> On Wed 14-11-18 22:52:50, Baoquan He wrote:
> > On 11/14/18 at 10:01am, Michal Hocko wrote:
> > > I have seen an issue where the migration cannot make forward progress
> > > because of a glibc page with a reference count bumping up and down. Most
> > >
On Wed 14-11-18 22:52:50, Baoquan He wrote:
> On 11/14/18 at 10:01am, Michal Hocko wrote:
> > I have seen an issue where the migration cannot make forward progress
> > because of a glibc page with a reference count bumping up and down. Most
> > probable explanation is the faultaround code. I am wo
On 11/14/18 at 10:01am, Michal Hocko wrote:
> I have seen an issue where the migration cannot make forward progress
> because of a glibc page with a reference count bumping up and down. Most
> probable explanation is the faultaround code. I am working on this and
> will post a patch soon. In any c
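The fault-around connection: filemap_map_pages() takes a short-lived speculative reference on every page it maps in, and under constant faulting those transient references keep perturbing the count that migration expects to be stable. Roughly (simplified from mm/filemap.c of that era):

        /*
         * Fault-around: each candidate page is pinned speculatively for
         * the duration of the mapping attempt and released afterwards,
         * bumping the reference count up and down as described above.
         */
        if (!page_cache_get_speculative(page))
                goto next;

        /* has the page been truncated or replaced meanwhile? */
        if (unlikely(page != xas_reload(&xas)))
                goto skip;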
On Wed 14-11-18 10:48:09, David Hildenbrand wrote:
> On 14.11.18 10:41, Michal Hocko wrote:
> > On Wed 14-11-18 10:25:57, David Hildenbrand wrote:
> >> On 14.11.18 10:00, Baoquan He wrote:
> >>> Hi David,
> >>>
> >>> On 11/14/18 at 09:18am, David Hildenbrand wrote:
> >>>> Code seems to be waiting f
[Cc Vladimir]
On Wed 14-11-18 15:09:09, Baoquan He wrote:
> Hi,
>
> Tested memory hotplug on a bare metal system; hot removing always
> triggers a lockup. Usually it takes several hot plug/unplug cycles, then
> the hot removing will hang there at the last block. This is with memory pressure
> added by exe
On 14.11.18 10:41, Michal Hocko wrote:
> On Wed 14-11-18 10:25:57, David Hildenbrand wrote:
>> On 14.11.18 10:00, Baoquan He wrote:
>>> Hi David,
>>>
>>> On 11/14/18 at 09:18am, David Hildenbrand wrote:
>>>> Code seems to be waiting for the mem_hotplug_lock in read.
>>>> We hold mem_hotplug_lock in
On Wed 14-11-18 10:25:57, David Hildenbrand wrote:
> On 14.11.18 10:00, Baoquan He wrote:
> > Hi David,
> >
> > On 11/14/18 at 09:18am, David Hildenbrand wrote:
> >> Code seems to be waiting for the mem_hotplug_lock in read.
> >> We hold mem_hotplug_lock in write whenever we online/offline/add/rem
On Wed 14-11-18 10:22:31, David Hildenbrand wrote:
> >>
> >> The real question is, however, why offlining of the last block doesn't
> >> succeed. In __offline_pages() we basically have an endless loop (while
> >> holding the mem_hotplug_lock in write). Now I consider this piece of
> >> code very pr
>>> Failing on ENOMEM is a questionable thing. I haven't seen that happening
>>> wildly but if that is the case then I wouldn't be opposed.
>>>
You mentioned memory pressure: if our host is under memory pressure we
can easily end up in an endless loop there, because we
basical
On 14.11.18 10:00, Baoquan He wrote:
> Hi David,
>
> On 11/14/18 at 09:18am, David Hildenbrand wrote:
>> Code seems to be waiting for the mem_hotplug_lock in read.
>> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
>> memory. There are two ways to trigger offlining of memor
>>
>> The real question is, however, why offlining of the last block doesn't
>> succeed. In __offline_pages() we basically have an endless loop (while
>> holding the mem_hotplug_lock in write). Now I consider this piece of
>> code very problematic (we should automatically fail after X
>> attempts/a
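For reference, the loop in question has roughly this shape (heavily simplified from __offline_pages(); apart from a pending signal there is no bail-out, it retries migration forever):

repeat:
        ret = -EINTR;
        if (signal_pending(current))
                goto failed_removal;

        cond_resched();
        lru_add_drain_all();
        drain_all_pages(zone);

        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) {
                /* pages left: migrate them and try the whole thing again */
                ret = do_migrate_range(pfn, end_pfn);
                goto repeat;
        }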
On Wed 14-11-18 09:18:09, David Hildenbrand wrote:
> On 14.11.18 08:09, Baoquan He wrote:
> > Hi,
> >
> > Tested memory hotplug on a bare metal system; hot removing always
> > triggers a lockup. Usually it takes several hot plug/unplug cycles, then
> > the hot removing will hang there at the last block. S
Hi David,
On 11/14/18 at 09:18am, David Hildenbrand wrote:
> Code seems to be waiting for the mem_hotplug_lock in read.
> We hold mem_hotplug_lock in write whenever we online/offline/add/remove
> memory. There are two ways to trigger offlining of memory:
>
> 1. Offlining via "cat offline > /sys/d
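For context, mem_hotplug_lock is a percpu rwsem; the write side brackets every online/offline/add/remove operation while readers only need the memory layout to stay stable. Roughly (as in mm/memory_hotplug.c of that era):

        void mem_hotplug_begin(void)    /* writers: hotplug paths */
        {
                cpus_read_lock();
                percpu_down_write(&mem_hotplug_lock);
        }

        void mem_hotplug_done(void)
        {
                percpu_up_write(&mem_hotplug_lock);
                cpus_read_unlock();
        }

        void get_online_mems(void)      /* readers */
        {
                percpu_down_read(&mem_hotplug_lock);
        }

        void put_online_mems(void)
        {
                percpu_up_read(&mem_hotplug_lock);
        }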
On 14.11.18 08:09, Baoquan He wrote:
> Hi,
>
> Tested memory hotplug on a bare metal system; hot removing always
> triggers a lockup. Usually it takes several hot plug/unplug cycles, then
> the hot removing will hang there at the last block. This is with memory
> pressure added by executing "stress -m 200"