On Mon, Nov 23, 2020 at 4:01 AM Michal Hocko <mho...@suse.com> wrote:
>
> On Fri 20-11-20 15:27:46, Pavel Tatashin wrote:
> > Recently, I encountered a hang that is happening during memory hot
> > remove operation. It turns out that the hang is caused by pinned user
> > pages in ZONE_MOVABLE.
> >
> > Kernel expects that all pages in ZONE_MOVABLE can be migrated, but
> > this is not the case if a user application, for example through the
> > DPDK libraries, has pinned them via a VFIO DMA map.
>
> Long term or effectively time unbound pinning on zone movable is
> fundamentally broken. The sole reason for ZONE_MOVABLE's existence is to
> guarantee migratability. If the consumer of this memory cannot guarantee
> that then it shouldn't use __GFP_MOVABLE in the first place.

Exactly, this is what I am trying to solve, and started this thread to
figure out what is the best approach to address this problem.

>
> > Kernel keeps trying to
> > hot-remove them, but refcnt never gets to zero, so we are looping
> > until the hardware watchdog kicks in.
>
> Yeah, the existing offlining behavior doesn't stop trying because the
> current implementation of the migration cannot tell the difference between
> short and long term failures. Maybe the recent ref count for long term
> pinning can be used to help out there.
>
> Anyway, I am wondering what you mean by watchdog firing. The
> operation should not trigger any of the soft lockup, hard lockup, or
> hung task detectors.

You are right, hot-remove is a killable operation. In our case,
however, systemd stops petting the watchdog during kexec reboot to
ensure that the reboot finishes. Because we hot-remove memory during
shutdown, and the kernel is unable to hot-remove the memory within 60s,
we get a watchdog reset.

>
> > We cannot do DMA unmaps before hot-remove, because hot-remove is a
> > slow operation, and we have thousands of network flows handled by
> > DPDK that we just cannot suspend for the duration of the hot-remove
> > operation.
> >
> > The solution is for DPDK to allocate pages from a zone below
> > ZONE_MOVABLE, i.e. ZONE_NORMAL/ZONE_HIGHMEM, but this is not possible.
> > There is no user interface that we have that allows applications to
> > select what zone the memory should come from.
>
> Our existing interface is __GFP_MOVABLE. It is a responsibility of the
> driver to know whether the resulting memory is migratable. Users
> shouldn't even have to think about that.

Sure, so let's migrate, and fault memory from drivers when long-term
pinning, which is items 1 and 2 in my proposal.

> > I've spoken with Stephen Hemminger, and he said that DPDK is moving in
> > the direction of using transparent huge pages instead of HugeTLBs,
> > which means that we need to allow at least anonymous, and anonymous
> > transparent huge pages to come from non-movable zones on demand.
>
> You can migrate before pinning.

Yes.

>
> > Here is what I am proposing:
> > 1. Add a new flag that is passed through pin_user_pages_* down to
> > fault handlers, and allow the fault handler to allocate from a
> > non-movable zone.
>
> gup already tries to deal with long term pins on CMA regions and migrate
> to a non CMA region. Have a look at __gup_longterm_locked. Migrating off
> the movable zone sounds like a reasonable solution to me.

Yes, CMA is doing something similar, but it is migrating before
pinning from CMA to movable zone to avoid fragmentation of CMA. What
we need to do is migrate before pinning to a non-movable zone for all
pages.

>
> > 2. Add an internal move_pages_zone() similar to move_pages() syscall
> > but instead of migrating to a different NUMA node, migrate pages from
> > ZONE_MOVABLE to another zone.
> > Call move_pages_zone() on demand prior to pinning pages from
> > vfio_pin_map_dma() for instance.
>
> Why is the existing migration API insufficient?

Here I am talking about the internal implementation, not the user API. We do
not have a function that migrates pages in a user address space from
one zone to another zone. We only have a function that is exposed as a
syscall that migrates pages from one node to another node.
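
To make the intended ordering concrete, here is a minimal userspace toy
model, not kernel code. The toy_page and toy_zone types and the
pin_longterm() helper are made up for illustration, and migration is
modeled as nothing more than changing a zone field. The only point is the
order of operations: move pages out of the movable zone first, then take
the long-term pin, and refuse to migrate anything that is already pinned.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these are NOT the kernel's zone enum or struct page. */
enum toy_zone { TOY_ZONE_NORMAL, TOY_ZONE_MOVABLE };

struct toy_page {
    enum toy_zone zone;  /* zone the page currently lives in */
    int pincount;        /* number of long-term pins held on the page */
};

/*
 * Hypothetical helper mirroring the proposed move_pages_zone(): move every
 * page that is not already in `target` into `target`. Returns the number of
 * pages migrated, or -1 if a page that has to move is pinned, because a
 * pinned page cannot be migrated.
 */
int move_pages_zone(struct toy_page *pages, size_t n, enum toy_zone target)
{
    int moved = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pages[i].zone == target)
            continue;
        if (pages[i].pincount > 0)
            return -1;
        pages[i].zone = target;  /* "migration" in this model */
        moved++;
    }
    return moved;
}

/*
 * Hypothetical long-term pin path: migrate out of the movable zone first,
 * and only then take the pins. Returns 0 on success, -1 on failure.
 */
int pin_longterm(struct toy_page *pages, size_t n)
{
    if (move_pages_zone(pages, n, TOY_ZONE_NORMAL) < 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        pages[i].pincount++;
    return 0;
}
```

In this model, once pin_longterm() has succeeded, a later attempt to move
the same pages back with move_pages_zone(pages, n, TOY_ZONE_MOVABLE) fails
with -1, which is the property hot-remove relies on: nothing pinned is
left in the movable zone.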

>
> > 3. Perhaps, it also makes sense to add madvise() flag, to allocate
> > pages from non-movable zone. When a user application knows that it
> > will do DMA mapping, and pin pages for a long time, the memory that it
> > allocates should never be migrated or hot-removed, so make sure that
> > it comes from the appropriate place.
> > The benefit of adding madvise() flag is that we won't have to deal
> > with slow page migration during pin time, but the disadvantage is that
> > we would need to change the user interface.
>
> No, ZONE_MOVABLE, like other zone types, is an internal implementation
> detail of the MM. I do not think we want to expose that to userspace
> and carve it in stone.

What I mean here is allowing users to guarantee that the page's PA is
going to stay the same, a sort of stronger mlock(). mlock() only
guarantees that the page is not swapped out, but something like
MADV_PINNED would guarantee that the page is neither swapped out nor
migrated. If a user determines the PA of that page, that PA
is going to stay the same throughout the life of the page. This does not
expose the internal implementation in any way; the guarantee could be
honored in various ways, e.g. by pinning or by allocating from ZONE_NORMAL.
The fact that we would honor it by allocating memory from ZONE_NORMAL
is an implementation detail that would not be exposed to the user.
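
For reference, one way a user can determine the PA of a page today is
through /proc/self/pagemap. The sketch below is only an illustration and
is not taken from DPDK; the function names are mine. It assumes the
documented pagemap entry layout (bits 0-54 hold the PFN, bit 63 is the
"present" bit), and note that the PFN field reads as zero without
CAP_SYS_ADMIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: page frame number */
#define PM_PRESENT  (1ULL << 63)        /* bit 63: page is present in RAM */

/* Decode one 64-bit pagemap entry. Returns 0 and stores the PFN in *pfn,
 * or -1 if the page is not present. */
int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
    assert(pfn != NULL);
    if (!(entry & PM_PRESENT))
        return -1;
    *pfn = entry & PM_PFN_MASK;
    return 0;
}

/* Look up the PFN backing `vaddr` in the calling process.
 * Returns 0 on success, -1 on any failure. */
int virt_to_pfn(const void *vaddr, uint64_t *pfn)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t offset;
    ssize_t got;
    int fd;

    if (page_size <= 0)
        return -1;
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    offset = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size * sizeof(entry));
    got = pread(fd, &entry, sizeof(entry), offset);
    close(fd);
    if (got != (ssize_t)sizeof(entry))
        return -1;
    return pagemap_entry_to_pfn(entry, pfn);
}
```

mlock() alone does not keep the value returned here stable, because the
page can still be migrated; that is the gap something like MADV_PINNED
would close.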

This is from DPDK's description:
https://software.intel.com/content/www/us/en/develop/articles/memory-in-dpdk-part-1-general-concepts.html

"
Whenever a memory area is made available for DPDK to use, DPDK figures
out its physical address by asking the kernel at that time. Since DPDK
uses pinned memory, generally in the form of huge pages, the physical
address of the underlying memory area is not expected to change, so
the hardware can rely on those physical addresses to be valid at all
times, even if the memory itself is not used for some time. DPDK then
uses these physical addresses when preparing I/O transactions to be
done by the hardware, and configures the hardware in such a way that
the hardware is allowed to initiate DMA transactions itself. This
allows DPDK to avoid needless overhead and to perform I/O entirely
from user space.
"

I just think it is inefficient to first allocate memory from
ZONE_MOVABLE, and later migrate it to ZONE_NORMAL.

That said, I agree, we probably should not be adding a new flag at
least as part of this work.
