On 20-Mar-18 2:18 PM, Olivier Matz wrote:
Hi,

On Tue, Mar 20, 2018 at 01:51:31PM +0000, Burakov, Anatoly wrote:
On 20-Mar-18 12:42 PM, Olivier Matz wrote:
On Tue, Mar 20, 2018 at 10:27:55AM +0000, Burakov, Anatoly wrote:
On 19-Mar-18 5:30 PM, Olivier Matz wrote:
Hi Anatoly,

On Sat, Mar 03, 2018 at 01:45:48PM +0000, Anatoly Burakov wrote:
This patchset introduces dynamic memory allocation for DPDK (aka memory
hotplug). Based upon RFC submitted in December [1].

Dependencies (to be applied in specified order):
- IPC bugfixes patchset [2]
- IPC improvements patchset [3]
- IPC asynchronous request API patch [4]
- Function to return number of sockets [5]

Deprecation notices relevant to this patchset:
- General outline of memory hotplug changes [6]
- EAL NUMA node count changes [7]

The vast majority of changes are in the EAL and malloc, the external API
disruption is minimal: a new set of APIs is added for contiguous memory
allocation for rte_memzone, and a few API additions in rte_memory due to
switch to memseg_lists as opposed to memsegs. Every other API change is
internal to EAL, and all of the memory allocation/freeing is handled
through rte_malloc, with no externally visible API changes.

Quick outline of all changes done as part of this patchset:

    * Malloc heap adjusted to handle holes in address space
    * Single memseg list replaced by multiple memseg lists
    * VA space for hugepages is preallocated in advance
    * Added alloc/free for pages happening as needed on rte_malloc/rte_free
    * Added contiguous memory allocation APIs for rte_memzone
    * Integrated Pawel Wodkowski's patch for registering/unregistering memory
      with VFIO [8]
    * Callbacks for registering memory allocations
    * Multiprocess support done via DPDK IPC introduced in 18.02

The biggest difference is a "memseg" now represents a single page (as opposed to
being a big contiguous block of pages). As a consequence, both memzones and
malloc elements are no longer guaranteed to be physically contiguous, unless
the user asks for it at reserve time. To preserve whatever functionality that
was dependent on previous behavior, a legacy memory option is also provided,
however it is expected (or perhaps vainly hoped) to be a temporary solution.

Why multiple memseg lists instead of one? Since memseg is a single page now,
the list of memsegs will get quite big, and we need to locate pages somehow
when we allocate and free them. We could of course just walk the list and
allocate one contiguous chunk of VA space for memsegs, but this
implementation uses separate lists instead in order to speed up many
operations with memseg lists.
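
To illustrate why per-list bookkeeping makes page lookup cheap, here is a minimal sketch, assuming each list covers one preallocated VA range of equally sized pages. The `page_list` struct and `find_page` helper are hypothetical simplifications for illustration, not the structures used in the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a memseg list: one preallocated
 * VA range carved into equally sized pages. */
struct page_list {
    uintptr_t base_va;  /* start of the reserved VA range */
    size_t page_sz;     /* size of each page in this list */
    size_t n_pages;     /* number of page slots in the range */
};

/* Return the index of the list containing addr, or -1 if none does.
 * On success, *page_idx receives the page slot within that list. */
static int
find_page(const struct page_list *lists, size_t n_lists,
          uintptr_t addr, size_t *page_idx)
{
    assert(page_idx != NULL);

    for (size_t i = 0; i < n_lists; i++) {
        const struct page_list *l = &lists[i];
        uintptr_t end = l->base_va + l->page_sz * l->n_pages;

        if (addr >= l->base_va && addr < end) {
            /* Once the right list is known, the slot is plain arithmetic. */
            *page_idx = (addr - l->base_va) / l->page_sz;
            return (int)i;
        }
    }
    return -1;
}
```

The walk is over a handful of lists rather than over every page, and the page slot inside a list comes from a single division.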

For v1, the following limitations are present:
- FreeBSD does not even compile, let alone run
- No 32-bit support
- There are some minor quality-of-life improvements planned that aren't
     ready yet and will be part of v2
- VFIO support is only smoke-tested (but is expected to work), VFIO support
     with secondary processes is not tested; work is ongoing to validate VFIO
     for all use cases
- Dynamic mapping/unmapping memory with VFIO is not supported in sPAPR
     IOMMU mode - help from sPAPR maintainers requested

Nevertheless, this patchset should be testable under 64-bit Linux, and
should work for all use cases bar those mentioned above.

[1] http://dpdk.org/dev/patchwork/bundle/aburakov/Memory_RFC/
[2] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Fixes/
[3] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Improvements/
[4] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Async_Request/
[5] http://dpdk.org/dev/patchwork/bundle/aburakov/Num_Sockets/
[6] http://dpdk.org/dev/patchwork/patch/34002/
[7] http://dpdk.org/dev/patchwork/patch/33853/
[8] http://dpdk.org/dev/patchwork/patch/24484/

I did a quick pass on your patches (unfortunately, I don't have
the time to really dive in it).

I have a few questions/comments:

- This is really a big patchset. Thank you for working on this topic.
     I'll try to test our application with it as soon as possible.

- I see from patch 17 that it is possible that rte_malloc() expands
     the heap by requesting more memory from the OS? Did I understand correctly?
     Today, a good property of rte_malloc() compared to malloc() is that
     it won't interrupt the process (the worst case is a spinlock). This
     is appreciable on a dataplane core. Will it change?

Hi Olivier,

Not sure what you mean by "interrupt the process". The new rte_malloc will
_mostly_ work just like the old one. There are now two levels of locks: the
heap lock, and the system allocation lock. If your rte_malloc call requests
an amount of memory that can be satisfied by already allocated memory, then
only the heap lock is engaged - or, to put it in other words, things work as
before.

When you *don't* have enough memory allocated, previously rte_malloc would
just fail. Now, it instead will lock the second lock and try to allocate
more memory from the system. This requires IPC (to ensure all processes have
allocated/freed the same memory), so this will take way longer (timeout is
set to wait up to 5 seconds, although under normal circumstances it's taking
a lot less - depending on how many processes you have running, but generally
under 100ms), and will block other system allocations (i.e. if another
rte_malloc call on another heap is trying to request more memory from the
system).

So, in short - you can't allocate from the same heap in parallel (same as
before), and you can't have parallel system memory allocation requests
(regardless of which heap they come from). The latter *only* applies to
system memory allocations - that is, if one heap is allocating system memory
while another heap receives an allocation request *and is able to satisfy it
from already allocated memory*, it will not block, because the second lock
is never engaged.
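
The two lock levels described above can be modelled roughly as follows. This is a sketch under assumptions: the `heap` struct, the byte counters and the lock ordering (system lock taken while the heap lock is held) are all hypothetical, and the IPC step that makes the slow path expensive is left out entirely.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one heap: a byte budget guarded by its own lock. */
struct heap {
    pthread_mutex_t lock;
    size_t free_bytes;
};

/* One process-wide lock serializing requests for more system memory. */
static pthread_mutex_t sys_alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t sys_free_bytes = 4u << 20; /* pretend the system has 4M left */

/* Slow path: take memory from the system. Caller holds h->lock. */
static bool
heap_grow(struct heap *h, size_t len)
{
    bool ok = false;

    pthread_mutex_lock(&sys_alloc_lock);
    if (sys_free_bytes >= len) {
        sys_free_bytes -= len;
        h->free_bytes += len;
        ok = true;
    }
    pthread_mutex_unlock(&sys_alloc_lock);
    return ok;
}

/* Fast path touches only the heap lock; the system lock is engaged
 * only when the heap cannot satisfy the request on its own. */
static bool
heap_alloc(struct heap *h, size_t len)
{
    bool ok = true;

    assert(h != NULL);
    pthread_mutex_lock(&h->lock);
    if (h->free_bytes < len)
        ok = heap_grow(h, len - h->free_bytes);
    if (ok)
        h->free_bytes -= len;
    pthread_mutex_unlock(&h->lock);
    return ok;
}
```

Two heaps that both stay on the fast path never contend with each other; only callers that reach `heap_grow` queue up behind `sys_alloc_lock`.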

OK. Let's imagine you are using rte_malloc() on a dataplane core, and
you run out of memory. Previously, the allocation would just fail. Now,
if my understanding is correct, it can block for a long time, which can
be a problem on a dataplane core, because it will cause packet losses,
especially if it also blocks allocations on other cores during that
time. In this case, it could be useful to make the dynamic heap resizing
feature optional.

Why would anyone in their right mind call rte_malloc on fast path? If you're
referring to mempool allocations/deallocations, then this is a completely
separate subject, as mempool alloc/free is not handled by rte_malloc but is
handled by rte_mempool itself - as far as rte_malloc is concerned, that
memory is already allocated and it will not touch it.

As for "making heap resizing feature optional", I'm working on functionality
that would essentially enable that. Specifically, I'm adding APIs to set
allocation limits and a callback which will get triggered once the allocator
tries to allocate beyond said limits, with an option of returning -1 and
thus preventing this allocation from completing. While this is kind of a
roundabout way of doing it, it would have a similar effect.

Calling rte_malloc() in the data path may be required in case the
application needs to allocate an unknown-sized object. I'm not saying
it's a usual or an optimal use case, I just say that it happens.

Waiting for a spinlock is acceptable in datapath, if it is held by
another dataplane core.
Waiting for several hundreds of ms is not an option in that case.

If the feature is going to be optional, it's perfectly fine for me.

Well, there's always an option of running in "legacy mem" mode, which disables memory hotplug completely and will essentially behave like it does right now (allocate VA and IOVA-contiguous segments).

But yes, with said allocation limits API you will essentially be able to control which allocations succeed and which don't. It's not exactly "making it optional", but you can have control over system memory allocations that would enable that. For example, at init you allocate all your necessary data structures, and then you set the memory allocation limits in such a way that you can neither allocate nor deallocate any pages whatsoever once you start up your fast-path. This way, regular malloc will still work, but any page allocation/deallocation request will not go through.
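
As a rough illustration of the limit-plus-callback idea, here is a self-contained sketch. The API described above is still being worked on, so every name here (`page_allocator`, `limit_cb_t`, `pages_alloc`, `pages_freeze`) is hypothetical, and only the allocation side is modelled, not deallocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: return 0 to allow, -1 to reject. */
typedef int (*limit_cb_t)(size_t cur_len, size_t new_len);

struct page_allocator {
    size_t allocated;  /* bytes of page memory currently allocated */
    size_t limit;      /* beyond this, the callback is consulted */
    limit_cb_t cb;     /* NULL means no restriction */
};

/* Callback that vetoes every request above the limit. */
static int
reject_all(size_t cur_len, size_t new_len)
{
    (void)cur_len;
    (void)new_len;
    return -1;
}

/* Account for len more bytes of pages; 0 on success, -1 if vetoed. */
static int
pages_alloc(struct page_allocator *a, size_t len)
{
    size_t new_len;

    assert(a != NULL);
    new_len = a->allocated + len;
    if (new_len > a->limit && a->cb != NULL &&
        a->cb(a->allocated, new_len) < 0)
        return -1;
    a->allocated = new_len;
    return 0;
}

/* After init: pin the limit at the current usage so any growth is vetoed. */
static void
pages_freeze(struct page_allocator *a)
{
    assert(a != NULL);
    a->limit = a->allocated;
    a->cb = reject_all;
}
```

An application would allocate everything it needs at init, call something like `pages_freeze()`, and only then start its fast path; later requests that fit in already allocated memory are unaffected because they never reach the page allocator.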



I have another question about the patchset. Today, it is not really
possible for an application to allocate a page. If you want a full page
(ex: 2M), you need to allocate 4M because the rte_malloc layer adds a
header before the allocated memory. Therefore, if the memory is
fragmented a lot with only 2M pages, you cannot allocate them as pages.

Is it possible, with your patchset or in the future, to have access
to a page-based allocator? The use case is for an application to be able
to ask for pages in dpdk memory and remap them into virtually contiguous
memory.

Pages returned from our allocator are already virtually contiguous, there is
no need to do any remapping. If user specifies proper size and alignment
(i.e. reserve a memzone with RTE_MEMZONE_2MB and with 2M size and
alignment), it will essentially cause the allocator to return a memzone
that's exactly page-size long. Yes, in the background, it will allocate
another page to store malloc metadata, and yes, memory will become
fragmented if multiple such allocations occur. It is not possible
(neither now nor in the future planned work) to do what you describe unless
we store malloc data separately from allocated memory (which can be done,
but is a non-trivial amount of work).
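
The cost of the in-place metadata can be worked out with simple arithmetic. The sketch below assumes the header sits immediately before a page-aligned payload; the helper name and the 128-byte header size used in the example are assumptions for illustration, not values taken from the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Pages touched by a page-aligned payload of len bytes whose header of
 * hdr_len bytes is stored immediately before it. */
static size_t
pages_touched(size_t page_sz, size_t hdr_len, size_t len)
{
    assert(page_sz != 0);

    /* The payload starts on a page boundary, so it covers
     * ceil(len / page_sz) pages. */
    size_t payload_pages = (len + page_sz - 1) / page_sz;

    /* The header ends where the payload begins, so it lies entirely
     * in the page(s) before the payload. */
    size_t hdr_pages = (hdr_len + page_sz - 1) / page_sz;

    return payload_pages + hdr_pages;
}
```

With 2M pages, `pages_touched(2u << 20, 128, 2u << 20)` gives 2, i.e. 4M of memory backs a 2M request, while a zero-length header would give 1.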

Malloc stores its metadata right in the hugepage mostly for multiprocess
purposes - so that the entire heap is always shared between all processes.
If we want to store malloc metadata separately from allocated memory, a
replacement mechanism to share heap metadata will need to be put in place
(which, again, can be done, but is a non-trivial amount of work - arguably
for questionable gain).

That said, the use case you have described is already possible - just allocate
multiple pages from DPDK as a memzone, and overlay your own memory allocator
over that memory. This will have the same effect.
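
One way such an overlay could look is a small page pool laid over a caller-provided region (for instance, the address of a reserved memzone). This is only a sketch: the `page_pool` type and its functions are hypothetical, the bitmap caps the pool at 64 pages, and the metadata is process-local, so it does not address the multiprocess sharing discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAGES 64

/* Hypothetical page pool over a caller-provided region. Metadata lives
 * outside the region, so every page is handed out whole. */
struct page_pool {
    uint8_t *base;
    size_t page_sz;
    size_t n_pages;
    uint64_t used;  /* bit i set: page i is in use */
};

static void
pool_init(struct page_pool *p, void *base, size_t page_sz, size_t n_pages)
{
    assert(page_sz != 0 && n_pages <= MAX_PAGES);
    p->base = base;
    p->page_sz = page_sz;
    p->n_pages = n_pages;
    p->used = 0;
}

/* Hand out the lowest free page, or NULL if the pool is exhausted. */
static void *
pool_get_page(struct page_pool *p)
{
    for (size_t i = 0; i < p->n_pages; i++) {
        if (!(p->used & (UINT64_C(1) << i))) {
            p->used |= UINT64_C(1) << i;
            return p->base + i * p->page_sz;
        }
    }
    return NULL;
}

/* Return a page previously obtained from pool_get_page(). */
static void
pool_put_page(struct page_pool *p, void *page)
{
    size_t i = (size_t)((uint8_t *)page - p->base) / p->page_sz;

    assert(i < p->n_pages);
    p->used &= ~(UINT64_C(1) << i);
}
```

Because the bookkeeping is kept out of band, a region of N pages yields N usable pages, which is the property the in-place malloc header cannot provide.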

Yes, that's currently what I'm doing: to get one 2M page, I'm allocating
more 2M with 2M alignment, which actually results in a 4M allocation. My
problem today is when the huge pages are already fragmented at dpdk
start (i.e. only isolated pages). So an allocation of > 2M would fail.

So your patchset mostly solves that issue, because rte_malloc() does not
request physically contiguous memory anymore, which means that
physically isolated hugepages are now virtually contiguous, right? So
rte_malloc(4M) will always be successful until the memory is virtually
fragmented (i.e. after several malloc/free).

Yes, that is correct. We preallocate all VA space in advance, so unless you fragment your VA space by making multiple allocations in this way up to a point where you run out of pages, you should be OK.

As I said, it is possible to rewrite the heap in a way that will do away with storing metadata in-place, and that will solve some of the tricky issues with the memory allocator (such as pad elements, which require special handling everywhere), however this metadata still has to be stored somewhere in shared memory in order to be shared across processes, and that poses a problem because at some point we may hit a condition where we have plenty of free space but have exhausted our malloc element list and cannot allocate more (and we can't realloc because, well, multiprocess). So, such a scenario will come with its own set of challenges. Sadly, there's no free lunch :(


Thank you for the clarification.



--
Thanks,
Anatoly
