On 19-Dec-17 11:14 AM, Anatoly Burakov wrote:
This patchset introduces a prototype implementation of dynamic memory allocation
for DPDK. It is intended to start a conversation and build consensus on the best
way to implement this functionality. The patchset works well enough to pass all
unit tests, and to work with traffic forwarding, provided the device drivers are
adjusted to ensure contiguous memory allocation where it matters.

The vast majority of changes are in the EAL and malloc; the external API
disruption is minimal: a new set of APIs is added for contiguous memory
allocation (for rte_malloc and rte_memzone), and there are a few API
additions in rte_memory. Every other API change is internal to EAL, and all
of the memory allocation/freeing is handled through rte_malloc, with no
externally visible API changes, aside from the call to get the physmem
layout, which no longer makes sense given that there are multiple memseg
lists.
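
For illustration, here is a rough sketch of how the contiguous allocation
could look from a driver/application point of view. The function names
below, rte_memzone_reserve_contig() and rte_malloc_socket_contig(), are
placeholders and may not match what the patchset actually adds:

#include <stddef.h>
#include <rte_malloc.h>
#include <rte_memzone.h>
#include <rte_lcore.h>

/* Hypothetical names -- the point is that callers needing physically
 * contiguous memory now have to ask for it explicitly. */
static const struct rte_memzone *
alloc_hw_ring(size_t len)
{
	/* contiguous counterpart of rte_memzone_reserve() */
	return rte_memzone_reserve_contig("hw_ring", len, rte_socket_id(), 0);
}

static void *
alloc_dma_buf(size_t len)
{
	/* contiguous counterpart of rte_malloc_socket() */
	return rte_malloc_socket_contig("dma_buf", len, 0, rte_socket_id());
}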

Quick outline of all changes done as part of this patchset:

  * Malloc heap adjusted to handle holes in address space
  * Single memseg list replaced by multiple expandable memseg lists
  * VA space for hugepages is preallocated in advance
  * Added dynamic alloc/free for pages, happening as needed on malloc/free
  * Added contiguous memory allocation APIs for rte_malloc and rte_memzone
  * Integrated Pawel Wodkowski's patch [1] for registering/unregistering memory
    with VFIO

The biggest difference is that a "memseg" now represents a single page (as
opposed to being a big contiguous block of pages). As a consequence, both
memzones and malloc elements are no longer guaranteed to be physically
contiguous, unless the user asks for it. To preserve whatever functionality
was dependent on the previous behavior, a legacy memory option is also
provided, however it is expected to be a temporary solution. The drivers
weren't adjusted in this patchset, and it is expected that whoever tests the
drivers with this patchset will modify the relevant drivers to support the
new set of APIs. Basic testing with forwarding traffic was performed, both
with UIO and VFIO, and no performance degradation was observed.

Why multiple memseg lists instead of one? It makes things easier on a number
of fronts. Since a memseg is a single page now, the list will get quite big,
and we need to locate pages somehow when we allocate and free them. We could
of course just walk the list and allocate one contiguous chunk of VA space
for memsegs, but I chose to use separate lists instead, to speed up many
operations on the list.
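
To make that a bit more concrete, here is a rough sketch of the general
idea (not the actual structures from the patchset; struct and field names
are illustrative only):

#include <stdint.h>
#include <stddef.h>

/* one memseg now describes exactly one page */
struct memseg {
	void *addr;    /* VA of the page */
	uint64_t iova; /* physical/IO address of the page */
	size_t len;    /* page size */
};

/* one list per (socket, page size) combination, with its VA range
 * reserved up front so pages can be mapped in and out on demand */
struct memseg_list {
	void *base_va;
	size_t page_sz;
	int socket_id;
	struct memseg *pages; /* expandable array of single-page memsegs */
	size_t n_pages;
};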

It would be great to see the following discussions within the community
regarding both the current implementation and future work:

  * Any suggestions to improve the current implementation. The whole system
    with multiple memseg lists is kind of unwieldy, so maybe there are better
    ways to do the same thing. Maybe use a single list after all? We're not
    expecting malloc/free on the hot path, so maybe it doesn't matter that we
    have to walk a list of potentially thousands of pages?
  * Pluggable memory allocators. Right now, allocators are hardcoded, but down
    the line it would be great to have custom allocators (e.g. for externally
    allocated memory). I've tried to keep the memalloc API minimal and generic
    enough to be able to easily change it down the line, but suggestions are
    welcome. Memory drivers, with ops for alloc/free etc.? (See the rough
    sketch of such an ops structure after this list.)
  * Memory tagging. This is related to the previous item. Right now, we can
    only ask malloc to allocate memory by page size, but one could potentially
    have different memory regions backed by pages of similar sizes (for
    example, locked 1G pages, to completely avoid TLB misses, alongside
    regular 1G pages), and it would be good to have that kind of mechanism to
    distinguish between different memory types available to a DPDK
    application. One could, for example, tag memory by "purpose" (e.g. "fast",
    "slow"), or in other ways. (A rough sketch of what tagging could look like
    also follows after this list.)
  * Secondary process implementation, in particular when it comes to allocating/
    freeing new memory. The current plan is to make use of the RPC mechanism
    proposed by Jianfeng [2] to communicate between primary and secondary
    processes, however other suggestions are welcome.
  * Support for non-hugepage memory. This work is planned down the line. Aside
    from obvious concerns about physical addresses, 4K pages are small and will
    eat up enormous amounts of memseg list space, so my proposal would be to
    allocate 4K pages in bigger blocks (say, 2MB).
  * 32-bit support. The current implementation lacks it, and I don't see a
    trivial way to make it work if we are to preallocate huge chunks of VA
    space in advance. We could limit it to 1G per page size, but even that, on
    multiple sockets, won't work that well, and we can't know in advance what
    kind of memory the user will try to allocate. Drop it? Leave it in legacy
    mode only?
  * Preallocation. Right now, malloc will free any and all memory that it can,
    which could lead to a (perhaps counterintuitive) situation where a user
    calls DPDK with --socket-mem=1024,1024, does a single "rte_free" and loses
    all of the preallocated memory in the process. Would preallocating memory
    *and keeping it no matter what* be a valid use case? E.g. if DPDK is run
    without any memory requirements specified, grow and shrink as needed, but
    if DPDK was asked to preallocate memory, allow growing but not shrinking
    past the preallocated amount?
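
Regarding the pluggable allocators item above, a very rough sketch of what
a "memory driver" ops structure could look like (purely illustrative; none
of these names exist in the patchset):

#include <stddef.h>

/* hypothetical ops a "memory driver" could register with EAL */
struct memalloc_ops {
	const char *name; /* e.g. "hugepage", "external" */
	/* back a memseg with a page of memory */
	void *(*alloc_page)(size_t size, int socket_id);
	/* release a previously allocated page */
	int (*free_page)(void *addr, size_t size);
};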
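
And for the memory tagging item, one (again, purely hypothetical) way the
malloc API could express it -- rte_malloc_tagged() below does not exist,
it is only meant to show the idea of allocating by "purpose":

#include <stddef.h>
#include <rte_malloc.h>

static void *
alloc_fast(size_t len)
{
	/* hypothetical tagged variant of rte_malloc_socket(): "fast" could
	 * map to e.g. locked 1G pages, "slow" to regular hugepages */
	return rte_malloc_tagged("fast", "lookup_table", len, 0);
}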

Any other feedback about things I didn't think of or missed is greatly
appreciated.

[1] http://dpdk.org/dev/patchwork/patch/24484/
[2] http://dpdk.org/dev/patchwork/patch/31838/

Hi all,

Could this proposal be discussed at the next tech board meeting?

--
Thanks,
Anatoly
