On 19-Dec-17 11:14 AM, Anatoly Burakov wrote:
This patchset introduces a prototype implementation of dynamic memory allocation
for DPDK. It is intended to start a conversation and build consensus on the best
way to implement this functionality. The patchset works well enough to pass all
unit tests, and to work with traffic forwarding, provided the device drivers are
adjusted to ensure contiguous memory allocation where it matters.
The vast majority of changes are in the EAL and malloc; the external API
disruption is minimal. A new set of APIs is added for contiguous memory
allocation (for rte_malloc and rte_memzone), along with a few API additions
in rte_memory. Every other API change is internal to EAL, and all of the
memory allocation/freeing is handled through rte_malloc, with no externally
visible API changes, aside from the call to get the physmem layout, which no
longer makes sense given that there are multiple memseg lists.
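To illustrate the direction, a rough sketch (the names below are placeholders,
not the actual prototypes from the patchset): physical contiguity becomes
something a caller asks for explicitly.

#include <rte_malloc.h>
#include <rte_memzone.h>

/*
 * Illustrative only: these are stand-in names for the proposed
 * contiguous-allocation variants, not the patchset's actual prototypes.
 */
void *rte_malloc_contig(const char *type, size_t size, unsigned int align);
const struct rte_memzone *rte_memzone_reserve_contig(const char *name,
		size_t len, int socket_id, unsigned int flags);

static const struct rte_memzone *
reserve_hw_ring(const char *name, size_t len, int socket)
{
	/* a driver that programs physical addresses into hardware must
	 * now request contiguous backing memory explicitly */
	return rte_memzone_reserve_contig(name, len, socket, 0);
}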
Quick outline of all changes done as part of this patchset:
* Malloc heap adjusted to handle holes in address space
* Single memseg list replaced by multiple expandable memseg lists
* VA space for hugepages is preallocated in advance
* Added dynamic alloc/free for pages, happening as needed on malloc/free
* Added contiguous memory allocation APIs for rte_malloc and rte_memzone
* Integrated Pawel Wodkowski's patch [1] for registering/unregistering memory
with VFIO
The biggest difference is that a "memseg" now represents a single page (as
opposed to being a big contiguous block of pages). As a consequence, both
memzones and malloc elements are no longer guaranteed to be physically
contiguous unless the user asks for it. To preserve whatever functionality
depended on the previous behavior, a legacy memory option is also provided;
however, it is expected to be a temporary solution. The drivers weren't
adjusted in this patchset, and it is expected that whoever tests the drivers
with this patchset will modify the relevant drivers to support the new set of
APIs. Basic testing with traffic forwarding was performed, both with UIO and
VFIO, and no performance degradation was observed.
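For illustration, here is a rough sketch of how a driver could verify physical
contiguity page by page using the existing rte_mem_virt2phy(), assuming 'pgsz'
is the size of the pages backing the buffer:

#include <stdbool.h>
#include <stddef.h>
#include <rte_memory.h>

static bool
buf_is_phys_contig(const void *buf, size_t len, size_t pgsz)
{
	const char *va = buf;
	phys_addr_t first = rte_mem_virt2phy(va);
	size_t off;

	if (first == RTE_BAD_PHYS_ADDR)
		return false;
	/* sample one address per backing page, plus the last byte */
	for (off = pgsz; off < len; off += pgsz) {
		if (rte_mem_virt2phy(va + off) != first + off)
			return false;
	}
	if (len > 0 && rte_mem_virt2phy(va + len - 1) != first + (len - 1))
		return false;
	return true;
}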
Why multiple memseg lists instead of one? It makes things easier on a number
of fronts. Since a memseg is now a single page, the list will get quite big,
and we need a way to locate pages when we allocate and free them. We could of
course just walk the list and allocate one contiguous chunk of VA space for
memsegs, but I chose to use separate lists instead, to speed up many
operations on the list.
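Roughly, the idea looks like this (field names are illustrative, not the
actual definitions from the patchset): each list covers one page size/socket
combination and maps into a preallocated VA region, so locating a page is
simple arithmetic instead of a list walk.

#include <stdint.h>
#include <rte_memory.h>

struct memseg_list {
	void *base_va;          /* start of the preallocated VA region */
	uint64_t page_sz;       /* size of the pages backing this list */
	int socket_id;          /* NUMA node the pages belong to */
	uint32_t len;           /* number of slots in 'arr' */
	struct rte_memseg *arr; /* one entry per page, may contain holes */
};

/* finding a page's slot is O(1): idx = (va - base_va) / page_sz */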
It would be great to see the following discussions within the community,
regarding both the current implementation and future work:
* Any suggestions to improve the current implementation. The whole system with
multiple memseg lists is kind of unwieldy, so maybe there are better ways to
do the same thing. Maybe use a single list after all? We're not expecting
malloc/free on the hot path, so maybe it doesn't matter that we have to walk
a list of potentially thousands of pages?
* Pluggable memory allocators. Right now, allocators are hardcoded, but down
the line it would be great to have custom allocators (e.g. for externally
allocated memory). I've tried to keep the memalloc API minimal and generic
enough to be able to easily change it down the line, but suggestions are
welcome. Memory drivers, with ops for alloc/free etc.? (See the ops sketch
after this list.)
* Memory tagging. This is related to the previous item. Right now, we can only
ask malloc to allocate memory by page size, but one could potentially have
different memory regions backed by pages of similar sizes (for example,
locked 1G pages, to completely avoid TLB misses, alongside regular 1G pages),
and it would be good to have a mechanism to distinguish between different
memory types available to a DPDK application. One could, for example, tag
memory by "purpose" (e.g. "fast", "slow"), or in other ways. (A declaration
sketch follows this list.)
* Secondary process implementation, in particular when it comes to allocating/
freeing new memory. The current plan is to use the RPC mechanism proposed by
Jianfeng [2] to communicate between primary and secondary processes, but
other suggestions are welcome.
* Support for non-hugepage memory. This work is planned down the line. Aside
from obvious concerns about physical addresses, 4K pages are small and will
eat up enormous amounts of memseg list space, so my proposal would be to
allocate 4K pages in bigger blocks (say, 2MB).
* 32-bit support. The current implementation lacks it, and I don't see a
trivial way to make it work if we are to preallocate huge chunks of VA space
in advance. We could limit it to 1G per page size, but even that, on multiple
sockets, won't work that well, and we can't know in advance what kind of
memory the user will try to allocate. Drop it? Leave it in legacy mode only?
* Preallocation. Right now, malloc will free any and all memory that it can,
which could lead to a (perhaps counterintuitive) situation where a user
calls DPDK with --socket-mem=1024,1024, does a single "rte_free", and loses
all of the preallocated memory in the process. Would preallocating memory
*and keeping it no matter what* be a valid use case? E.g. if DPDK is run
without any memory requirements specified, grow and shrink as needed; but if
DPDK was asked to preallocate memory, we can grow but cannot shrink past the
preallocated amount? (A sketch of such a policy follows this list.)
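As mentioned in the pluggable allocators item above, a "memory driver" could
boil down to an ops table along these lines (hypothetical names and
signatures, for discussion only):

#include <stdint.h>

struct rte_memalloc_ops {
	/* map one page of 'page_sz' bytes on 'socket_id', return its VA */
	void *(*alloc_page)(uint64_t page_sz, int socket_id);
	/* unmap a previously allocated page */
	int (*free_page)(void *addr, uint64_t page_sz);
	/* resolve the IOVA/physical address backing 'addr' */
	uint64_t (*get_iova)(const void *addr);
};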
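For the memory tagging item, a tag-aware entry point might look like the
following declaration sketch (purely hypothetical, nothing like it exists in
the patchset yet):

#include <stddef.h>

/*
 * The tag names a memory class registered at init time ("fast" for
 * locked 1G pages, "slow" for regular pages, and so on).
 */
void *rte_malloc_tagged(const char *tag, const char *type,
		size_t size, unsigned int align);

/* e.g.: void *r = rte_malloc_tagged("fast", "hash_tbl", 1 << 20, 0); */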
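And for the preallocation item, the "grow freely, never shrink below the
preallocated amount" policy could come down to a check like this (the struct
fields are made up for illustration):

#include <stdbool.h>
#include <stdint.h>

struct heap_state {
	uint64_t total_size;   /* bytes currently backing the heap */
	uint64_t preallocated; /* bytes requested via --socket-mem */
};

static bool
can_release_page(const struct heap_state *heap, uint64_t page_sz)
{
	/* a page may be unmapped only if the heap stays at or above
	 * what the user asked for at startup */
	return heap->total_size - page_sz >= heap->preallocated;
}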
Any other feedback about things I didn't think of or missed is greatly
appreciated.
[1] http://dpdk.org/dev/patchwork/patch/24484/
[2] http://dpdk.org/dev/patchwork/patch/31838/
Hi all,
Could this proposal be discussed at the next tech board meeting?
--
Thanks,
Anatoly