[PATCH] vfio: map contiguous areas in one go

Tested-by: xix.zh...@intel.com

Thanks
> -----Original Message-----
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of dev-requ...@dpdk.org
> Sent: February 25, 2020 21:50
> To: dev@dpdk.org
> Subject: dev Digest, Vol 288, Issue 27
>
> Today's Topics:
>
>    1. Re: [PATCH] doc/mlx5: update mlx5 guide (Thomas Monjalon)
>    2. Re: [PATCH] doc: describe the pktmbuf pool with pinned
>       external memory (Thomas Monjalon)
>    3. [PATCH] vfio: map contiguous areas in one go (Anatoly Burakov)
>    4. Re: [RFC 0/6] New sync modes for ring (Ananyev, Konstantin)
>    5. Re: [PATCH] vfio: map contiguous areas in one go (Ray Kinsella)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 25 Feb 2020 14:19:10 +0100
> From: Thomas Monjalon <tho...@monjalon.net>
> To: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] doc/mlx5: update mlx5 guide
> Message-ID: <2125420.IFkqi6BYcA@xps>
> Content-Type: text/plain; charset="us-ascii"
>
> 24/02/2020 18:57, Viacheslav Ovsiienko:
> > - the metadata limitation is described
> > - the no-inline hint flag is described
> >
> > Signed-off-by: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> > ---
>
> Split in 2 patches and applied, thanks
>
> ------------------------------
>
> Message: 2
> Date: Tue, 25 Feb 2020 14:22:13 +0100
> From: Thomas Monjalon <tho...@monjalon.net>
> To: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> Cc: dev@dpdk.org, olivier.m...@6wind.com
> Subject: Re: [dpdk-dev] [PATCH] doc: describe the pktmbuf pool with
>         pinned external memory
> Message-ID: <13189717.lVVuGzaMjS@xps>
> Content-Type: text/plain; charset="us-ascii"
>
> 24/02/2020 18:58, Viacheslav Ovsiienko:
> > Document the new mbuf pools with external pinned buffers.
> >
> > Signed-off-by: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> > ---
> >  doc/guides/prog_guide/mbuf_lib.rst | 34 ++++++++++++++++++++++++++++++++++
>
> Please submit this documentation separately for review by Olivier.
>
> >  doc/guides/rel_notes/release_20_02.rst | 5 +++++
>
> Applying only the release notes in 20.02, thanks
>
> ------------------------------
>
> Message: 3
> Date: Tue, 25 Feb 2020 13:24:48 +0000
> From: Anatoly Burakov <anatoly.bura...@intel.com>
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID:
>         <c6550c8e66dbc7a54a0b495ecda58b0eb79c07ca.1582637079.git.anatoly.bura...@intel.com>
>
> Currently, when we are creating DMA mappings for memory that is either
> external or backed by hugepages in IOVA-as-PA mode, we assume that each
> page is necessarily discontiguous. This may not actually be the case,
> especially for external memory, where the user is able to create their
> own IOVA table and make it contiguous. This is a problem because VFIO
> has a limited number of DMA mappings, and it does not appear to
> concatenate adjacent mappings: each mapping is treated as separate,
> even when the mappings cover adjacent areas.
>
> Fix this so that we always map contiguous memory in a single chunk, as
> opposed to mapping each segment separately.
> Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> ---
>  lib/librte_eal/linux/eal/eal_vfio.c | 59 +++++++++++++++++++++++++----
>  1 file changed, 51 insertions(+), 8 deletions(-)
>
> diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
> index 01b5ef3f42..4502aefed3 100644
> --- a/lib/librte_eal/linux/eal/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal/eal_vfio.c
> @@ -514,9 +514,11 @@ static void
>  vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>                 void *arg __rte_unused)
>  {
> +       rte_iova_t iova_start, iova_expected;
>         struct rte_memseg_list *msl;
>         struct rte_memseg *ms;
>         size_t cur_len = 0;
> +       uint64_t va_start;
>
>         msl = rte_mem_virt2memseg_list(addr);
>
> @@ -545,22 +547,63 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>  #endif
>         /* memsegs are contiguous in memory */
>         ms = rte_mem_virt2memseg(addr, msl);
> +
> +       /*
> +        * This memory is not guaranteed to be contiguous, but it still could
> +        * be, or it could have some small contiguous chunks. Since the number
> +        * of VFIO mappings is limited, and VFIO appears to not concatenate
> +        * adjacent mappings, we have to do this ourselves.
> +        *
> +        * So, find contiguous chunks, then map them.
> +        */
> +       va_start = ms->addr_64;
> +       iova_start = iova_expected = ms->iova;
>         while (cur_len < len) {
> +               bool new_contig_area = ms->iova != iova_expected;
> +               bool last_seg = (len - cur_len) == ms->len;
> +               bool skip_last = false;
> +
> +               /* only do mappings when current contiguous area ends */
> +               if (new_contig_area) {
> +                       if (type == RTE_MEM_EVENT_ALLOC)
> +                               vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                               iova_start,
> +                                               iova_expected - iova_start, 1);
> +                       else
> +                               vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                               iova_start,
> +                                               iova_expected - iova_start, 0);
> +                       va_start = ms->addr_64;
> +                       iova_start = ms->iova;
> +               }
>                 /* some memory segments may have invalid IOVA */
>                 if (ms->iova == RTE_BAD_IOVA) {
>                         RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
>                                         ms->addr);
> -                       goto next;
> +                       skip_last = true;
>                 }
> -               if (type == RTE_MEM_EVENT_ALLOC)
> -                       vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -                                       ms->iova, ms->len, 1);
> -               else
> -                       vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -                                       ms->iova, ms->len, 0);
> -next:
> +               iova_expected = ms->iova + ms->len;
>                 cur_len += ms->len;
>                 ++ms;
> +
> +               /*
> +                * don't count previous segment, and don't attempt to
> +                * dereference a potentially invalid pointer.
> +                */
> +               if (skip_last && !last_seg) {
> +                       iova_expected = iova_start = ms->iova;
> +                       va_start = ms->addr_64;
> +               } else if (!skip_last && last_seg) {
> +                       /* this is the last segment and we're not skipping */
> +                       if (type == RTE_MEM_EVENT_ALLOC)
> +                               vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                               iova_start,
> +                                               iova_expected - iova_start, 1);
> +                       else
> +                               vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                               iova_start,
> +                                               iova_expected - iova_start, 0);
> +               }
>         }
>  #ifdef RTE_ARCH_PPC_64
>         cur_len = 0;
> --
> 2.17.1
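For context on the external-memory case the commit message mentions, below is a
minimal sketch of how an application ends up with the contiguous IOVA layout
this patch takes advantage of: it adds externally allocated memory to a malloc
heap together with its own, deliberately contiguous IOVA table. The heap name,
page count and page size are illustrative assumptions, and the sketch assumes
(as the commit message's external-memory scenario suggests) that adding heap
memory triggers the mem event callback patched above.

    #include <rte_malloc.h>
    #include <rte_memory.h>

    #define EXT_PAGE_SZ (2ULL * 1024 * 1024)  /* 2M pages, illustrative */
    #define EXT_N_PAGES 64

    /*
     * Add externally allocated memory to a malloc heap with a contiguous
     * IOVA table, so the whole area forms one contiguous chunk.
     */
    static int
    add_contig_ext_heap(void *va, rte_iova_t iova_base)
    {
            rte_iova_t iovas[EXT_N_PAGES];
            unsigned int i;

            /* make page i's IOVA follow page i-1's: one contiguous area */
            for (i = 0; i < EXT_N_PAGES; i++)
                    iovas[i] = iova_base + (rte_iova_t)i * EXT_PAGE_SZ;

            if (rte_malloc_heap_create("ext_heap") != 0)
                    return -1;

            return rte_malloc_heap_memory_add("ext_heap", va,
                            EXT_N_PAGES * EXT_PAGE_SZ, iovas, EXT_N_PAGES,
                            EXT_PAGE_SZ);
    }

With a contiguous table like this, the patched callback can issue a single
vfio_dma_mem_map() call for the whole area instead of one per page, conserving
the limited pool of VFIO DMA mappings.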
>
> ------------------------------
>
> Message: 4
> Date: Tue, 25 Feb 2020 13:41:14 +0000
> From: "Ananyev, Konstantin" <konstantin.anan...@intel.com>
> To: Stephen Hemminger <step...@networkplumber.org>, Jerin Jacob
>         <jerinjac...@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>, Olivier Matz <olivier.m...@6wind.com>,
>         "d...@linux.vnet.ibm.com" <d...@linux.vnet.ibm.com>
> Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> Message-ID:
>         <SN6PR11MB255845FB6665C2A2D5D76D229AED0@SN6PR11MB2558.namprd11.prod.outlook.com>
> Content-Type: text/plain; charset="us-ascii"
>
> > > > Upfront note - this RFC is not a complete patch.
> > > > It introduces an ABI breakage, plus it doesn't update the
> > > > ring_elem code properly, etc.
> > > > I plan to deal with all these things in later versions.
> > > > Right now I seek initial feedback on the proposed ideas.
> > > > I would also ask people to repeat the performance tests (see
> > > > below) on their platforms to confirm the impact.
> > > >
> > > > More and more customers use (or try to use) DPDK-based apps within
> > > > overcommitted systems (multiple active threads over the same
> > > > physical cores): VM, container deployments, etc.
> > > > One quite common problem they hit: Lock-Holder-Preemption (LHP)
> > > > with rte_ring.
> > > > LHP is quite a common problem for spin-based sync primitives
> > > > (spin-locks, etc.) on overcommitted systems.
> > > > The situation gets much worse when some sort of fair-locking
> > > > technique is used (ticket-lock, etc.), as then not only the lock
> > > > owner's but also the lock waiters' scheduling order matters a lot.
> > > > This is a well-known problem for kernels running within VMs:
> > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > The problem with rte_ring is that while head acquisition is a sort
> > > > of unfair locking, waiting on the tail is very similar to a ticket
> > > > lock scheme - the tail has to be updated in a particular order.
> > > > That makes the current rte_ring implementation perform really
> > > > poorly in some overcommitted scenarios.
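To make the ticket-lock comparison above concrete, here is a simplified sketch
of the tail update each finishing enqueuer/dequeuer performs, modeled on
rte_ring's update_tail() helper (memory barriers and other details are elided;
treat this as an approximation, not the exact upstream code):

    #include <rte_ring.h>
    #include <rte_pause.h>

    /* simplified sketch of rte_ring's tail update (barriers elided) */
    static inline void
    update_tail_sketch(struct rte_ring_headtail *ht, uint32_t old_val,
                    uint32_t new_val)
    {
            /*
             * Threads must publish their slots in the order in which they
             * reserved head space. If the thread that owns the next tail
             * value gets preempted, every later thread spins here - which
             * is why waiter scheduling order matters, just as with a
             * ticket lock.
             */
            while (ht->tail != old_val)
                    rte_pause();

            ht->tail = new_val;
    }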
> > > Rather than reform rte_ring to fit this scenario, it would make
> > > more sense to me to introduce another primitive.
>
> I don't see many advantages it would bring us.
> As disadvantages: for developers and maintainers - code duplication;
> for end users - extra code churn and a removed ability to mix and match
> different sync modes in one ring.
>
> > > The current lockless ring performs very well for the isolated
> > > thread model that DPDK was built around. This looks like a case of
> > > customers violating the usage model of the DPDK and then being
> > > surprised at the fallout.
>
> For customers using the isolated thread model, nothing should change
> (both in terms of API and performance).
> Existing sync modes MP/MC, SP/SC are kept untouched, set up in the same
> way (via flags and _init_), and MP/MC remains the default one.
> From the other side, I don't see why we should ignore customers that
> want to use their DPDK apps in different deployment scenarios.
>
> > I agree with Stephen here.
> >
> > I think adding more runtime checks in enqueue() and dequeue() will
> > have a bad effect on low-end cores too.
>
> We do have a run-time check in our current enqueue()/dequeue()
> implementation.
> In fact we support both modes: we have the generic
> rte_ring_enqueue(/dequeue)_bulk(/burst), where sync behaviour is
> determined at runtime by the value of prod(/cons).single,
> or the user can call the rte_ring_(mp/sp)_enqueue_* functions directly.
> This RFC follows exactly the same paradigm:
> rte_ring_enqueue(/dequeue)_bulk(/burst) is kept generic and its
> behaviour is determined at runtime, by the value of
> prod(/cons).sync_type,
> or the user can call enqueue/dequeue with a particular sync mode
> directly: rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
> The only thing that changed: the format of prod/cons can now differ
> depending on the mode selected at _init_.
> So you can't create a ring for, let's say, SP mode and then in the
> middle of the data path change your mind and start using MP_RTS mode.
> For the existing modes (SP/MP, SC/MC) the format remains the same and
> the user can still use them interchangeably, though of course that is
> an error-prone practice.
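A condensed sketch of the runtime dispatch described above; the sync_type
field, the enum values, and the rts/hts entry points follow the RFC's proposed
naming and are assumptions here, not established API at this point:

    #include <rte_ring.h>

    /* condensed sketch of the RFC's proposed runtime dispatch */
    static __rte_always_inline unsigned int
    ring_enqueue_bulk_sketch(struct rte_ring *r, void * const *obj_table,
                    unsigned int n, unsigned int *free_space)
    {
            switch (r->prod.sync_type) {
            case RTE_RING_SYNC_MT:     /* classic multi-producer */
                    return rte_ring_mp_enqueue_bulk(r, obj_table, n,
                                    free_space);
            case RTE_RING_SYNC_ST:     /* single producer */
                    return rte_ring_sp_enqueue_bulk(r, obj_table, n,
                                    free_space);
            case RTE_RING_SYNC_MT_RTS: /* relaxed tail sync, new in RFC */
                    return rte_ring_mp_rts_enqueue_bulk(r, obj_table, n,
                                    free_space);
            case RTE_RING_SYNC_MT_HTS: /* head/tail sync, new in RFC */
                    return rte_ring_mp_hts_enqueue_bulk(r, obj_table, n,
                                    free_space);
            }
            return 0;
    }

The existing behaviour is the prod.single check; the RFC widens it to a
multi-valued sync_type, which is why the per-ring format has to be fixed at
_init_ time.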
> > But I agree with the problem statement that in the virtualization
> > use case, it may be possible to have N virtual cores run on one
> > physical core.
> >
> > IMO, the best solution would be keeping the ring API the same and
> > having a different flavor at "compile-time", something like liburcu
> > did for accommodating different flavors.
> >
> > i.e. urcu-qsbr.h and urcu-bp.h have identical definitions of the API.
> > The application can simply include ONE header file in a C file,
> > based on the flavor.
>
> I don't think that is a flexible enough approach.
> In one app the user might need to have several rings with different
> sync modes, or even a ring with different sync modes for
> enqueue/dequeue.
>
> > If both are needed at runtime, you need to have a function pointer or
> > so in the application, and define the function in a different C file
> > by including the appropriate flavor.
>
> A big issue with function pointers here would be the DPDK multi-process
> model.
> AFAIK, rte_ring is quite a popular mechanism for IPC between DPDK apps.
> To support such a model, we'd need to split rte_ring data into 'shared'
> and 'private' parts and initialize the private one for every process
> that is going to use it.
> That sounds like a massive change, and I am not sure the required
> effort would be worth it.
> BTW, if the user just calls API functions without trying to access
> structure internals directly, I don't think it would make a big
> difference to them what is inside: an indirect function call or an
> inlined switch(...) {}.
>
> > This would also be a good time to consider the tradeoffs of the
> > heavy use of inlining that is done in rte_ring vs the impact that
> > has on API/ABI stability.
>
> Yes, hiding the rte_ring implementation inside a .c file would help a
> lot in terms of ABI maintenance and would make our future life easier.
> The question is what the price for it is in terms of performance, and
> whether we are ready to pay it. Not to mention that it would cause
> changes in many other libs/apps...
> So I think it should be a subject for a separate discussion.
> But I agree it would be good at least to measure the performance
> impact of such a change.
> If I have some spare cycles, I will give it a try.
> Meanwhile, can I ask Jerin and the other guys to repeat the tests from
> this RFC on their HW? Before continuing the discussion it would
> probably be good to know whether the suggested patch works as expected
> across different platforms.
> Thanks
> Konstantin
>
> ------------------------------
>
> Message: 5
> Date: Tue, 25 Feb 2020 13:49:38 +0000
> From: Ray Kinsella <m...@ashroe.eu>
> To: Anatoly Burakov <anatoly.bura...@intel.com>, dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID: <276c85ed-ac2f-52c9-dffc-18ce41ab0...@ashroe.eu>
> Content-Type: text/plain; charset=utf-8
>
> Hi Anatoly,
>
> On 25/02/2020 13:24, Anatoly Burakov wrote:
> > Currently, when we are creating DMA mappings for memory that is
> > either external or backed by hugepages in IOVA-as-PA mode, we assume
> > that each page is necessarily discontiguous. This may not actually
> > be the case, especially for external memory, where the user is able
> > to create their own IOVA table and make it contiguous. This is a
> > problem because VFIO has a limited number of DMA mappings, and it
> > does not appear to concatenate adjacent mappings: each mapping is
> > treated as separate, even when the mappings cover adjacent areas.
> >
> > Fix this so that we always map contiguous memory in a single chunk,
> > as opposed to mapping each segment separately.
>
> Can I confirm my understanding?
>
> We are essentially correcting errant user behavior, trading off
> startup/mapping time to save IOMMU resources?
>
> > Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> > ---
> >  lib/librte_eal/linux/eal/eal_vfio.c | 59 +++++++++++++++++++++++++----
> >  1 file changed, 51 insertions(+), 8 deletions(-)
> >
> > [snip - full patch quoted verbatim; see Message 3 above]
>
> End of dev Digest, Vol 288, Issue 27
> ************************************