[PATCH] vfio: map contiguous areas in one go
Tested-by: xix.zh...@intel.com
Thanks

> -----Original Message-----
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of dev-requ...@dpdk.org
> Sent: February 25, 2020 21:50
> To: dev@dpdk.org
> Subject: dev Digest, Vol 288, Issue 27
> 
> Send dev mailing list submissions to
>       dev@dpdk.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>       https://mails.dpdk.org/listinfo/dev
> or, via email, send a message with subject or body 'help' to
>       dev-requ...@dpdk.org
> 
> You can reach the person managing the list at
>       dev-ow...@dpdk.org
> 
> When replying, please edit your Subject line so it is more specific than "Re:
> Contents of dev digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: [PATCH] doc/mlx5: update mlx5 guide (Thomas Monjalon)
>    2. Re: [PATCH] doc: describe the pktmbuf pool with pinned
>       external memory (Thomas Monjalon)
>    3. [PATCH] vfio: map contiguous areas in one go (Anatoly Burakov)
>    4. Re: [RFC 0/6] New sync modes for ring (Ananyev, Konstantin)
>    5. Re: [PATCH] vfio: map contiguous areas in one go (Ray Kinsella)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 25 Feb 2020 14:19:10 +0100
> From: Thomas Monjalon <tho...@monjalon.net>
> To: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] doc/mlx5: update mlx5 guide
> Message-ID: <2125420.IFkqi6BYcA@xps>
> Content-Type: text/plain; charset="us-ascii"
> 
> 24/02/2020 18:57, Viacheslav Ovsiienko:
> > - metadata limitation is described
> > - no inline hint flag is described
> >
> > Signed-off-by: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> > ---
> 
> Split in 2 patches and applied, thanks
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 25 Feb 2020 14:22:13 +0100
> From: Thomas Monjalon <tho...@monjalon.net>
> To: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> Cc: dev@dpdk.org, olivier.m...@6wind.com
> Subject: Re: [dpdk-dev] [PATCH] doc: describe the pktmbuf pool with
>       pinned external memory
> Message-ID: <13189717.lVVuGzaMjS@xps>
> Content-Type: text/plain; charset="us-ascii"
> 
> 24/02/2020 18:58, Viacheslav Ovsiienko:
> > Document the new mbuf pools with external pinned buffers.
> >
> > Signed-off-by: Viacheslav Ovsiienko <viachesl...@mellanox.com>
> > ---
> >  doc/guides/prog_guide/mbuf_lib.rst     | 34 ++++++++++++++++++++++++++++++++++
> 
> Please submit this documentation separately for review by Olivier.
> 
> >  doc/guides/rel_notes/release_20_02.rst |  5 +++++
> 
> Applying only the release notes in 20.02, thanks
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 25 Feb 2020 13:24:48 +0000
> From: Anatoly Burakov <anatoly.bura...@intel.com>
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID:
>       <c6550c8e66dbc7a54a0b495ecda58b0eb79c07ca.1582637079.git.anatoly.bura...@intel.com>
> 
> 
> Currently, when we are creating DMA mappings for memory that's either 
> external or is backed by hugepages in IOVA as PA mode, we assume that 
> each page is necessarily discontiguous. This may not actually be the 
> case, especially for external memory, where the user is able to create 
> their own IOVA table and make it contiguous. This is a problem because 
> VFIO has a limited number of DMA mappings, and it does not appear to 
> concatenate them and treats each mapping as separate, even when they 
> cover adjacent areas.
> 
> Fix this so that we always map contiguous memory in a single chunk, as 
> opposed to mapping each segment separately.
> 
> Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
> ---
>  lib/librte_eal/linux/eal/eal_vfio.c | 59 +++++++++++++++++++++++++----
>  1 file changed, 51 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
> index 01b5ef3f42..4502aefed3 100644
> --- a/lib/librte_eal/linux/eal/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal/eal_vfio.c
> @@ -514,9 +514,11 @@ static void
>  vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>                  void *arg __rte_unused)
>  {
> +        rte_iova_t iova_start, iova_expected;
>          struct rte_memseg_list *msl;
>          struct rte_memseg *ms;
>          size_t cur_len = 0;
> +        uint64_t va_start;
> 
>          msl = rte_mem_virt2memseg_list(addr);
> 
> @@ -545,22 +547,63 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>  #endif
>          /* memsegs are contiguous in memory */
>          ms = rte_mem_virt2memseg(addr, msl);
> +
> +        /*
> +         * This memory is not guaranteed to be contiguous, but it still could
> +         * be, or it could have some small contiguous chunks. Since the number
> +         * of VFIO mappings is limited, and VFIO appears to not concatenate
> +         * adjacent mappings, we have to do this ourselves.
> +         *
> +         * So, find contiguous chunks, then map them.
> +         */
> +        va_start = ms->addr_64;
> +        iova_start = iova_expected = ms->iova;
>          while (cur_len < len) {
> +                bool new_contig_area = ms->iova != iova_expected;
> +                bool last_seg = (len - cur_len) == ms->len;
> +                bool skip_last = false;
> +
> +                /* only do mappings when current contiguous area ends */
> +                if (new_contig_area) {
> +                        if (type == RTE_MEM_EVENT_ALLOC)
> +                                vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                                iova_start,
> +                                                iova_expected - iova_start, 1);
> +                        else
> +                                vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                                iova_start,
> +                                                iova_expected - iova_start, 0);
> +                        va_start = ms->addr_64;
> +                        iova_start = ms->iova;
> +                }
>                  /* some memory segments may have invalid IOVA */
>                  if (ms->iova == RTE_BAD_IOVA) {
>                          RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
>                                          ms->addr);
> -                        goto next;
> +                        skip_last = true;
>                  }
> -                if (type == RTE_MEM_EVENT_ALLOC)
> -                        vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -                                        ms->iova, ms->len, 1);
> -                else
> -                        vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -                                        ms->iova, ms->len, 0);
> -next:
> +                iova_expected = ms->iova + ms->len;
>                  cur_len += ms->len;
>                  ++ms;
> +
> +                /*
> +                 * don't count previous segment, and don't attempt to
> +                 * dereference a potentially invalid pointer.
> +                 */
> +                if (skip_last && !last_seg) {
> +                        iova_expected = iova_start = ms->iova;
> +                        va_start = ms->addr_64;
> +                } else if (!skip_last && last_seg) {
> +                        /* this is the last segment and we're not skipping */
> +                        if (type == RTE_MEM_EVENT_ALLOC)
> +                                vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                                iova_start,
> +                                                iova_expected - iova_start, 1);
> +                        else
> +                                vfio_dma_mem_map(default_vfio_cfg, va_start,
> +                                                iova_start,
> +                                                iova_expected - iova_start, 0);
> +                }
>          }
>  #ifdef RTE_ARCH_PPC_64
>          cur_len = 0;
> --
> 2.17.1
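
[Tester's note, not part of the patch: below is a minimal sketch of the external-memory scenario the commit message describes, where the application builds its own contiguous IOVA table so that the whole area can end up as a single VFIO DMA mapping. The heap name, sizes and error handling here are illustrative assumptions only, not code from this patch.]

#include <stdlib.h>
#include <rte_common.h>
#include <rte_malloc.h>

/* Illustrative sketch: register externally allocated memory with a
 * deliberately contiguous IOVA table. The mem event callback patched
 * above should then be able to coalesce all pages into one DMA mapping
 * instead of one mapping per page. */
static int
add_contig_extmem(void *va, size_t len, size_t pgsz, rte_iova_t base_iova)
{
        unsigned int n_pages = len / pgsz;
        rte_iova_t *iovas = malloc(n_pages * sizeof(*iovas));
        unsigned int i;

        if (iovas == NULL)
                return -1;
        /* make the IOVA table contiguous on purpose */
        for (i = 0; i < n_pages; i++)
                iovas[i] = base_iova + i * pgsz;

        if (rte_malloc_heap_create("ext_heap") != 0 ||
                        rte_malloc_heap_memory_add("ext_heap", va, len,
                                        iovas, n_pages, pgsz) != 0) {
                free(iovas);
                return -1;
        }
        free(iovas);
        return 0;
}

With per-page mappings such an area would consume n_pages VFIO DMA mapping slots; after this patch it should consume only one.
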
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 25 Feb 2020 13:41:14 +0000
> From: "Ananyev, Konstantin" <konstantin.anan...@intel.com>
> To: Stephen Hemminger <step...@networkplumber.org>, Jerin Jacob
>       <jerinjac...@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>, Olivier Matz <olivier.m...@6wind.com>,
>       "d...@linux.vnet.ibm.com" <d...@linux.vnet.ibm.com>
> Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> Message-ID:
>       <SN6PR11MB255845FB6665C2A2D5D76D229AED0@SN6PR11MB2558.namprd11.prod.outlook.com>
> 
> Content-Type: text/plain; charset="us-ascii"
> 
> > > > > Upfront note - that RFC is not a complete patch.
> > > > > It introduces an ABI breakage, plus it doesn't update 
> > > > > ring_elem code properly, etc.
> > > > > I plan to deal with all these things in later versions.
> > > > > Right now I seek an initial feedback about proposed ideas.
> > > > > Would also ask people to repeat performance tests (see below) 
> > > > > on their platforms to confirm the impact.
> > > > >
> > > > > More and more customers use (or try to use) DPDK-based apps
> > > > > within overcommitted systems (multiple active threads over the
> > > > > same physical cores): VM, container deployments, etc.
> > > > > One quite common problem they hit: Lock-Holder-Preemption with
> > > > > rte_ring.
> > > > > LHP is quite a common problem for spin-based sync primitives
> > > > > (spin-locks, etc.) on overcommitted systems.
> > > > > The situation gets much worse when some sort of fair-locking
> > > > > technique is used (ticket-lock, etc.), as then not only the
> > > > > lock-owner's but also the lock-waiters' scheduling order matters
> > > > > a lot.
> > > > > This is a well-known problem for kernels running within VMs:
> > > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > > The problem with rte_ring is that while head acquisition is a sort
> > > > > of un-fair locking, waiting on the tail is very similar to a
> > > > > ticket-lock scheme - the tail has to be updated in a particular
> > > > > order.
> > > > > That makes the current rte_ring implementation perform really
> > > > > poorly in some overcommitted scenarios.
> > > >
> > > > Rather than reform rte_ring to fit this scenario, it would make 
> > > > more sense to me to introduce another primitive.
> 
> I don't see many advantages it would bring us.
> As for disadvantages: for developers and maintainers - code duplication;
> for end users - extra code churn and the removed ability to mix and match
> different sync modes in one ring.
> 
> > > > The current lockless ring performs very well for the isolated
> > > > thread model that DPDK was built around. This looks like a case of
> > > > customers violating the usage model of the DPDK and then being
> > > > surprised at the fallout.
> 
> For customers using the isolated thread model - nothing should change
> (both in terms of API and performance).
> The existing sync modes MP/MC and SP/SC are kept untouched, set up in the
> same way (via flags and _init_), and MP/MC remains the default one.
> On the other hand, I don't see why we should ignore customers that want
> to use their DPDK apps in different deployment scenarios.
> 
> > >
> > > I agree with Stephen here.
> > >
> > > I think adding more runtime checks in the enqueue() and dequeue()
> > > will have a bad effect on the low-end cores too.
> 
> We do have a run-time check in our current enqueue()/dequeue()
> implementation.
> In fact we support both modes: we have the generic
> rte_ring_enqueue(/dequeue)_bulk(/burst),
> where sync behaviour is determined at runtime by the value of
> prod(/cons).single,
> or the user can call the rte_ring_(mp/sp)_enqueue_* functions directly.
> This RFC follows exactly the same paradigm:
> rte_ring_enqueue(/dequeue)_bulk(/burst) is kept generic and its
> behaviour is determined at runtime, by the value of prod(/cons).sync_type,
> or the user can call enqueue/dequeue with a particular sync mode directly:
> rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
> The only thing that changed: the format of prod/cons can now differ
> depending on the mode selected at _init_.
> So you can't create a ring for, let's say, SP mode and then in the middle
> of the data path change your mind and start using MP_RTS mode.
> For the existing modes (SP/MP, SC/MC) the format remains the same and the
> user can still use them interchangeably, though of course that is an
> error-prone practice.
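
[Tester's note: a small sketch of the pattern Konstantin describes, using only the existing ring API. The ring name and size are made up, and the rts/hts flavours mentioned in the comment are the RFC's proposal, not part of the current API.]

#include <rte_lcore.h>
#include <rte_ring.h>

static void
ring_sync_mode_sketch(void)
{
        void *objs[32] = { NULL };
        struct rte_ring *r;
        unsigned int n;

        /* sync mode is fixed at _init_: SP/SC here, 0 gives the default MP/MC */
        r = rte_ring_create("sketch_ring", 1024, rte_socket_id(),
                        RING_F_SP_ENQ | RING_F_SC_DEQ);
        if (r == NULL)
                return;

        /* generic call: sync behaviour is resolved at runtime from prod/cons */
        n = rte_ring_enqueue_burst(r, objs, 32, NULL);

        /* or call a particular flavour directly; the RFC keeps this same
         * shape for the proposed rts/hts modes as well */
        n = rte_ring_sp_enqueue_burst(r, objs, 32, NULL);
        (void)n;

        rte_ring_free(r);
}
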
> 
> > > But I agree with the problem statement that in the virtualization
> > > use case, it may be possible to have N virtual cores running on a
> > > physical core.
> > >
> > > IMO, the best solution would be keeping the ring API the same and
> > > having a different flavor at compile time, something like liburcu did
> > > for accommodating different flavors.
> > >
> > > i.e. urcu-qsbr.h and urcu-bp.h have identical definitions of the API.
> > > The application can simply include ONE header file in a C file based
> > > on the flavor.
> 
> I don't think that is a flexible enough approach.
> In one app the user might need several rings with different sync modes,
> or the user might even need a ring with different sync modes for
> enqueue and dequeue.
> 
> > > If both are needed at runtime, have a function pointer or so in
> > > the application and define the function in a different C file by
> > > including the appropriate flavor in that C file.
> 
> A big issue with function pointers here would be the DPDK MP model.
> AFAIK, rte_ring is quite a popular mechanism for IPC between DPDK apps.
> To support such a model, we'd need to split rte_ring data into 'shared'
> and 'private' parts and initialize the private one in every process that
> is going to use it.
> That sounds like a massive change, and I am not sure the required effort
> would be worth it.
> BTW, if the user just calls API functions without trying to access
> structure internals directly, I don't think it would make a big
> difference to them what is inside:
> an indirect function call or an inlined switch(...) {}.
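
[Tester's note: a hand-made sketch, not existing rte_ring code, of the shared/private split being discussed. The point is that a function pointer written by the primary process is not valid in a secondary process, so each process would have to re-derive its own 'private' ops from plain data kept in the 'shared' part.]

/* sketch only - names and layout are made up for illustration */
enum sync_type { SYNC_SP, SYNC_MP };

struct ring_shared {                    /* lives in shared (hugepage) memory */
        enum sync_type prod_sync;       /* plain data: valid in every process */
};

struct ring_private {                   /* one copy per attaching process */
        int (*enqueue)(void *ring, void * const *objs, unsigned int n);
};

static int enqueue_sp(void *ring, void * const *objs, unsigned int n)
{ (void)ring; (void)objs; return (int)n; }
static int enqueue_mp(void *ring, void * const *objs, unsigned int n)
{ (void)ring; (void)objs; return (int)n; }

/* a code pointer taken in the primary is meaningless in a secondary
 * (different address-space layout), so every process rebuilds its own
 * ops table from the shared sync_type */
static void
ring_private_init(const struct ring_shared *shr, struct ring_private *prv)
{
        prv->enqueue = (shr->prod_sync == SYNC_SP) ? enqueue_sp : enqueue_mp;
}
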
> 
> > This would also be a good time to consider the tradeoffs of the 
> > heavy use of inlining that is done in rte_ring vs the impact that 
> > has on API/ABI stability.
> 
> Yes, hiding the rte_ring implementation inside a .c file would help a lot
> in terms of ABI maintenance and would make our future life easier.
> The question is what the price for it is in terms of performance, and
> whether we are ready to pay it. Not to mention that it would cause changes
> in many other libs/apps...
> So I think it should be a subject for a separate discussion.
> But I agree it would be good at least to measure the performance impact
> of such a change.
> If I have some spare cycles, I will give it a try.
> Meanwhile, can I ask Jerin and the other guys to repeat the tests from
> this RFC on their HW? Before continuing the discussion it would probably
> be good to know whether the suggested patch works as expected across
> different platforms.
> Thanks
> Konstantin
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Tue, 25 Feb 2020 13:49:38 +0000
> From: Ray Kinsella <m...@ashroe.eu>
> To: Anatoly Burakov <anatoly.bura...@intel.com>, dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID: <276c85ed-ac2f-52c9-dffc-18ce41ab0...@ashroe.eu>
> Content-Type: text/plain; charset=utf-8
> 
> Hi Anatoly,
> 
> On 25/02/2020 13:24, Anatoly Burakov wrote:
> > Currently, when we are creating DMA mappings for memory that's 
> > either external or is backed by hugepages in IOVA as PA mode, we 
> > assume that each page is necessarily discontiguous. This may not 
> > actually be the case, especially for external memory, where the user 
> > is able to create their own IOVA table and make it contiguous. This 
> > is a problem because VFIO has a limited number of DMA mappings, and 
> > it does not appear to concatenate them and treats each mapping as 
> > separate, even when they cover adjacent areas.
> >
> > Fix this so that we always map contiguous memory in a single
> > chunk, as opposed to mapping each segment separately.
> 
> Can I confirm my understanding?
> 
> We are essentially correcting errant user behavior, trading off
> startup/mapping time to save IOMMU resources?
> 
> 
> 
> End of dev Digest, Vol 288, Issue 27
> ************************************
