On Mon, 23 May 2022 03:01:20 +0000
Spike Du <spi...@nvidia.com> wrote:

> Hi, pls see below.
> 
> > -----Original Message-----
> > From: Stephen Hemminger <step...@networkplumber.org>
> > Sent: Sunday, May 22, 2022 11:23 PM
> > To: Spike Du <spi...@nvidia.com>
> > Cc: Matan Azrad <ma...@nvidia.com>; Slava Ovsiienko
> > <viachesl...@nvidia.com>; Ori Kam <or...@nvidia.com>; NBU-Contact-
> > Thomas Monjalon (EXTERNAL) <tho...@monjalon.net>; dev@dpdk.org;
> > Raslan Darawsheh <rasl...@nvidia.com>
> > Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> > 
> > On Sun, 22 May 2022 08:58:56 +0300
> > Spike Du <spi...@nvidia.com> wrote:
> >   
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > index 04cff8ee10..687ae5ff29 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > >        */
> > >       union rte_eth_rxseg *rx_seg;
> > >
> > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > +     /**
> > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > +      */
> > > +     uint8_t lwm;
> > > +
> > > +     uint8_t reserved_bits[3];
> > > +     uint32_t reserved_32s;
> > > +     uint64_t reserved_64s;
> > >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> > >  };
> > >  
> > 
> > OK, but there is an ABI risk here because the reserved fields were never
> > required to be zero before.
> > Whenever a reserved field is put to use, the code (in this case
> > rte_ethdev_configure) should check that it is zero.
> > 
> > Best practice would have been to have the code require all reserved fields
> > to be 0 in earlier releases. In this case an application is likely to define
> > a watermark of zero; how will your code handle it?
> A watermark of 0 is intended; it is the default. An LWM of 0 means the Rx
> queue's watermark is not monitored, hence no LWM event is generated.
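
The two points above (reserved fields should be checked for zero, and an LWM of 0 means "not monitored") can be sketched together. This is only an illustration: struct rxconf_tail and rxconf_tail_validate() are hypothetical names that copy just the new fields from the patch, and rejecting values above 100 is an assumption, not something the RFC states.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical cut-down struct: only the fields the patch adds. */
struct rxconf_tail {
    uint8_t lwm;               /* percentage of Rx queue size, 0 = not monitored */
    uint8_t reserved_bits[3];
    uint32_t reserved_32s;
    uint64_t reserved_64s;
};

/* Return 0 if the configuration is acceptable, -EINVAL otherwise. */
static int rxconf_tail_validate(const struct rxconf_tail *c)
{
    /* Require reserved fields to be zero so they can be reused later. */
    if (c->reserved_bits[0] || c->reserved_bits[1] || c->reserved_bits[2] ||
        c->reserved_32s || c->reserved_64s)
        return -EINVAL;
    /* Assumption: a percentage above 100 is treated as invalid. */
    if (c->lwm > 100)
        return -EINVAL;
    /* lwm == 0 is valid: the watermark is simply not monitored. */
    return 0;
}
```

A zero-initialized struct passes this check, which matches the "0 is the default" behavior described above.
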
> > 
> > Also, using 8 bits as a percentage is different from how other APIs handle
> > this.
> > Since Rx queue size is in packets, why is this not in packets?
> The short answer is to simplify the LWM configuration.
> Rx queue descriptors are complex nowadays.
> For a normal queue, the user can easily configure the LWM according to the
> number of queue descriptors.
> But for the queues below, it is not easy.
> Take MPRQ as an example. The testpmd command options can be " -a 
> 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,mprq_log_stride_size=8,mprq_log_stride_num=3
> -- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ".
> In the MLX5 implementation, the minimum "unit" in the queue holds 64
> descriptors, and the number of "units" is 16. If you configure according to
> the descriptor count (1024) here, you might set the LWM to something like
> 512, but the HW doesn't allow it, because 512 > 16. If you want the
> watermark to be half, the correct value is 8.
> The same issue affects features like "Rx queue buffer split", where a packet
> can be split across multiple descriptors.
> Using a percentage doesn't have such issues; the PMD will cover all the
> details.
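
The arithmetic in the MPRQ example can be written out as a small helper. This is a minimal sketch, not mlx5 code: lwm_percent_to_units() is a hypothetical name, and truncating the result toward zero is an assumption about rounding.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a watermark percentage into a count of
 * hardware "units" for a queue that groups descriptors into fixed-size units.
 */
static uint32_t lwm_percent_to_units(uint32_t nb_desc, uint32_t desc_per_unit,
                                     uint8_t lwm_percent)
{
    uint32_t nb_units;

    if (desc_per_unit == 0)
        return 0;                           /* avoid dividing by zero */
    nb_units = nb_desc / desc_per_unit;     /* e.g. 1024 / 64 = 16 units */
    return nb_units * lwm_percent / 100;    /* e.g. 16 * 50 / 100 = 8 */
}
```

With the numbers from the example (1024 descriptors, 64 descriptors per unit, 50%), the helper returns 8, the value the hardware accepts, whereas a descriptor-based 512 would exceed the 16 available units.
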
> 
> > Also document what the behavior of 0 is.
> Sure. The behavior is the same as before this feature existed, pls see above.
> 
> > Why introduce new query/set operations? This should just be part of the
> > overall device configuration.  
> Because implementations differ. LWM can be a dynamic configuration, which
> can help the user design flexible flow control.
> A user may be fine with an LWM of 80% to get high throughput, and later
> switch to 50% to throttle traffic more responsively by handling the LWM
> event, in order to reduce drops.
> Some drivers, like mlx5, may implement the LWM event as one-shot. When you
> receive an LWM event, you need to reconfigure the LWM in order to receive
> the event again, so you are unlikely to be overwhelmed by events.
> All of these require a set operation.
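
The one-shot behavior described above can be modeled in a few lines. This is a simplified sketch, not the mlx5 driver or the proposed ethdev API: struct lwm_state, lwm_set() and lwm_check() are hypothetical names, and the fill level is passed in as a percentage purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a one-shot watermark on a single Rx queue. */
struct lwm_state {
    uint8_t percent;   /* 0 means the watermark is not monitored */
    bool armed;        /* cleared once the event has fired */
};

/* Set the watermark; calling this again re-arms the event. */
static void lwm_set(struct lwm_state *s, uint8_t percent)
{
    s->percent = percent;
    s->armed = (percent != 0);
}

/* Return true if an LWM event should be raised for this fill level. */
static bool lwm_check(struct lwm_state *s, uint8_t fill_percent)
{
    if (!s->armed || fill_percent <= s->percent)
        return false;
    s->armed = false;  /* one-shot: stay quiet until lwm_set() is called */
    return true;
}
```

In this model an application that sets 80%, receives one event, and then calls lwm_set() with 50% gets exactly one further event, which is why a set operation separate from the initial queue configuration is needed.
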
> 
> As for the query operation: the rte_event API rte_eth_dev_callback_process()
> is a per-port API, so it doesn't carry much information when an event happens.
> When an LWM event happens, we need to know in which Rx queue it happened and,
> optionally, what the current LWM percentage of that queue is.
> The query operation serves this purpose.
> 
> 
> Regards,
> Spike.
> 
> 

The bigger question is why does this have to be just MLX5 and why
can't it fit into the existing DPDK RX interrupt framework?

Linux and BSD have had this for years in their packet coalescing logic.
Ethtool provides the ability to set many IRQ coalescing parameters, such as:

       ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
              [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
              [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
              [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
              [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
              [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
              [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
              [cqe-mode-rx on|off] [cqe-mode-tx on|off]

It feels like this is just the DPDK version of a small subset of that.
Since many devices already support IRQ coalescing, it would be best to build
one new API that covers most of these, rather than an MLX/Nvidia-only API for
a single parameter.
