On Sun, 7 Apr 2024 11:36:59 +0200
Mattias Rönnblom <hof...@lysator.liu.se> wrote:

> On 2024-04-06 17:28, Morten Brørup wrote:
> >> From: Tyler Retzlaff [mailto:roret...@linux.microsoft.com]
> >> Sent: Thursday, 4 April 2024 19.15
> >>
> >> RFC sample illustrating simple conversion of VLA to alloca().
> >>
> >> Signed-off-by: Tyler Retzlaff <roret...@linux.microsoft.com>
> >> ---  
> > 
> > [...]
> >   
> >> --- a/lib/latencystats/rte_latencystats.c
> >> +++ b/lib/latencystats/rte_latencystats.c
> >> @@ -159,7 +159,7 @@ struct latency_stats_nameoff {
> >>   {
> >>    unsigned int i, cnt = 0;
> >>    uint64_t now;
> >> -  float latency[nb_pkts];
> >> +  float *latency = alloca(sizeof(float) * nb_pkts);  
> > 
> > In cases where we are processing packet bursts, I would prefer introducing 
> > a global #define RTE_MAX_PKT_BURST_SIZE, indicating the max packet burst 
> > size supported by libraries and drivers.  
> 
> First question: what is meant by a "packet" here? An mbuf? A 
> network-layer PDU? Something that in some way relates to zero or more 
> packets, like an rte_event? Or just any object that is sent to or received 
> from some DPDK API in batches or bursts?
> 
> Second question: is RTE_MAX_PKT_BURST_SIZE meant as an upper bound, so 
> that no API can consume or produce a burst larger than this, or do all 
> APIs literally have to support that burst size?
> 
> Third question: why not just keep VLAs?
> 
> > For reference, rte_config.h already has #define RTE_GRAPH_BURST_SIZE 256.
> > 
> > Such a common define should also be used by functions such as 
> > rte_pktmbuf_free_bulk(); although it also supports segmented packets, so it 
> > must still be able to handle more mbufs.
> > https://elixir.bootlin.com/dpdk/v24.03/source/lib/mbuf/rte_mbuf.c#L486
> >   

Looking at the maths here, calc_latency can be seriously improved:
   - calc_latency is in the fast path, for transmit.
   - it is doing floating point math; floating point is much slower than
     fixed point.
   - the latency[] array is a temporary; it should be possible to compute
     total latency without it.
   - it acquires a lock; to achieve DPDK-level performance of 40 Mpps, it is
     necessary to do the absolute minimum of locking.
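
To illustrate the second and third points, here is a minimal sketch of a
single-pass update that keeps everything in integer cycle counts and needs
no latency[] array. It is not the rte_latencystats code: struct pkt_sample,
struct lat_stats and both functions are hypothetical names made up for this
example, and it assumes only min, max and total need to be tracked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for per-packet data; not the rte_mbuf layout. */
struct pkt_sample {
	uint64_t timestamp; /* cycle count at receive; 0 means not timestamped */
};

/* Hypothetical running statistics, all in integer cycle counts. */
struct lat_stats {
	uint64_t min_cycles;
	uint64_t max_cycles;
	uint64_t total_cycles;
	uint64_t samples;
};

static void
lat_stats_init(struct lat_stats *s)
{
	s->min_cycles = UINT64_MAX;
	s->max_cycles = 0;
	s->total_cycles = 0;
	s->samples = 0;
}

/*
 * Fold one burst into the statistics in a single pass.
 * No temporary array (so no VLA or alloca) and no floating point.
 */
static void
lat_stats_update(struct lat_stats *s, const struct pkt_sample *pkts,
		 size_t nb_pkts, uint64_t now)
{
	for (size_t i = 0; i < nb_pkts; i++) {
		if (pkts[i].timestamp == 0)
			continue; /* skip packets that were never timestamped */

		uint64_t lat = now - pkts[i].timestamp;

		if (lat < s->min_cycles)
			s->min_cycles = lat;
		if (lat > s->max_cycles)
			s->max_cycles = lat;
		s->total_cycles += lat;
		s->samples++;
	}
}
```

Conversion from cycles to nanoseconds, and any averaging, could then be
done by whoever reads the stats rather than per packet. If one such struct
were kept per lcore (an assumption for this sketch, not something measured
here), the update itself would not need a lock.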

