On Wed, 2 Dec 2015 16:48:17 +0000 "Dumitrescu, Cristian" <cristian.dumitrescu at intel.com> wrote:
> > > > -----Original Message----- > > From: Stephen Hemminger [mailto:stephen at networkplumber.org] > > Sent: Sunday, November 29, 2015 8:47 PM > > To: Dumitrescu, Cristian <cristian.dumitrescu at intel.com> > > Cc: dev at dpdk.org; Stephen Hemminger <stephen at networkplumber.org> > > Subject: [PATCH 3/3] rte_sched: eliminate floating point in calculating byte > > clock > > > > The old code was doing a floating point divide for each rte_dequeue() > > which is very expensive. Change to using fixed point scaled inverse > > multiply. To maintain equivalent precision, scaled math is used. > > The application ABI is the same. > > > > This improved performance from 5Gbit/sec to 10 Gbit/sec when configured > > for 10 Gbit/sec rate. > > > > There was some feedback from Cristian that he wanted a better > > solution and was going to give one, but none was provided. > > For 2.2 this is a better solution than existing code, if someone > > has a better version I would love to see it. > > > > Signed-off-by: Stephen Hemminger <stephen at networkplumber.org> > > --- > > lib/librte_sched/rte_sched.c | 23 ++++++++++++++++++----- > > 1 file changed, 18 insertions(+), 5 deletions(-) > > > > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c > > index 16acd6b..cfae136 100644 > > --- a/lib/librte_sched/rte_sched.c > > +++ b/lib/librte_sched/rte_sched.c > > @@ -47,6 +47,7 @@ > > #include "rte_bitmap.h" > > #include "rte_sched_common.h" > > #include "rte_approx.h" > > +#include "rte_reciprocal.h" > > > > #ifdef __INTEL_COMPILER > > #pragma warning(disable:2259) /* conversion may lose significant bits */ > > @@ -62,6 +63,11 @@ > > #define RTE_SCHED_PIPE_INVALID UINT32_MAX > > #define RTE_SCHED_BMP_POS_INVALID UINT32_MAX > > > > +/* Scaling for cycles_per_byte calculation > > + * Chosen so that minimum rate is 480 bit/sec > > + */ > > +#define RTE_SCHED_TIME_SHIFT 8 > > Stephen, can you please elaborate why we need to shift the dividend at all > and why the shift value was picked as 8? Is 8 a hard constraint? How does > this affect the scheduling precision/accuracy? The shift value is a tradeoff for scaled math. The bigger the shift the finer the resolution, but at the risk of overflow in the cycles_per_byte. The value was chosen as a tradeoff based on current CPU clock rate (TSC) and minimum rate. > > + > > struct rte_sched_subport { > > /* Token bucket (TB) */ > > uint64_t tb_time; /* time of last update */ > > @@ -215,7 +221,7 @@ struct rte_sched_port { > > uint64_t time_cpu_cycles; /* Current CPU time measured in CPU > > cyles */ > > uint64_t time_cpu_bytes; /* Current CPU time measured in bytes > > */ > > uint64_t time; /* Current NIC TX time measured in bytes > > */ > > - double cycles_per_byte; /* CPU cycles per byte */ > > + struct rte_reciprocal inv_cycles_per_byte; /* CPU cycles per byte */ > > > > /* Scheduling loop detection */ > > uint32_t pipe_loop; > > @@ -610,7 +616,7 @@ struct rte_sched_port * > > rte_sched_port_config(struct rte_sched_port_params *params) > > { > > struct rte_sched_port *port = NULL; > > - uint32_t mem_size, bmp_mem_size, n_queues_per_port, i; > > + uint32_t mem_size, bmp_mem_size, n_queues_per_port, i, > > cycles_per_byte; > > > > /* Check user parameters. Determine the amount of memory to > > allocate */ > > mem_size = rte_sched_port_get_memory_footprint(params); > > @@ -661,7 +667,10 @@ rte_sched_port_config(struct > > rte_sched_port_params *params) > > port->time_cpu_cycles = rte_get_tsc_cycles(); > > port->time_cpu_bytes = 0; > > port->time = 0; > > - port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double) > > params->rate); > > + > > + cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) > > + / params->rate; > > + port->inv_cycles_per_byte = rte_reciprocal_value(cycles_per_byte); > > > > /* Scheduling loop detection */ > > port->pipe_loop = RTE_SCHED_PIPE_INVALID; > > @@ -2088,11 +2097,15 @@ rte_sched_port_time_resync(struct > > rte_sched_port *port) > > { > > uint64_t cycles = rte_get_tsc_cycles(); > > uint64_t cycles_diff = cycles - port->time_cpu_cycles; > > - double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte; > > + uint64_t bytes_diff; > > + > > + /* Compute elapsed time in bytes */ > > + bytes_diff = rte_reciprocal_divide(cycles_diff << > > RTE_SCHED_TIME_SHIFT, > > + port->inv_cycles_per_byte); > > > > /* Advance port time */ > > port->time_cpu_cycles = cycles; > > - port->time_cpu_bytes += (uint64_t) bytes_diff; > > + port->time_cpu_bytes += bytes_diff; > > if (port->time < port->time_cpu_bytes) > > port->time = port->time_cpu_bytes; > > > > -- > > 2.1.4 > > Can you provide some insight into how you tested this code and the test > vectors you used? We tested with 10 gbit link and range of rates from 10k bit up to 10 gbit.