Re: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler

2016-12-07 Thread Alan Robertson
Hi Cristian,

Looking at points 10 and 11 it's good to hear nodes can be dynamically added.

We've been trying to decide the best way to do this to support qos on tunnels
for some time now, and the existing implementation doesn't allow it, which
effectively ruled out hierarchical queueing for tunnel targets on the output
interface.

Having said that, has thought been given to separating the queueing from being
so closely tied to the Ethernet transmit process? When queueing on a tunnel,
for example, we may be working with encryption. When running with an
anti-replay window it is really much better to do the QOS (packet reordering)
before the encryption. To support this, would it be possible to have a separate
scheduler structure which can be passed into the scheduling API? This means the
calling code can hang the structure off whatever entity it wishes to perform
qos on, and we get dynamic target support (sessions/tunnels etc).
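
To make the idea a bit more concrete, here is a rough sketch of the kind of API
shape I have in mind. The names and prototypes below are purely illustrative,
not a proposal for actual DPDK symbols:

    #include <stdint.h>
    #include <rte_mbuf.h>

    /* A scheduler instance owned by the caller rather than by an ethdev, so
     * it can be hung off any target (tunnel, session, ...). Hypothetical. */
    struct qos_sched;
    struct qos_sched_params;

    struct qos_sched *qos_sched_create(const struct qos_sched_params *params);
    void qos_sched_free(struct qos_sched *sched);

    /* Same enqueue/dequeue split as librte_sched has today, but operating on
     * the caller-owned instance instead of a port-wide one. */
    int qos_sched_enqueue(struct qos_sched *sched, struct rte_mbuf **pkts,
            uint32_t n_pkts);
    int qos_sched_dequeue(struct qos_sched *sched, struct rte_mbuf **pkts,
            uint32_t n_pkts);

The calling code would then store the qos_sched pointer on the tunnel/session
object itself, which is what gives the dynamic target support mentioned above.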

Regarding the structure allocation, would it be possible to make the number of
queues associated with a TC a compile-time option which the scheduler would
accommodate? We frequently only use one queue per tc, which means 75% of the
space allocated at the queueing layer for that tc is never used. This may be
specific to our implementation, but if other implementations do the same it
would be good to hear from folks so we can get a better idea of whether this
is a common case.
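
For illustration, the sort of build-time knob I mean, assuming today's fixed
layout of 4 queues per TC (the macro names below are made up, not existing
DPDK symbols):

    /* Hypothetical compile-time option; today the per-TC queue count is
     * fixed at 4 in librte_sched. */
    #ifndef QOS_QUEUES_PER_TC
    #define QOS_QUEUES_PER_TC 1   /* we only ever use one queue per tc */
    #endif

    #define QOS_TRAFFIC_CLASSES_PER_PIPE 4
    #define QOS_QUEUES_PER_PIPE \
            (QOS_TRAFFIC_CLASSES_PER_PIPE * QOS_QUEUES_PER_TC)

With 4 queues per TC a pipe carries 16 queue structures; with 1 queue per TC it
carries 4, which is where the 75% figure above comes from.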

Whilst touching on the scheduler, the token replenishment works using a
division and a multiplication, obviously to cater for the fact that it may be
run after several tc windows have passed. The most commonly used industrial
scheduler simply checks that the tc interval has elapsed and then adds the bc.
This relies on the scheduler being called within the tc window though. It
would be nice to have this as a configurable option since it's much more
efficient, assuming the infra code from which it's called can guarantee the
calling frequency.
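
A minimal sketch of the two replenishment styles, using a simplified
token-bucket state (field names are illustrative, only loosely based on
librte_sched):

    #include <stdint.h>

    struct tb {                      /* simplified token bucket */
        uint64_t time;               /* time of last refill */
        uint64_t period;             /* tc window length */
        uint64_t credits_per_period; /* bc: credits added per window */
        uint64_t size;               /* bucket depth */
        uint64_t credits;            /* current credits */
    };

    /* Style 1 (current): catch up over any number of elapsed windows,
     * at the cost of a division and a multiplication per refill. */
    static void refill_divmul(struct tb *tb, uint64_t now)
    {
        uint64_t n_periods = (now - tb->time) / tb->period;

        tb->credits += n_periods * tb->credits_per_period;
        if (tb->credits > tb->size)
            tb->credits = tb->size;
        tb->time += n_periods * tb->period;
    }

    /* Style 2 (proposed option): assume we are called at least once per tc
     * window, so a lapse check plus a single addition of bc is enough. */
    static void refill_lapse(struct tb *tb, uint64_t now)
    {
        if (now - tb->time < tb->period)
            return;

        tb->credits += tb->credits_per_period;   /* add bc */
        if (tb->credits > tb->size)
            tb->credits = tb->size;
        tb->time = now;
    }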

I hope you'll consider these points for inclusion into a future road map.
Hopefully in the future my employer will increase the priority of some of the
tasks and a PR may appear on the mailing list.

Thanks,
Alan.

Subject: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler
Date: Wed, 30 Nov 2016 18:16:50 +
From: Cristian Dumitrescu 
To: dev@dpdk.org
CC: cristian.dumitre...@intel.com

This RFC proposes an ethdev-based abstraction layer for Quality of Service
(QoS) hierarchical scheduler. The goal of the abstraction layer is to provide
a simple generic API that is agnostic of the underlying HW, SW or mixed HW-SW
complex implementation.

Q1: What is the benefit of having an abstraction layer for the QoS
hierarchical scheduler?
A1: There is growing interest in the industry for handling various HW-based,
SW-based or mixed hierarchical scheduler implementations using a unified DPDK
API.

Q2: Which devices are targeted by this abstraction layer?
A2: All current and future devices that expose a hierarchical scheduler
feature under DPDK, including NICs, FPGAs, ASICs, SoCs and SW libraries.

Q3: Which scheduler hierarchies are supported by the API?
A3: Hopefully any scheduler hierarchy can be described and covered by the
current API. Of course, functional correctness, accuracy and performance
levels depend on the specific implementations of this API.

Q4: Why have this abstraction layer in ethdev as opposed to a new type of
device (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc.?
A4: Packets are sent to the Ethernet device using the ethdev API
rte_eth_tx_burst() function, with the hierarchical scheduling taking place
automatically (i.e. no SW intervention) in HW implementations. Basically, the
hierarchical scheduling is done as part of the packet TX operation.
The hierarchical scheduler is typically the last stage before packet TX and it
is tightly integrated with the TX stage. The hierarchical scheduler is just
another offload feature of the Ethernet device, which needs to be accommodated
by the ethdev API similar to any other offload feature (such as RSS, DCB, flow
director, etc.).
Once the decision to schedule a specific packet has been taken, this packet
cannot be dropped and it has to be sent over the wire as is, otherwise what
takes place on the wire is not what was planned at scheduling time, so the
scheduling is not accurate. (Note: there are some devices which allow
prepending headers to the packet after the scheduling stage at the expense of
sending correction requests back to the scheduler, but this only strengthens
the bond between scheduling and TX.)
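
From the application's point of view this means the TX code does not change
when the scheduler offload is enabled; a minimal sketch (illustration only,
assuming a port with the hierarchy already configured):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    static void
    tx_with_offloaded_scheduler(uint16_t port_id, uint16_t queue_id,
            struct rte_mbuf **pkts, uint16_t n_pkts)
    {
        uint16_t sent = 0;

        /* No explicit scheduling call: the hierarchy configured on the
         * ethdev decides the order and timing in which these packets
         * actually hit the wire. */
        while (sent < n_pkts)
            sent += rte_eth_tx_burst(port_id, queue_id,
                    pkts + sent, n_pkts - sent);
    }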



Q5: Given that the packet scheduling takes place automatically for pure HW
implementations, how does packet scheduling take place for poll-mode SW
implementations?
A5: The API provided function rte_sched_run() is designed to take care of
this. For HW implementations, t

Re: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler

2016-12-08 Thread Alan Robertson
Hi Cristian,

The way qos works just now should be feasible for dynamic targets. That is,
functions similar to rte_sched_port_enqueue() and rte_sched_port_dequeue()
would be called: the first to enqueue the mbufs onto the queues, the second to
dequeue them. The qos structures and scheduler don't need to be as
functionally rich though. I would have thought a simple pipe with child nodes
should suffice for most. That would allow each tunnel/session to be shaped,
with the queueing and drop logic inherited from what is there just now.
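
For reference, the pattern today against a port-wide scheduler looks roughly
like this (a sketch only; rte_sched_port_enqueue()/rte_sched_port_dequeue()
are the existing librte_sched calls, the rest is made-up glue):

    #include <rte_sched.h>
    #include <rte_mbuf.h>

    #define BURST 32

    static void
    qos_pass(struct rte_sched_port *port, struct rte_mbuf **pkts,
            uint32_t n_pkts)
    {
        struct rte_mbuf *out[BURST];
        int n;

        /* Packets must already carry their subport/pipe/tc/queue
         * classification (rte_sched_port_pkt_write()) before enqueue. */
        rte_sched_port_enqueue(port, pkts, n_pkts);
        n = rte_sched_port_dequeue(port, out, BURST);

        /* ... hand the n dequeued packets to the next stage ... */
        (void)n;
    }

The idea above is the same enqueue/dequeue split, just against a much smaller
per-tunnel structure instead of a whole rte_sched_port.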

Thanks,
Alan.

-Original Message-
From: Dumitrescu, Cristian [mailto:cristian.dumitre...@intel.com] 
Sent: Wednesday, December 07, 2016 7:52 PM
To: Alan Robertson
Cc: dev@dpdk.org; Thomas Monjalon
Subject: RE: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical 
scheduler

Hi Alan,

Thanks for your comments!


> Hi Cristian,

> Looking at points 10 and 11 it's good to hear nodes can be dynamically added.

Yes, many implementations allow on-the-fly remapping of a node from one parent
to another, or simply adding more nodes post-initialization, so it is natural
for the API to provide this.


> We've been trying to decide the best way to do this for support of qos 
> on tunnels for some time now and the existing implementation doesn't 
> allow this so effectively ruled out hierarchical queueing for tunnel targets 
> on the output interface.

> Having said that, has thought been given to separating the queueing from
> being so closely tied to the Ethernet transmit process? When queueing on a
> tunnel, for example, we may be working with encryption. When running with an
> anti-replay window it is really much better to do the QOS (packet
> reordering) before the encryption. To support this, would it be possible to
> have a separate scheduler structure which can be passed into the scheduling
> API? This means the calling code can hang the structure off whatever entity
> it wishes to perform qos on, and we get dynamic target support
> (sessions/tunnels etc).

Yes, this is one point where we need to look for a better solution. The
current proposal attaches the hierarchical scheduler function to an ethdev, so
scheduling traffic for tunnels that have a pre-defined bandwidth is not
supported nicely. This question was also raised in VPP, but there tunnels are
supported as a type of output interface, so attaching scheduling to an output
interface also covers the tunnels case.

It looks to me like nice tunnel abstractions are a gap in DPDK as well. Any
thoughts about how tunnels should be supported in DPDK? What do other people
think about this?


> Regarding the structure allocation, would it be possible to make the number
> of queues associated with a TC a compile-time option which the scheduler
> would accommodate?
> We frequently only use one queue per tc, which means 75% of the space
> allocated at the queueing layer for that tc is never used. This may be
> specific to our implementation, but if other implementations do the same it
> would be good to hear from folks so we can get a better idea of whether this
> is a common case.

> Whilst touching on the scheduler, the token replenishment works using a
> division and a multiplication, obviously to cater for the fact that it may
> be run after several tc windows have passed. The most commonly used
> industrial scheduler simply checks that the tc interval has elapsed and then
> adds the bc. This relies on the scheduler being called within the tc window
> though. It would be nice to have this as a configurable option since it's
> much more efficient, assuming the infra code from which it's called can
> guarantee the calling frequency.

This is probably feedback for librte_sched as opposed to the current API
proposal, as the latter is intended to be generic/implementation-agnostic and
therefore its scope far exceeds the existing set of librte_sched features.

Btw, we do plan to use librte_sched as the default fall-back when the HW
ethdev is not scheduler-enabled, as well as the implementation of choice for a
lot of use-cases where it fits really well, so we do have to continue to
evolve and improve librte_sched feature-wise and performance-wise.


> I hope you'll consider these points for inclusion into a future road 
> map.  Hopefully in the future my employer will increase the priority 
> of some of the tasks and a PR may appear on the mailing list.

> Thanks,
> Alan.



Re: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler

2016-12-09 Thread Alan Robertson
Hi Cristian,

No, it'll be done as a completely separate scheduling mechanism. We'd allocate
a much smaller footprint equivalent to a pipe, TCs and queues. This structure
would be completely independent. It would be up to the calling code to
allocate, track and free it, so it could be associated with any target. The
equivalent of the enqueue and dequeue functions would be called wherever they
were required in the data path. So if we look at an encrypted tunnel:

Ip forward -> qos enq/qos deq -> encrypt -> port forward (possibly qos again at port)

So each structure would work independently, with the assumption that it's
called frequently enough to keep the state machine ticking over, pretty much
as we do for a PMD scheduler.

Note that if we run the features in the above order, encrypted frames aren't
dropped by the qos enqueue. Since encryption is probably the most expensive
processing done on a packet, it should give a big performance gain.
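
Roughly, per tunnel, the data-path stage would look like the sketch below.
qos_sched_enqueue()/qos_sched_dequeue() are the hypothetical per-target
equivalents of the existing port-wide calls, and encrypt_burst()/
port_tx_burst() just stand in for the real stages:

    #include <rte_mbuf.h>

    #define BURST 32

    /* Hypothetical per-tunnel scheduler handle and helpers. */
    struct qos_sched;
    int qos_sched_enqueue(struct qos_sched *s, struct rte_mbuf **pkts,
            uint32_t n);
    int qos_sched_dequeue(struct qos_sched *s, struct rte_mbuf **pkts,
            uint32_t n);
    uint32_t encrypt_burst(struct rte_mbuf **pkts, uint32_t n);
    void port_tx_burst(struct rte_mbuf **pkts, uint32_t n);

    static void
    tunnel_tx(struct qos_sched *tunnel_qos, struct rte_mbuf **pkts,
            uint32_t n_pkts)
    {
        struct rte_mbuf *out[BURST];
        uint32_t n;

        /* QoS (including any drops) happens before the expensive
         * encryption stage, so we never encrypt a frame only to drop it. */
        qos_sched_enqueue(tunnel_qos, pkts, n_pkts);
        n = (uint32_t)qos_sched_dequeue(tunnel_qos, out, BURST);

        n = encrypt_burst(out, n);
        port_tx_burst(out, n);   /* possibly shaped again at the port */
    }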

Thanks,
Alan.

-Original Message-
From: Dumitrescu, Cristian [mailto:cristian.dumitre...@intel.com] 
Sent: Thursday, December 08, 2016 5:18 PM
To: Alan Robertson
Cc: dev@dpdk.org; Thomas Monjalon
Subject: RE: [dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical 
scheduler

> Hi Cristian,
> 
> The way qos works just now should be feasible for dynamic targets. That is,
> functions similar to rte_sched_port_enqueue() and rte_sched_port_dequeue()
> would be called: the first to enqueue the mbufs onto the queues, the second
> to dequeue them. The qos structures and scheduler don't need to be as
> functionally rich though. I would have thought a simple pipe with child
> nodes should suffice for most. That would allow each tunnel/session to be
> shaped, with the queueing and drop logic inherited from what is there just
> now.
> 
> Thanks,
> Alan.

Hi Alan,

So just to make sure I get this right: you suggest that tunnels/sessions could 
simply be mapped as one of the layers under the port hierarchy?

Thanks,
Cristian



Re: [dpdk-dev] [PATCH] sched: add 64-bit counter retrieval API

2018-07-23 Thread Alan Robertson
Looks good to me.

Alan.
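
(For anyone picking this up later, a minimal sketch of how the new 64-bit read
calls might be used, based on the prototypes in the patch quoted below; note
that, like the 32-bit variants, they clear the stats on read.)

    #include <stdio.h>
    #include <inttypes.h>
    #include <rte_sched.h>

    /* Assumes port was created with rte_sched_port_config() and that the
     * subport/queue ids are valid for that configuration. */
    static void
    dump_stats64(struct rte_sched_port *port, uint32_t subport_id,
            uint32_t queue_id)
    {
        struct rte_sched_subport_stats64 sstats;
        struct rte_sched_queue_stats64 qstats;
        uint32_t tc_ov;
        uint16_t qlen;

        if (rte_sched_subport_read_stats64(port, subport_id, &sstats,
                &tc_ov) == 0)
            printf("subport %u tc0: %" PRIu64 " pkts, %" PRIu64 " bytes\n",
                   subport_id, sstats.n_pkts_tc[0], sstats.n_bytes_tc[0]);

        if (rte_sched_queue_read_stats64(port, queue_id, &qstats, &qlen) == 0)
            printf("queue %u: %" PRIu64 " pkts, qlen %u\n",
                   queue_id, qstats.n_pkts, qlen);
    }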

On Wed, Jul 18, 2018 at 3:51 PM,   wrote:
> From: Alan Dewar 
>
> Add new APIs to retrieve counters in 64-bit wide fields.
>
> Signed-off-by: Alan Dewar 
> ---
>  lib/librte_sched/rte_sched.c   | 72 
>  lib/librte_sched/rte_sched.h   | 76 ++
>  lib/librte_sched/rte_sched_version.map |  2 +
>  3 files changed, 150 insertions(+)
>
> diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> index 9269e5c..4396366 100644
> --- a/lib/librte_sched/rte_sched.c
> +++ b/lib/librte_sched/rte_sched.c
> @@ -1070,6 +1070,43 @@ rte_sched_subport_read_stats(struct rte_sched_port *port,
> return 0;
>  }
>
> +int __rte_experimental
> +rte_sched_subport_read_stats64(struct rte_sched_port *port,
> +  uint32_t subport_id,
> +  struct rte_sched_subport_stats64 *stats64,
> +  uint32_t *tc_ov)
> +{
> +   struct rte_sched_subport *s;
> +   uint32_t tc;
> +
> +   /* Check user parameters */
> +   if (port == NULL || subport_id >= port->n_subports_per_port ||
> +   stats64 == NULL || tc_ov == NULL)
> +   return -1;
> +
> +   s = port->subport + subport_id;
> +
> +   /* Copy subport stats and clear */
> +   for (tc = 0; tc < RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE; tc++) {
> +   stats64->n_pkts_tc[tc] = s->stats.n_pkts_tc[tc];
> +   stats64->n_pkts_tc_dropped[tc] =
> +   s->stats.n_pkts_tc_dropped[tc];
> +   stats64->n_bytes_tc[tc] = s->stats.n_bytes_tc[tc];
> +   stats64->n_bytes_tc_dropped[tc] =
> +   s->stats.n_bytes_tc_dropped[tc];
> +#ifdef RTE_SCHED_RED
> +   stats64->n_pkts_red_dropped[tc] =
> +   s->stats.n_pkts_red_dropped[tc];
> +#endif
> +   }
> +   memset(&s->stats, 0, sizeof(struct rte_sched_subport_stats));
> +
> +   /* Subport TC oversubscription status */
> +   *tc_ov = s->tc_ov;
> +
> +   return 0;
> +}
> +
>  int
>  rte_sched_queue_read_stats(struct rte_sched_port *port,
> uint32_t queue_id,
> @@ -1099,6 +1136,41 @@ rte_sched_queue_read_stats(struct rte_sched_port *port,
> return 0;
>  }
>
> +int __rte_experimental
> +rte_sched_queue_read_stats64(struct rte_sched_port *port,
> +   uint32_t queue_id,
> +   struct rte_sched_queue_stats64 *stats64,
> +   uint16_t *qlen)
> +{
> +   struct rte_sched_queue *q;
> +   struct rte_sched_queue_extra *qe;
> +
> +   /* Check user parameters */
> +   if ((port == NULL) ||
> +   (queue_id >= rte_sched_port_queues_per_port(port)) ||
> +   (stats64 == NULL) ||
> +   (qlen == NULL)) {
> +   return -1;
> +   }
> +   q = port->queue + queue_id;
> +   qe = port->queue_extra + queue_id;
> +
> +   /* Copy queue stats and clear */
> +   stats64->n_pkts = qe->stats.n_pkts;
> +   stats64->n_pkts_dropped = qe->stats.n_pkts_dropped;
> +   stats64->n_bytes = qe->stats.n_bytes;
> +   stats64->n_bytes_dropped = qe->stats.n_bytes_dropped;
> +#ifdef RTE_SCHED_RED
> +   stats64->n_pkts_red_dropped = qe->stats.n_pkts_red_dropped;
> +#endif
> +   memset(&qe->stats, 0, sizeof(struct rte_sched_queue_stats));
> +
> +   /* Queue length */
> +   *qlen = q->qw - q->qr;
> +
> +   return 0;
> +}
> +
>  static inline uint32_t
>  rte_sched_port_qindex(struct rte_sched_port *port, uint32_t subport, 
> uint32_t pipe, uint32_t traffic_class, uint32_t queue)
>  {
> diff --git a/lib/librte_sched/rte_sched.h b/lib/librte_sched/rte_sched.h
> index 84fa896..f37a4d6 100644
> --- a/lib/librte_sched/rte_sched.h
> +++ b/lib/librte_sched/rte_sched.h
> @@ -141,6 +141,25 @@ struct rte_sched_subport_stats {
>  #endif
>  };
>
> +struct rte_sched_subport_stats64 {
> +   /* Packets */
> +   uint64_t n_pkts_tc[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> +   /**< Number of packets successfully written */
> +   uint64_t n_pkts_tc_dropped[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> +   /**< Number of packets dropped */
> +
> +   /* Bytes */
> +   uint64_t n_bytes_tc[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> +   /**< Number of bytes successfully written for each traffic class */
> +   uint64_t n_bytes_tc_dropped[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> +   /**< Number of bytes dropped for each traffic class */
> +
> +#ifdef RTE_SCHED_RED
> +   uint64_t n_pkts_red_dropped[RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE];
> +   /**< Number of packets dropped by red */
> +#endif
> +};
> +
>  /*
>   * Pipe configuration parameters. The period and credits_per_period
>   * parameters are measured in bytes, with one byte meaning the time
> @@ -182,6 +201,19 @@ struct rte_sched_queue_stats {
> uint32_t n_bytes_dropped;/**< Bytes dropped */
>  };
>
> +struct rte_sched_queue