> -----Original Message----- > From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Stokes, Ian > Sent: Thursday, February 11, 2016 2:42 PM > To: Flavio Leitner > Cc: dev@openvswitch.org > Subject: Re: [ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS functionality. > > > On Thu, 11 Feb 2016 12:32:56 +0000 > > "Stokes, Ian" <ian.sto...@intel.com> wrote: > > > > > Thanks For the review Flavio, much appreciated, comments inline. > > > I'll > > re-spin a new version also. > > > > Thank you for the patch! > > comments inline. > > > > > > > > > > > -----Original Message----- > > > > From: Flavio Leitner [mailto:f...@sysclose.org] > > > > Sent: Wednesday, February 10, 2016 7:55 PM > > > > To: Stokes, Ian > > > > Cc: dev@openvswitch.org > > > > Subject: Re: [ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS > > functionality. > > > > > > > > On Mon, 1 Feb 2016 20:47:25 +0000 Ian Stokes > > > > <ian.sto...@intel.com> wrote: > > > > > > > > > This patch provides the modifications required in netdev-dpdk.c > > > > > and vswitch.xml to allow for a DPDK user space QoS algorithm. > > > > > > > > > > This patch adds a QoS configuration structure for netdev-dpdk > > > > > and expected QoS operations 'dpdk_qos_ops'. Various helper > > > > > functions are also supplied. > > > > > > > > > > Also included are the modifications required for vswitch.xml to > > > > > allow a new QoS implementation for netdev-dpdk devices. This > > > > > includes a new QoS type `egress-policer` as well as its expected > > QoS table entries. > > > > > > > > > > The QoS functionality implemented for DPDK devices is `egress- > > > > policer`. > > > > > This can be used to drop egress packets at a configurable rate. > > > > > > > > > > The INSTALL.DPDK.md guide has also been modified to provide an > > > > > example configuration of `egress-policer` QoS. > > > > > > > > > > Signed-off-by: Ian Stokes <ian.sto...@intel.com> > > > > > --- > > > > > INSTALL.DPDK.md | 20 +++ > > > > > lib/netdev-dpdk.c | 439 > > > > ++++++++++++++++++++++++++++++++++++++++++++++++-- > > > > > vswitchd/vswitch.xml | 52 ++++++ > > > > > 3 files changed, 498 insertions(+), 13 deletions(-) > > > > > > > > > > diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md index > > > > > e8ef4b5..dac70cd > > > > > 100644 > > > > > --- a/INSTALL.DPDK.md > > > > > +++ b/INSTALL.DPDK.md > > > > > @@ -207,6 +207,26 @@ Using the DPDK with ovs-vswitchd: > > > > > ./ovs-ofctl add-flow br0 in_port=2,action=output:1 > > > > > ``` > > > > > > > > > > +8. QoS usage example > > > > > + > > > > > + Assuming you have a vhost-user port transmitting traffic > > > > consisting of > > > > > + packets of size 64 bytes, the following command would limit > > > > > + the > > > > egress > > > > > + transmission rate of the port to ~1,000,000 packets per > > second: > > > > > + > > > > > + `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos > > > > > + create > > > > qos > > > > > + type=egress-policer other-config:cir=46000000 > > > > > + other-config:cbs=2048` > > > > > + > > > > > + To examine the QoS configuration of the port: > > > > > + > > > > > + `ovs-appctl -t ovs-vswitchd qos/show vhost-user0` > > > > > + > > > > > + To clear the QoS configuration from the port and ovsdb use > > > > > + the > > > > following: > > > > > + > > > > > + `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port > > > > > + vhost-user0 qos` > > > > > + > > > > > + For more details regarding egress-policer parameters please > > > > > + refer > > > > to the > > > > > + vswitch.xml. > > > > > + > > > > > Performance Tuning: > > > > > ------------------- > > > > > > > > > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index > > > > > 09ccc2c..716da79 100644 > > > > > --- a/lib/netdev-dpdk.c > > > > > +++ b/lib/netdev-dpdk.c > > > > > @@ -44,6 +44,7 @@ > > > > > #include "ovs-rcu.h" > > > > > #include "packets.h" > > > > > #include "shash.h" > > > > > +#include "smap.h" > > > > > #include "sset.h" > > > > > #include "unaligned.h" > > > > > #include "timeval.h" > > > > > @@ -52,12 +53,14 @@ > > > > > > > > > > #include "rte_config.h" > > > > > #include "rte_mbuf.h" > > > > > +#include "rte_meter.h" > > > > > #include "rte_virtio_net.h" > > > > > > > > > > VLOG_DEFINE_THIS_MODULE(dpdk); > > > > > static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); > > > > > > > > > > #define DPDK_PORT_WATCHDOG_INTERVAL 5 > > > > > +#define DPDK_MAX_QOS_NAME_SIZE 10 > > > Noted, will expand to 20 for the moment unless you think it should > > > be > > higher? > > > > I'd rather take that out completely. > > > > Done, Daniele had a similar comment and it's not needed. > > > > > > + > > > > > +/* Action that can be set for a packet for rte_meter */ enum > > > > > +egress_policer_action { > > > > > + GREEN = e_RTE_METER_GREEN, > > > > > + DROP = 1, > > > > > +}; > > > > > > > > I don't think it's a good idea to mix external and internal > > > > declarations because if the external one changes, it will break > > > > here. Since GREEN and DROP have meanings only inside of OVS, we > > should define our own values. > > > > See my comment where this is used. > > > > > > > Just to clarify, is this in relation to the use of > e_RTE_METER_GREEN? > > > Is it prefereable for GREEN to be equal to our own declaration? > > > > We can use e_RTE_METER_GREEN and we should do it to be consistent with > > the rte function. My concern is having two declarations: one coming > > from outside and another that you defined to a fixed number. So, what > > happens if e_RTE_METER_GREEN gets shifted and now its value is 1? > > > > But with the simplification I proposed for > > egress_policer_pkt_handle__() that enum isn't needed and we can take > it out completely. > > > > Agreed, I've tested your suggestion and it works fine + is cleaner. > > > > > > > > static inline void > > > > > netdev_dpdk_vhost_update_tx_counters(struct netdev_stats > *stats, > > > > > struct dp_packet > > > > > **packets, @@ > > > > > -1092,6 +1222,7 @@ __netdev_dpdk_vhost_send(struct netdev > > > > > *netdev, int > > > > qid, > > > > > struct virtio_net *virtio_dev = > > > > netdev_dpdk_get_virtio(vhost_dev); > > > > > struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; > > > > > unsigned int total_pkts = cnt; > > > > > + unsigned int qos_pkts = cnt; > > > > > uint64_t start = 0; > > > > > > > > > > if (OVS_UNLIKELY(!is_vhost_running(virtio_dev))) { @@ > > > > > -1106,6 > > > > > +1237,10 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int > > > > > +qid, > > > > > rte_spinlock_lock(&vhost_dev->tx_q[qid].tx_lock); > > > > > } > > > > > > > > > > + /* Check has QoS has been configured for the netdev */ > > > > > + cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt); > > > > > + qos_pkts -= cnt; > > > > > > > > I don't see how this is working because > > > > egress_policer_alg_process() will loop in the array one by one > freeing or not the packet. > > > > If it does nothing, then the array is good but otherwise it > > > > creates a problem because the array still references freed > > > > entries. So in loop below we are sending a mix of good and freed > > > > buffers up to > > 'cnt'. > > > > > > > So we should never have freed packets included in cnt as it will > > > cause a segfault. Consider when an array of size cnt is passed to > > > the netdev_dpdk_qos_process(). The timestamp used to update the > > > token bucket is taken before the for loop is entered. This way the > > > token bucket is only updated once for each batch of packets of size > cnt. > > > The loop cycles through the array, for any packet that is green the > > > token bucket decremented. If a packet is dropped then the bucket has > > > been exhausted. The token bucket will not be updated again within > > > the run of the loop. Any packets remaining in the array will be > > > dropped (i.e. freed) as the bucket is exhausted. The value of cnt > > > will only be the number of pkts that were not dropped (i.e. not > > > freed). This value will then be passed to the send function ensuring > > > no freed packets are present in the array that is to be processed. > > > If the token bucket updated between each individual packet being > > > processed then you could have the situation you describe, by > > > processing them as a batch with a single update beforehand you avoid > it. > > > > I haven't looked into the rte implementation yet so I can't tell if > > the decision to drop or not the packet is based solely on the > > timestamp which doesn't change during the batch processing. > > > > Assuming that is the case I agree with you that the cnt would have > > only the first valid entries to be sent out. > > > > My concern is when we are processing a batch with two packets. > > One big packet and another small one. While processing the big > > packet, there is no enough tokens and the packet is marked as > > RED/DROPPED. Then the loop gets the second small packet, but now some > > enough tokens were added/still available in the bucket which passes > the small packet. > > > > Apologies, I was looking at this the wrong way. I understand your point > now. > Let me think about this. And I'll post a rough solution here before > submitting.
So there are 4 cases we can have with policing in terms of how packets have been processed. (i) The token bucket is not exhausted and no packets are dropped. (ii) The token bucket is exhausted while processing and packets are dropped sequentially for the remainder of the burst. (iii) The token bucket is exhausted from previous call so all packets are dropped. (iv) Packets are not dropped sequentially, could occur with large packet being dropped before smaller packet being processed. Currently we handle cases (i)(ii) & (iii). However case (iv) is not handled and will cause a segfault. Egress_policer_run() returns a cnt of the number of packets remaining after the policer has run. The assumption here currently is that this cnt can be directly mapped to the entries in the rte_mbuff array i.e. 0 to cnt-1 will contain pkts and will not contain NULL entries or packets that have been freed. To deal with case (iv) we will have to ensure that the entries in the rte_mbuff array from 0 to cnt-1 contains packets only. This will require the entries in the rte_mbuff array to be re-arranged if 0 to cnt-1 contains freed packets. How to do this as optimally as possible becomes the next question. A potential/rough solution would be as follows We introduce 2 new index values in egress_policer_run(). Index value 1 'f_dropped' represents the array entry position where the first packet dropped is located in the rte_mbuff array. Index position 2 'l_sent' is the entry location of the last packet to be sent in the array. Something like the following could be done to check and handle non sequential drops. /* Check if drops are not sequential */ if (f_dropped < l_sent) { /* Must move any packets that are present after drop * so that all packets are sequential in rte_mbuff array * pkts */ for (x = f_dropped; x <= l_sent; x++) { for (y = x+1; y <= l_sent; y++ ) { if (pkts[y] != NULL) { pkts[x] = pkts[y]; pkts[y] = NULL; break; } continue; } } } /* return cnt i.e. number of packets present in the array */ return cnt; Would something like what was outlined above be ok? Do people have a more optimal solution? > > > > > > > > + > > > > > +static inline int > > > > > +egress_policer_pkt_handle__(struct rte_meter_srtcm *meter, > > > > > + struct rte_mbuf *pkt, uint64_t > > > > > +time) { > > > > > + uint8_t output_color; > > > > > + uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct > > > > ether_hdr); > > > > > + enum egress_policer_action action = GREEN; > > > > > + > > > > > + /* color input is not used for blind modes */ > > > > > + output_color = (uint8_t) > > > > > rte_meter_srtcm_color_blind_check(meter, > > > > time, > > > > > + > > > > pkt_len); > > > > > + > > > > > + /* Check output color, 0 corresponds to GREEN i.e. packet > > > > > can be > > > > > + * transmitted. If greater than 0 then action is set to > > > > > DROP. */ > > > > > + if (output_color) { > > > > > + action = DROP; > > > > > + } > > > > > > > > > > > > This could be simpler: > > > > > > > > static inline int > > > > egress_policer_pkt_handle__(struct rte_meter_srtcm *meter, > > > > struct rte_mbuf *pkt, uint64_t time) { > > > > uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct > > > > ether_hdr); > > > > > > > > /* If GREEN, accept it otherwise drop it */ > > > > if (rte_meter_srtcm_color_blind_check(meter, time, pkt_len) > > > > != > > > > e_RTE_METER_GREEN) { > > > > return 1; > > > > } > > > > > > > > return 0; > > > > } > > > > > > > > > > > Agreed this is simpler/better. In relation to the external > > > declaration comment above, Is it preferable not to use > > > e_RTE_METER_GREEN here and instead have a value defined in OVS? > > > > We should use it to be consistent with the rte implementation. > Done. > > > > Thanks, > > -- > > fbl > > _______________________________________________ > dev mailing list > dev@openvswitch.org > http://openvswitch.org/mailman/listinfo/dev _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev