On Wed, 17 Feb 2016 14:40:26 +0000 "Stokes, Ian" <ian.sto...@intel.com> wrote:
> > -----Original Message----- > > From: dev [mailto:dev-boun...@openvswitch.org] On Behalf Of Stokes, Ian > > Sent: Thursday, February 11, 2016 2:42 PM > > To: Flavio Leitner > > Cc: dev@openvswitch.org > > Subject: Re: [ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS functionality. > > > > > On Thu, 11 Feb 2016 12:32:56 +0000 > > > "Stokes, Ian" <ian.sto...@intel.com> wrote: > > > > > > > Thanks For the review Flavio, much appreciated, comments inline. > > > > I'll > > > re-spin a new version also. > > > > > > Thank you for the patch! > > > comments inline. > > > > > > > > > > > > > > > -----Original Message----- > > > > > From: Flavio Leitner [mailto:f...@sysclose.org] > > > > > Sent: Wednesday, February 10, 2016 7:55 PM > > > > > To: Stokes, Ian > > > > > Cc: dev@openvswitch.org > > > > > Subject: Re: [ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS > > > functionality. > > > > > > > > > > On Mon, 1 Feb 2016 20:47:25 +0000 Ian Stokes > > > > > <ian.sto...@intel.com> wrote: > > > > > > > > > > > This patch provides the modifications required in netdev-dpdk.c > > > > > > and vswitch.xml to allow for a DPDK user space QoS algorithm. > > > > > > > > > > > > This patch adds a QoS configuration structure for netdev-dpdk > > > > > > and expected QoS operations 'dpdk_qos_ops'. Various helper > > > > > > functions are also supplied. > > > > > > > > > > > > Also included are the modifications required for vswitch.xml to > > > > > > allow a new QoS implementation for netdev-dpdk devices. This > > > > > > includes a new QoS type `egress-policer` as well as its expected > > > QoS table entries. > > > > > > > > > > > > The QoS functionality implemented for DPDK devices is `egress- > > > > > policer`. > > > > > > This can be used to drop egress packets at a configurable rate. > > > > > > > > > > > > The INSTALL.DPDK.md guide has also been modified to provide an > > > > > > example configuration of `egress-policer` QoS. > > > > > > > > > > > > Signed-off-by: Ian Stokes <ian.sto...@intel.com> > > > > > > --- > > > > > > INSTALL.DPDK.md | 20 +++ > > > > > > lib/netdev-dpdk.c | 439 > > > > > ++++++++++++++++++++++++++++++++++++++++++++++++-- > > > > > > vswitchd/vswitch.xml | 52 ++++++ > > > > > > 3 files changed, 498 insertions(+), 13 deletions(-) > > > > > > > > > > > > diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md index > > > > > > e8ef4b5..dac70cd > > > > > > 100644 > > > > > > --- a/INSTALL.DPDK.md > > > > > > +++ b/INSTALL.DPDK.md > > > > > > @@ -207,6 +207,26 @@ Using the DPDK with ovs-vswitchd: > > > > > > ./ovs-ofctl add-flow br0 in_port=2,action=output:1 > > > > > > ``` > > > > > > > > > > > > +8. QoS usage example > > > > > > + > > > > > > + Assuming you have a vhost-user port transmitting traffic > > > > > consisting of > > > > > > + packets of size 64 bytes, the following command would limit > > > > > > + the > > > > > egress > > > > > > + transmission rate of the port to ~1,000,000 packets per > > > second: > > > > > > + > > > > > > + `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos > > > > > > + create > > > > > qos > > > > > > + type=egress-policer other-config:cir=46000000 > > > > > > + other-config:cbs=2048` > > > > > > + > > > > > > + To examine the QoS configuration of the port: > > > > > > + > > > > > > + `ovs-appctl -t ovs-vswitchd qos/show vhost-user0` > > > > > > + > > > > > > + To clear the QoS configuration from the port and ovsdb use > > > > > > + the > > > > > following: > > > > > > + > > > > > > + `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port > > > > > > + vhost-user0 qos` > > > > > > + > > > > > > + For more details regarding egress-policer parameters please > > > > > > + refer > > > > > to the > > > > > > + vswitch.xml. > > > > > > + > > > > > > Performance Tuning: > > > > > > ------------------- > > > > > > > > > > > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index > > > > > > 09ccc2c..716da79 100644 > > > > > > --- a/lib/netdev-dpdk.c > > > > > > +++ b/lib/netdev-dpdk.c > > > > > > @@ -44,6 +44,7 @@ > > > > > > #include "ovs-rcu.h" > > > > > > #include "packets.h" > > > > > > #include "shash.h" > > > > > > +#include "smap.h" > > > > > > #include "sset.h" > > > > > > #include "unaligned.h" > > > > > > #include "timeval.h" > > > > > > @@ -52,12 +53,14 @@ > > > > > > > > > > > > #include "rte_config.h" > > > > > > #include "rte_mbuf.h" > > > > > > +#include "rte_meter.h" > > > > > > #include "rte_virtio_net.h" > > > > > > > > > > > > VLOG_DEFINE_THIS_MODULE(dpdk); > > > > > > static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); > > > > > > > > > > > > #define DPDK_PORT_WATCHDOG_INTERVAL 5 > > > > > > +#define DPDK_MAX_QOS_NAME_SIZE 10 > > > > Noted, will expand to 20 for the moment unless you think it should > > > > be > > > higher? > > > > > > I'd rather take that out completely. > > > > > > > Done, Daniele had a similar comment and it's not needed. > > > > > > > > + > > > > > > +/* Action that can be set for a packet for rte_meter */ enum > > > > > > +egress_policer_action { > > > > > > + GREEN = e_RTE_METER_GREEN, > > > > > > + DROP = 1, > > > > > > +}; > > > > > > > > > > I don't think it's a good idea to mix external and internal > > > > > declarations because if the external one changes, it will break > > > > > here. Since GREEN and DROP have meanings only inside of OVS, we > > > should define our own values. > > > > > See my comment where this is used. > > > > > > > > > Just to clarify, is this in relation to the use of > > e_RTE_METER_GREEN? > > > > Is it prefereable for GREEN to be equal to our own declaration? > > > > > > We can use e_RTE_METER_GREEN and we should do it to be consistent with > > > the rte function. My concern is having two declarations: one coming > > > from outside and another that you defined to a fixed number. So, what > > > happens if e_RTE_METER_GREEN gets shifted and now its value is 1? > > > > > > But with the simplification I proposed for > > > egress_policer_pkt_handle__() that enum isn't needed and we can take > > it out completely. > > > > > > > Agreed, I've tested your suggestion and it works fine + is cleaner. > > > > > > > > > > > static inline void > > > > > > netdev_dpdk_vhost_update_tx_counters(struct netdev_stats > > *stats, > > > > > > struct dp_packet > > > > > > **packets, @@ > > > > > > -1092,6 +1222,7 @@ __netdev_dpdk_vhost_send(struct netdev > > > > > > *netdev, int > > > > > qid, > > > > > > struct virtio_net *virtio_dev = > > > > > netdev_dpdk_get_virtio(vhost_dev); > > > > > > struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; > > > > > > unsigned int total_pkts = cnt; > > > > > > + unsigned int qos_pkts = cnt; > > > > > > uint64_t start = 0; > > > > > > > > > > > > if (OVS_UNLIKELY(!is_vhost_running(virtio_dev))) { @@ > > > > > > -1106,6 > > > > > > +1237,10 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int > > > > > > +qid, > > > > > > rte_spinlock_lock(&vhost_dev->tx_q[qid].tx_lock); > > > > > > } > > > > > > > > > > > > + /* Check has QoS has been configured for the netdev */ > > > > > > + cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt); > > > > > > + qos_pkts -= cnt; > > > > > > > > > > I don't see how this is working because > > > > > egress_policer_alg_process() will loop in the array one by one > > freeing or not the packet. > > > > > If it does nothing, then the array is good but otherwise it > > > > > creates a problem because the array still references freed > > > > > entries. So in loop below we are sending a mix of good and freed > > > > > buffers up to > > > 'cnt'. > > > > > > > > > So we should never have freed packets included in cnt as it will > > > > cause a segfault. Consider when an array of size cnt is passed to > > > > the netdev_dpdk_qos_process(). The timestamp used to update the > > > > token bucket is taken before the for loop is entered. This way the > > > > token bucket is only updated once for each batch of packets of size > > cnt. > > > > The loop cycles through the array, for any packet that is green the > > > > token bucket decremented. If a packet is dropped then the bucket has > > > > been exhausted. The token bucket will not be updated again within > > > > the run of the loop. Any packets remaining in the array will be > > > > dropped (i.e. freed) as the bucket is exhausted. The value of cnt > > > > will only be the number of pkts that were not dropped (i.e. not > > > > freed). This value will then be passed to the send function ensuring > > > > no freed packets are present in the array that is to be processed. > > > > If the token bucket updated between each individual packet being > > > > processed then you could have the situation you describe, by > > > > processing them as a batch with a single update beforehand you avoid > > it. > > > > > > I haven't looked into the rte implementation yet so I can't tell if > > > the decision to drop or not the packet is based solely on the > > > timestamp which doesn't change during the batch processing. > > > > > > Assuming that is the case I agree with you that the cnt would have > > > only the first valid entries to be sent out. > > > > > > My concern is when we are processing a batch with two packets. > > > One big packet and another small one. While processing the big > > > packet, there is no enough tokens and the packet is marked as > > > RED/DROPPED. Then the loop gets the second small packet, but now some > > > enough tokens were added/still available in the bucket which passes > > the small packet. > > > > > > > Apologies, I was looking at this the wrong way. I understand your point > > now. > > Let me think about this. And I'll post a rough solution here before > > submitting. > > So there are 4 cases we can have with policing in terms of how packets have > been processed. > > (i) The token bucket is not exhausted and no packets are dropped. > (ii) The token bucket is exhausted while processing and packets are dropped > sequentially for the remainder of the burst. > (iii) The token bucket is exhausted from previous call so all packets are > dropped. > (iv) Packets are not dropped sequentially, could occur with large packet > being dropped before smaller packet being processed. > > Currently we handle cases (i)(ii) & (iii). However case (iv) is not handled > and will cause a segfault. > > Egress_policer_run() returns a cnt of the number of packets remaining after > the policer has run. > The assumption here currently is that this cnt can be directly mapped to the > entries in the rte_mbuff array i.e. 0 to cnt-1 will contain pkts and will not > contain NULL entries or packets that have been freed. > > To deal with case (iv) we will have to ensure that the entries in the > rte_mbuff array from 0 to cnt-1 contains packets only. This will require the > entries in the rte_mbuff array to be re-arranged if 0 to cnt-1 contains freed > packets. > > How to do this as optimally as possible becomes the next question. > > A potential/rough solution would be as follows > > We introduce 2 new index values in egress_policer_run(). > Index value 1 'f_dropped' represents the array entry position where the first > packet dropped is located in the rte_mbuff array. > Index position 2 'l_sent' is the entry location of the last packet to be sent > in the array. > > Something like the following could be done to check and handle non sequential > drops. > > /* Check if drops are not sequential */ > if (f_dropped < l_sent) { > > /* Must move any packets that are present after drop > * so that all packets are sequential in rte_mbuff array > * pkts */ > for (x = f_dropped; x <= l_sent; x++) { > for (y = x+1; y <= l_sent; y++ ) { > if (pkts[y] != NULL) { > pkts[x] = pkts[y]; > pkts[y] = NULL; > break; > } > continue; > } > } > } > /* return cnt i.e. number of packets present in the array */ > return cnt; > > > Would something like what was outlined above be ok? Do people have a more > optimal solution? That seems to not work when you have multiple drops in the middle or maybe I missed something. What about something like this: int cnt = 0; int i = 0; for (i = 0; i < pkt_cnt; i++) { pkt = pkts[i]; /* Handle current packet */ if (egress_policer_pkt_handle__(&policer->egress_meter, pkt, current_time) == GREEN) { if (cnt != i) { pkts[cnt] = pkt; } cnt++; } else { rte_pktmbuf_free(pkt); } } return cnt; Thanks, -- fbl _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev