From: Chas Williams <3ch...@gmail.com>

>On Thu, Aug 23, 2018 at 3:28 AM Matan Azrad <ma...@mellanox.com> wrote:
>Hi
>
>From: Eric Kinzie
>> On Wed Aug 22 11:42:37 +0000 2018, Matan Azrad wrote:
>> > Hi Luca
>> >
>> > From: Luca Boccassi
>> > > On Wed, 2018-08-22 at 07:09 +0000, Matan Azrad wrote:
>> > > > Hi Chas
>> > > >
>> > > > From: Chas Williams
>> > > > > On Tue, Aug 21, 2018 at 11:43 AM Matan Azrad
>> > > > > <matan@mellanox.com> wrote:
>> > > > > Hi Chas
>> > > > >
>> > > > > From: Chas Williams
>> > > > > > On Tue, Aug 21, 2018 at 6:56 AM Matan Azrad
>> > > > > > <matan@mellanox.com> wrote:
>> > > > > > Hi
>> > > > > >
>> > > > > > From: Chas Williams
>> > > > > > > This will need to be implemented for some of the other RX
>> > > > > > > burst methods at some point for other modes to see this
>> > > > > > > performance improvement (with the exception of active-backup).
>> > > > > >
>> > > > > > Yes, I think it should be done at least for
>> > > > > > bond_ethdev_rx_burst_8023ad_fast_queue (should be easy) for now.
>> > > > > >
>> > > > > > There is some duplicated code between the various RX paths.
>> > > > > > I would like to eliminate that as much as possible, so I was
>> > > > > > going to give that some thought first.
>> > > > >
>> > > > > There is no reason to keep this function as is while its twin is
>> > > > > changed.
>> > > > >
>> > > > > Unfortunately, this is all the patch I have at this time.
>> > > > >
>> > > > > >
>> > > > > > > On Thu, Aug 16, 2018 at 9:32 AM Luca Boccassi
>> > > > > > > <bluca@debian.org> wrote:
>> > > > > > >
>> > > > > > > > During bond 802.3ad receive, a burst of packets is fetched
>> > > > > > > > from each slave into a local array and appended to a
>> > > > > > > > per-slave ring buffer.
>> > > > > > > > Packets are taken from the head of the ring buffer and
>> > > > > > > > returned to the caller. The number of mbufs provided to
>> > > > > > > > each slave is sufficient to meet the requirements of the
>> > > > > > > > ixgbe vector receive.
>> > > > > >
>> > > > > > Luca,
>> > > > > >
>> > > > > > Can you explain these requirements of ixgbe?
>> > > > > >
>> > > > > > The ixgbe (and some other Intel PMDs) have vectorized RX
>> > > > > > routines that are more efficient (if not faster), taking
>> > > > > > advantage of some advanced CPU instructions. I think you need
>> > > > > > to be receiving at least 32 packets or more.
>> > > > >
>> > > > > So why do it in bonding, which is a generic driver for all the
>> > > > > vendors' PMDs? If it is better for ixgbe and other Intel NICs,
>> > > > > you can force those PMDs to always receive 32 packets and to
>> > > > > manage a ring by themselves.
>> > > > >
>> > > > > The drawback of the ring is some additional latency on the
>> > > > > receive path.
>> > > > > In testing, the additional latency hasn't been an issue for bonding.
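To make the quoted commit message concrete, here is a minimal sketch of the
buffered RX idea in DPDK C. The function name, the per-slave rte_ring argument
and the burst size of 32 are illustrative assumptions; the actual patch
differs in naming and structure.

/* Minimal sketch of the buffered RX idea described above (hypothetical
 * names; the real patch differs in details).  Each slave gets its own
 * rte_ring of mbuf pointers.  We always ask the slave for a burst large
 * enough for the vectorized RX paths (32 here), park the surplus in the
 * ring, and serve the application from the ring head.
 */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define SLAVE_RX_BURST 32   /* assumed minimum for ixgbe vector RX */

static uint16_t
buffered_slave_rx(uint16_t slave_port, uint16_t queue_id,
                  struct rte_ring *slave_ring,
                  struct rte_mbuf **bufs, uint16_t nb_pkts)
{
        struct rte_mbuf *pkts[SLAVE_RX_BURST];
        uint16_t nb_rx, i;

        /* 1. Serve the caller from packets buffered on a previous call. */
        uint16_t nb_deq = rte_ring_dequeue_burst(slave_ring, (void **)bufs,
                                                 nb_pkts, NULL);

        /* 2. Pull a full-size burst from the slave so the slave PMD can
         *    use its vectorized receive path. */
        nb_rx = rte_eth_rx_burst(slave_port, queue_id, pkts, SLAVE_RX_BURST);

        /* 3. Hand out as much of the fresh burst as the caller still wants. */
        for (i = 0; i < nb_rx && nb_deq < nb_pkts; i++)
                bufs[nb_deq++] = pkts[i];

        /* Park the surplus for the next call.  A full ring would require
         * freeing the leftovers; omitted in this sketch. */
        if (i < nb_rx)
                rte_ring_enqueue_burst(slave_ring, (void **)&pkts[i],
                                       nb_rx - i, NULL);

        return nb_deq;
}

The point of always requesting a full SLAVE_RX_BURST from the slave is that
the slave PMD's vectorized receive stays usable even when the caller asks for
only a few packets; the surplus waits in the ring for the next call.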
>> > > >
>> > > > When bonding does processing slower, it may become a bottleneck
>> > > > for the packet processing of some applications.
>> > > >
>> > > > > The bonding PMD has a fair bit of overhead associated with the
>> > > > > RX and TX path calculations. Most applications can just arrange
>> > > > > to call the RX path with a sufficiently large receive. Bonding
>> > > > > can't do this.
>> > > >
>> > > > I wasn't talking about the application, I was talking about the slave
>> > > > PMDs. A slave PMD can manage a ring by itself if it helps its own
>> > > > performance. The bonding should not be oriented to specific PMDs.
>> > >
>> > > The issue though is that the performance problem is not with the
>> > > individual PMDs - it's with bonding. There were no reports regarding
>> > > the individual PMDs.
>> > > This comes from reports from customers from real world production
>> > > deployments - the issue of bonding being too slow was raised multiple
>> > > times.
>> > > This patch addresses those issues, again in production deployments,
>> > > where it's been used for years, to users' and customers' satisfaction.
>> >
>> > From Chas I understood that using a burst of 32 helps some slave PMDs'
>> > performance, which makes sense.
>> > I can't understand how the extra copy phases improve the performance of
>> > the bonding itself:
>> >
>> > You added two copy phases in the bonding RX function, so each packet now
>> > passes through three phases:
>> > 1. Get packets from the slave to a local array.
>> > 2. Copy packet pointers from the local array to the ring array.
>> > 3. Copy packet pointers from the ring array to the application array.
>> >
>> > Each packet arriving to the application must pass the above 3 phases (in
>> > a specific call or in previous calls).
>> >
>> > Without this patch we have only -
>> > Get packets from the slave to the application array.
>> >
>> > Can you explain how the extra copies improve the bonding performance?
>> >
>> > It looks like it improves the slave PMDs, and because of that the bonding
>> > PMD performance becomes better.
>>
>> I'm not sure that adding more buffer management to the vector PMDs will
>> improve the drivers' performance; it's just that calling the rx function in
>> such a way that it returns no data wastes time.
>
>Sorry, I don't fully understand what you said here, please rephrase.
>
>> The bonding driver is already an exercise in buffer management, so adding
>> this layer of indirection here makes sense in my opinion, as does hiding
>> the details of the constituent interfaces where possible.
>
>Can you explain how this new buffer management with the extra pointer copies
>improves the performance of the bonding itself?
>It looks really strange to me.
>
>Because rings are generally quite efficient.
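For contrast with the copy-phase count discussed above, this is roughly what
the direct, ring-less RX path looks like (a simplified, hypothetical sketch;
the real 802.3ad receive path also has to filter out LACP control frames):

/* Sketch of the direct (pre-patch) RX path being discussed: each slave
 * writes straight into the application's array, with no intermediate ring
 * and no extra pointer copies.
 */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static uint16_t
direct_bond_rx(const uint16_t *slave_ports, uint16_t nb_slaves,
               uint16_t queue_id, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
        uint16_t nb_rx = 0, i;

        /* Whatever is left of the caller's burst is offered to the next
         * slave, so the last slaves polled may be asked for only a few
         * packets - too few for a vectorized receive. */
        for (i = 0; i < nb_slaves && nb_rx < nb_pkts; i++)
                nb_rx += rte_eth_rx_burst(slave_ports[i], queue_id,
                                          bufs + nb_rx, nb_pkts - nb_rx);

        return nb_rx;
}

Here the later slaves in the loop are asked only for whatever remains of the
caller's burst, which is the situation that defeats the vectorized receive.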
But you are using a ring in addition to the regular array management, so it
must hurt the performance of the bonding PMD itself (the bonding, not the
slave PMDs which are called from the bonding).

>Bonding is in a middle ground between application and PMD.

Yes.

>What bonding is doing, may not improve all applications.

Yes, but it can be solved using some bonding modes.

>If using a ring to buffer the vectorized receive routines improves your
>particular application, that's great.

It may be not great, and even bad, for some other PMDs which are not
vectorized.

>However, I don't think I can say that it would help all
>applications. As you point out, there is overhead associated with
>a ring.

Yes.

>Bonding's receive burst isn't especially efficient (in mode 4).

Why?

>Bonding benefits from being able to read as much as possible (within limits
>of course, large reads would blow out caches) from each slave.

The slave PMDs can benefit in the same way.

>It can't return all that data though because applications tend to use the
>burst size that would be efficient for a typical PMD.

What is the preferred burst size of the bonding? Maybe applications should
use it when they are using bonding.

>An alternative might be to ask bonding applications to simply issue larger
>reads for certain modes. That's probably not as easy as it sounds given the
>way that the burst length affects multiplexing.

Can you explain it more?

>Another solution might be to just alternately poll the individual
>slaves on each rx burst. But that means you need to poll at a
>faster rate. Depending on your application, you might not be
>able to do that.

Again, can you be more precise in the above explanation?

>We can avoid this scheduling overhead by just
>doing the extra reads in bonding and buffering in a ring.
>
>Since bonding is going to be buffering everything in a ring,

? I am not familiar with it. For now I don't think we need a ring.

>it makes sense to just read as much as possible, as efficiently as possible.
>For the Intel adapters, this means using a read big enough to trigger the
>vectorized versions.

>> > > So I'd like to share this improvement rather than keeping it private
>> > > - because I'm nice that way :-P
>> > >
>> > > > > > Did you check for other vendor PMDs? It may hurt performance
>> > > > > > there..
>> > > > > >
>> > > > > > I don't know, but I suspect probably not. For the most part
>> > > > > > you are typically reading almost up to the vector requirement.
>> > > > > > But if one slave has just a single packet, then you can't
>> > > > > > vectorize on the next slave.
>> > > > > >
>> > > > >
>> > > > > I don't think that the ring overhead is better for PMDs which
>> > > > > are not using the vectorized instructions.
>> > > > >
>> > > > > The non-vectorized PMDs are usually quite slow. The additional
>> > > > > overhead doesn't make a difference in their performance.
>> > > >
>> > > > We should not make things worse than they are.
>> > >
>> > > There were no reports that this made things worse. The feedback from
>> > > production was that it made things better.
>> >
>> > Yes, it may be good for some slave drivers but hurt other slave drivers,
>> > so maybe it should stay private to specific customers using specific NICs.
>> >
>> > Again, I can understand how this patch improves the performance of some
>> > PMDs; therefore I think the bonding is not the place to add it, but maybe
>> > some PMDs are.
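For completeness, a sketch of the "alternately poll the individual slaves"
alternative mentioned above; this is hypothetical and not what the patch
implements:

/* Round-robin alternative: only one slave is polled per rx burst, so each
 * poll can use the caller's full burst size, but every slave is visited
 * only once every nb_slaves calls - hence the need to poll at a faster
 * rate.
 */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static uint16_t
round_robin_bond_rx(const uint16_t *slave_ports, uint16_t nb_slaves,
                    uint16_t queue_id, struct rte_mbuf **bufs,
                    uint16_t nb_pkts)
{
        static uint16_t next_slave;  /* per-queue state in a real PMD */
        uint16_t slave = slave_ports[next_slave];

        next_slave = (next_slave + 1) % nb_slaves;

        /* The whole burst goes to a single slave, which keeps the request
         * large enough for vectorized RX without any intermediate ring. */
        return rte_eth_rx_burst(slave, queue_id, bufs, nb_pkts);
}

Each slave then sees the caller's full burst size with no intermediate ring,
but it is visited only once every nb_slaves calls, which is why the polling
rate would have to increase.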