On Mon, 2019-02-04 at 19:32 +0100, Damjan Marion wrote:
> On 4 Feb 2019, at 14:19, Jerin Jacob Kollanukkaran <jer...@marvell.com> wrote:
>
>> On Sun, 2019-02-03 at 21:13 +0100, Damjan Marion wrote:
>>
>>> On 3 Feb 2019, at 20:13, Saxena, Nitin <nitin.sax...@cavium.com> wrote:
>>>
>>>> Hi Damjan,
>>>>
>>>> See function octeontx_fpa_bufpool_alloc(), called by
>>>> octeontx_fpa_dequeue(). It is a single read instruction to get the
>>>> pointer to the data.
>>>
>>> Yeah, saw that, and today the VPP buffer manager can grab up to 16
>>> buffer indices with one instruction, so no big deal here...
>>>
>>>> Similarly, octeontx_fpa_bufpool_free() is also a single write
>>>> instruction.
>>>
>>> So, if you are able to prove with numbers that the current software
>>> solution is low-performant, and you are confident that you can do
>>> significantly better, I will be happy to work with you on
>>> implementing support for the hardware buffer manager.
>>>
>>>> First of all, I welcome your patch, as we were also trying to remove
>>>> the latency seen by memcpy_x4() of the buffer template. As I said
>>>> earlier, the hardware buffer coprocessor is used by other packet
>>>> engines, hence the support has to be added in VPP. I am looking for
>>>> suggestions on how to resolve this.
>>>
>>> You can hardly get any suggestion from my side if you are ignoring
>>> my questions, which I asked in my previous email to get a better
>>> understanding of what your hardware does. "It is hardware so it is
>>> fast" is not a real argument; we need real data points before
>>> investing time in this area...
>>
>> Adding more details on the HW mempool manager attributes:
>>
>> 1) Semantically, a HW mempool manager is the same as a SW mempool
>>    manager.
>> 2) HW mempool managers have "alloc/dequeue" and "free/enqueue"
>>    operations, just like a SW mempool manager.
>> 3) HW mempool managers can work with the SW per-core local cache
>>    scheme too.
>> 4) User metadata initialization is not done in HW; SW needs to do it
>>    before free() or after alloc().
>> 5) Typically it has a "don't free" operation for the packet after Tx,
>>    which can be used as the back end for packet cloning (i.e.
>>    reference-count schemes).
>> 6) How the HW mempool manager improves performance:
>>    - MP/MC can work without locks (HW takes care of it internally).
>>    - HW frees the buffer on Tx, unlike the SW mempool case where the
>>      core does it, saving CPU cycles on packet Tx and the cost of
>>      bringing the packet back into the L1 cache.
>>    - On the Rx side, HW allocs/dequeues the packet from the mempool;
>>      no SW intervention is required.
>>
>> In terms of abstraction, the DPDK mempool manager abstracts SW and HW
>> mempools through a static struct rte_mempool_ops.
>>
>> Limitations:
>>
>> 1) Some NPU packet-processing HW can work only with the HW mempool
>>    manager (i.e. it cannot work with a SW mempool manager, because on
>>    Rx the HW relies on the mempool manager to alloc and then form the
>>    packet).
>>
>> Using the DPDK abstractions enables writing agnostic software that
>> works on both NPU and CPU models.
>
> VPP is not a DPDK application, so that doesn't work for us. DPDK is
> just one optional device-driver access method, and I hear more and
> more people asking for VPP without DPDK.
>
> We can implement hardware buffer manager support in VPP, but honestly
> I'm not convinced it will bring any huge value and justify the time
> investment. I would like somebody to prove me wrong, but with real
> data, not with statements like "it is hardware so it is faster".

I believe I have listed the HW buffer manager attributes, how it works
and what gain it gives (see point 6). We need to do it if VPP is to
support NPUs.

In terms of data points, what data points would you like to have?

> --
> Damjan
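
For context, the struct rte_mempool_ops abstraction referred to above
works roughly like this: a mempool driver registers its own
alloc/free/enqueue/dequeue callbacks, and the application selects the
backend by name at pool-creation time. The sketch below is illustrative
only; the hw_pool_* functions and the ops name "hw_fpa_example" are
hypothetical placeholders (the real OCTEON TX driver has its own
implementation and name), while struct rte_mempool_ops,
MEMPOOL_REGISTER_OPS() (RTE_MEMPOOL_REGISTER_OPS in newer DPDK) and
rte_pktmbuf_pool_create_by_ops() are actual DPDK APIs.

/*
 * Illustrative sketch (not the actual OCTEON TX driver): exposing a
 * hardware buffer pool through DPDK's struct rte_mempool_ops, so that
 * SW (ring) and HW pools look the same to the application.
 */
#include <errno.h>
#include <rte_mempool.h>
#include <rte_mbuf.h>

static int
hw_pool_alloc(struct rte_mempool *mp)
{
        /* A real driver would create/attach the hardware pool here and
         * store its handle, e.g. in mp->pool_id. */
        mp->pool_data = NULL;
        return 0;
}

static void
hw_pool_free(struct rte_mempool *mp)
{
        /* A real driver would release the hardware pool handle here. */
        (void)mp;
}

static int
hw_pool_enqueue(struct rte_mempool *mp, void * const *obj_table,
                unsigned int n)
{
        /* "free" path: hand buffer pointers back to the coprocessor
         * (a store per object or per burst). Placeholder: does nothing. */
        (void)mp; (void)obj_table; (void)n;
        return 0;
}

static int
hw_pool_dequeue(struct rte_mempool *mp, void **obj_table, unsigned int n)
{
        /* "alloc" path: pull buffer pointers from the coprocessor
         * (a load per object or per burst). Placeholder: always empty. */
        (void)mp; (void)obj_table; (void)n;
        return -ENOENT;
}

static unsigned int
hw_pool_get_count(const struct rte_mempool *mp)
{
        /* Placeholder: a real driver queries the hardware free count. */
        (void)mp;
        return 0;
}

static const struct rte_mempool_ops hw_mempool_ops = {
        .name      = "hw_fpa_example",
        .alloc     = hw_pool_alloc,
        .free      = hw_pool_free,
        .enqueue   = hw_pool_enqueue,
        .dequeue   = hw_pool_dequeue,
        .get_count = hw_pool_get_count,
};

MEMPOOL_REGISTER_OPS(hw_mempool_ops);

/* Application side: the call is the same for SW and HW pools; only the
 * ops name differs. */
struct rte_mempool *
create_pktmbuf_pool(int socket_id, int use_hw)
{
        return rte_pktmbuf_pool_create_by_ops("pkt_pool", 8192, 256, 0,
                                              RTE_MBUF_DEFAULT_BUF_SIZE,
                                              socket_id,
                                              use_hw ? "hw_fpa_example"
                                                     : "ring_mp_mc");
}

Whether dequeue is a ring pop or a coprocessor load is invisible to the
caller, which is the sense in which HW and SW mempool managers are
"semantically the same" (attribute 1), and the cache_size argument
(256 above) provides the per-core local cache on top of either backend
(attribute 3).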