> On 4 Feb 2019, at 19:38, Jerin Jacob Kollanukkaran <jer...@marvell.com> wrote:
> 
> On Mon, 2019-02-04 at 19:32 +0100, Damjan Marion wrote:
>> 
>> 
>>> On 4 Feb 2019, at 14:19, Jerin Jacob Kollanukkaran <jer...@marvell.com> wrote:
>>> 
>>> On Sun, 2019-02-03 at 21:13 +0100, Damjan Marion wrote:
>>>> 
>>>>> On 3 Feb 2019, at 20:13, Saxena, Nitin <nitin.sax...@cavium.com> wrote:
>>>>> 
>>>>> Hi Damjan,
>>>>> 
>>>>> See function octeontx_fpa_bufpool_alloc() called by 
>>>>> octeontx_fpa_dequeue(). It's a single read instruction to get the pointer 
>>>>> to the data.
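>>>>> 
>>>>> Roughly, the alloc/free fast path is just a single MMIO access per buffer; 
>>>>> an illustrative sketch (the register pointers are hypothetical, this is 
>>>>> not the actual octeontx_fpa code):
>>>>> 
>>>>>   #include <stdint.h>
>>>>> 
>>>>>   /* hypothetical HW pool fast path: one load allocates, one store frees */
>>>>>   static inline void *
>>>>>   hw_pool_alloc (volatile uint64_t *alloc_reg)
>>>>>   {
>>>>>     /* a single 64-bit read returns the next free buffer address,
>>>>>        or 0 when the pool is empty */
>>>>>     return (void *) (uintptr_t) *alloc_reg;
>>>>>   }
>>>>> 
>>>>>   static inline void
>>>>>   hw_pool_free (volatile uint64_t *free_reg, void *buf)
>>>>>   {
>>>>>     /* a single 64-bit write hands the buffer back to the HW pool */
>>>>>     *free_reg = (uint64_t) (uintptr_t) buf;
>>>>>   }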
>>>> 
>>>> Yeah, saw that, and today the vpp buffer manager can grab up to 16 buffer 
>>>> indices with one instruction, so no big deal here....
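>>>> 
>>>> Roughly, the usage on the vlib side looks like this (a simplified sketch, 
>>>> not a complete graph node):
>>>> 
>>>>   #include <vlib/vlib.h>
>>>> 
>>>>   static void
>>>>   grab_buffers_example (vlib_main_t *vm)
>>>>   {
>>>>     u32 buffers[16];
>>>>     /* one call returns up to 16 buffer indices; they are copied out of
>>>>        the per-thread cache, on x86 potentially with a single 64-byte
>>>>        vector move */
>>>>     u32 n = vlib_buffer_alloc (vm, buffers, 16);
>>>>     /* ... fill and send buffers[0..n-1], or give them back ... */
>>>>     vlib_buffer_free (vm, buffers, n);
>>>>   }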
>>>> 
>>>>> Similarly, octeontx_fpa_bufpool_free() is also a single write 
>>>>> instruction. 
>>>>> 
>>>>>> So, if you are able to prove with numbers that the current software 
>>>>>> solution is low-performant, and you are confident that you can do 
>>>>>> significantly better, I will be happy to work with you on implementing 
>>>>>> support for a hardware buffer manager.
>>>>> First of all, I welcome your patch, as we were also trying to remove the 
>>>>> latencies seen by memcpy_x4() of the buffer template. As I said earlier, 
>>>>> the hardware buffer coprocessor is used by other packet engines, hence 
>>>>> support for it has to be added in VPP. I am looking for suggestions on 
>>>>> how to resolve this.
>>>> 
>>>> You can hardly get any suggestion from my side if you are ignoring my 
>>>> questions, which I asked in my previous email to get a better understanding 
>>>> of what your hardware does.
>>>> 
>>>> "It is hardware so it is fast" is not real argument, we need real 
>>>> datapoints before investing time into this area....
>>> 
>>> 
>>> Adding more details on the HW mempool manager attributes:
>>> 
>>> 1) Semantically, a HW mempool manager is the same as a SW mempool manager
>>> 2) HW mempool managers have "alloc/dequeue" and "free/enqueue" operations, 
>>> just like a SW mempool manager
>>> 3) HW mempool managers can work with the SW per-core local cache scheme too 
>>> (see the sketch after this list)
>>> 4) User metadata initialization is not done in HW; SW needs to do it before 
>>> free() or after alloc()
>>> 5) Typically there is an operation to "don't free" the packet after Tx, 
>>> which can be used as the back end for cloning the packet (aka reference 
>>> count schemes)
>>> 6) How the HW mempool manager improves performance:
>>> - MP/MC can work without locks (HW takes care of it internally)
>>> - HW frees the buffer on Tx, unlike the SW mempool case where the core does 
>>> it. This saves the CPU cycles spent freeing on packet Tx and the cost of 
>>> bringing the packet back into the L1 cache.
>>> - On the Rx side, HW allocs/dequeues the packet from the mempool. No SW 
>>> intervention required.
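>>> 
>>> To make point 3 concrete, the per-core cache simply sits in front of the HW 
>>> alloc/free operations, roughly like this (illustrative sketch only; the 
>>> hw_pool_dequeue/hw_pool_enqueue helpers stand in for the real driver calls):
>>> 
>>>   #include <stdint.h>
>>> 
>>>   #define CACHE_SIZE 32
>>> 
>>>   struct percore_cache {
>>>     uint32_t n;
>>>     void *objs[CACHE_SIZE];
>>>   };
>>> 
>>>   void *hw_pool_dequeue (void);       /* single HW alloc operation */
>>>   void  hw_pool_enqueue (void *obj);  /* single HW free operation */
>>> 
>>>   static inline void *
>>>   pool_get (struct percore_cache *c)
>>>   {
>>>     if (c->n)                     /* hit the local cache first */
>>>       return c->objs[--c->n];
>>>     return hw_pool_dequeue ();    /* fall back to the HW manager */
>>>   }
>>> 
>>>   static inline void
>>>   pool_put (struct percore_cache *c, void *obj)
>>>   {
>>>     if (c->n < CACHE_SIZE)        /* keep hot buffers on this core */
>>>       c->objs[c->n++] = obj;
>>>     else
>>>       hw_pool_enqueue (obj);      /* overflow goes back to the HW pool */
>>>   }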
>>> 
>>> In terms of abstraction, the DPDK mempool manager abstracts SW and HW 
>>> mempools through a static struct rte_mempool_ops.
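>>> 
>>> For reference, a HW pool driver plugs in by filling that ops structure and 
>>> registering it. A stubbed-out sketch ("my_hw_pool" and the stub bodies are 
>>> made up, not a real driver):
>>> 
>>>   #include <rte_mempool.h>
>>> 
>>>   /* real driver: set up / tear down the HW pool here */
>>>   static int my_alloc (struct rte_mempool *mp) { return 0; }
>>>   static void my_free (struct rte_mempool *mp) { }
>>> 
>>>   static int
>>>   my_enqueue (struct rte_mempool *mp, void * const *obj_table, unsigned int n)
>>>   {
>>>     return 0;   /* real driver: push the n objects to the HW pool */
>>>   }
>>> 
>>>   static int
>>>   my_dequeue (struct rte_mempool *mp, void **obj_table, unsigned int n)
>>>   {
>>>     return 0;   /* real driver: pull n objects from the HW pool */
>>>   }
>>> 
>>>   static unsigned int
>>>   my_get_count (const struct rte_mempool *mp) { return 0; }
>>> 
>>>   static const struct rte_mempool_ops my_hw_ops = {
>>>     .name = "my_hw_pool",
>>>     .alloc = my_alloc,
>>>     .free = my_free,
>>>     .enqueue = my_enqueue,
>>>     .dequeue = my_dequeue,
>>>     .get_count = my_get_count,
>>>   };
>>> 
>>>   MEMPOOL_REGISTER_OPS(my_hw_ops);
>>> 
>>> The application then just selects the pool ops by name with 
>>> rte_mempool_set_ops_byname(), and the rest of the mempool API stays the same.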
>>> 
>>> Limitations:
>>> 1) Some NPU packet processing HW can work only with the HW mempool manager 
>>> (aka it cannot work with a SW mempool manager, because on Rx the HW goes to 
>>> the mempool manager to alloc the buffer and then form the packet)
>>> 
>>> Using the DPDK abstractions will enable writing agnostic software which 
>>> works on both NPU and CPU models.
>> 
>> VPP is not a DPDK application, so that doesn't work for us. DPDK is just one 
>> optional device driver access method, and I hear more and more people asking 
>> for VPP without DPDK.
>> 
>> We can implement hardware buffer manager support in VPP, but honestly I'm 
>> not convinced that it will bring any huge value and justify the time 
>> investment. I would like somebody to prove me wrong, but with real data, not 
>> with statements like "it is hardware so it is faster".
> 
> I believe I have listed the HW buffer manager attributes, how it works, and 
> what gains it gives (see point 6).

Let me just confirm: in the DPDK case, you are checking refcnt as part of tx 
enqueue and marking such buffers with a don't-free flag, so packets which have 
refcnt==1 actually never end up in the mempool cache. If that understanding is 
correct, how does that fit with your statement that it can work with the 
per-core cache scheme?

What happens with packets which are marked as don't-free? How do you deal with 
the refcnt decrement? And how do you track them?
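
To make sure we are talking about the same mechanism, I mean something like 
this in the tx path (hypothetical names; HW_DESC_DONT_FREE is made up just to 
illustrate the question):

  #include <stdint.h>
  #include <rte_mbuf.h>

  /* hypothetical descriptor bit telling the HW not to recycle the buffer */
  #define HW_DESC_DONT_FREE (1ULL << 0)

  static inline uint64_t
  tx_desc_flags (struct rte_mbuf *m)
  {
    uint64_t flags = 0;
    if (rte_mbuf_refcnt_read (m) > 1)
      {
        /* shared buffer: HW must not free it on tx completion, so
           someone still has to decrement refcnt later; where? */
        flags |= HW_DESC_DONT_FREE;
      }
    /* refcnt == 1: HW frees straight back into the HW pool on tx
       completion, bypassing any per-core SW cache */
    return flags;
  }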

> Need to do it if VPP needs to support NPU.

New NPU support in VPP can be done quite easily. It took me less than a week to 
introduce support for Marvell PP2.

> In terms of data points, what data points would you like to have?

Expected performance gain. I.e., today vpp takes roughly 100 clocks/packet for 
the full ip4 forwarding baseline test on x86. IP4 forwarding means (rx ring 
enqueue, ethertype lookup, ip4 mandatory checks, ip4 lookup, l2 header rewrite, 
tx enqueue, tx buffer free, counters). I would like to understand what the 
numbers are for a similar test on arm today and how much improvement you expect 
from implementing a hw buffer manager.
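(For a sense of scale: assuming a ~2 GHz core clock, 100 clocks/packet works 
out to roughly 20 Mpps per core for that full path.)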
 
-- 
Damjan
