Hi Damjan,

________________________________
From: Damjan Marion <dmar...@me.com>
Sent: Thursday, March 21, 2019 8:47 PM
To: Nitin Saxena
Cc: vpp-dev@lists.fd.io; Narayana Prasad Raju Athreya
Subject: Re: [EXT] Re: [vpp-dev] 128 byte cache line support



On 21 Mar 2019, at 16:04, Nitin Saxena <nsax...@marvell.com> wrote:

Hi Damjan,

________________________________
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> on behalf of Damjan Marion 
via Lists.Fd.Io <dmarion=me....@lists.fd.io>
Sent: Thursday, March 21, 2019 6:03 PM
To: Nitin Saxena
Cc: vpp-dev@lists.fd.io
Subject: Re: [EXT] Re: [vpp-dev] 128 byte cache line support



> On 21 Mar 2019, at 06:51, Nitin Saxena <nsax...@marvell.com> wrote:
>
> Hi,
>
> First of all, sorry for responding late to this mail chain. Please see my 
> answers inline in blue.
>
> Thanks,
> Nitin
>
>
> From: Damjan Marion <dmar...@me.com>
> Sent: Monday, March 18, 2019 4:48 PM
> To: Honnappa Nagarahalli
> Cc: vpp-dev; Nitin Saxena
> Subject: [EXT] Re: [vpp-dev] 128 byte cache line support
>
>
>> On 15 Mar 2019, at 04:52, Honnappa Nagarahalli <honnappa.nagaraha...@arm.com> wrote:
>>
>>
>>
>> Related to change 18278[1], I was wondering if there is really a benefit of 
>> dealing with 128-byte cachelines like we do today.
>> Compiling VPP with cacheline size set to 128 will basically just add 64 
>> bytes of unused space at the end of each cacheline so
>> vlib_buffer_t for example will grow from 128 bytes to 256 bytes, but we will 
>> still need to prefetch 2 cachelines like we do by default.
>>
>> [Nitin]: This is the existing model. In the forwarding case, mainly the 
>> first vlib cache line is used. We are utilising the existing hole (in the 
>> first vlib cache line) by putting packet parsing info there (size == 64B). 
>> This has many benefits; one of them is avoiding the ipv4-input-no-chksum() 
>> software checks. It gives us a ~20-cycle benefit on our platform, so I do 
>> not want to lose that gain.

That sounds like a terribly bad idea, and it will likely never be upstreamed.
vlib_buffer_t is a 128-byte data structure, and it is a perfect fit for 
128-byte cacheline systems. I don't see a point in growing it to 256 bytes.
If you need more space, you can always grow the headroom space by an 
additional cacheline and store whatever you want there.
[Nitin]: In the current upstream code, VLIB_BUFFER_HDR_SIZE is 256 bytes for a 
128B cache line target, not 128 bytes. So we haven't made any changes to the 
upstream code except introducing a CLIB_MIN_CACHE_LINE_BYTES == 64 macro and 
using it across vlib_buffer_t as a MARKER. That way we utilise the existing 
(unused) hole in the first cache line of vlib_buffer_t.

I understood that; I'm proposing that we fix it so the header size is always 
128, as it should be.
[Nitin]: At this point I am not in favour of changing the existing 
vlib_buffer_t structure since, as I already mentioned, we gain ~20 cycles in 
the L3 forwarding case. I don't have data on how much it would help in TCP or 
UDP handling, where we would touch the second cache line of the vlib header.
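
To make the layout concrete, here is a minimal sketch of what is being 
described above. The CLIB_MIN_CACHE_LINE_BYTES macro and the parse-info field 
come from our out-of-tree patch, not upstream VPP, and the names below are 
illustrative only:

  /* Illustration only -- not the upstream vlib_buffer_t definition. */
  #define CLIB_MIN_CACHE_LINE_BYTES 64  /* marker granularity used by the patch */

  typedef struct
  {
    /* first 64 bytes: the metadata every forwarding node touches */
    unsigned char generic_metadata[64];

    /* On a build with CLIB_CACHE_LINE_BYTES == 128 this second half of the
       same hardware cache line is otherwise padding; the patch parks 64 bytes
       of packet parse results here instead. */
    unsigned char parse_info[64]
      __attribute__ ((aligned (CLIB_MIN_CACHE_LINE_BYTES)));
  } example_buffer_first_cacheline_t;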

>> What will happen if we just leave that at 64?
>>
>> [Nitin]: This will create L1D holes on 128B targets, right?
>>
>> Unutilized holes are not acceptable, as they waste L1D space and thereby 
>> hurt performance. On the contrary, we want to pack structures from 2x64B 
>> into a single 128B cache line to reduce the number of pending prefetches in 
>> the core pipeline. VPP heavily issues 64B LOAD/STORE prefetches, and our 
>> effort is to reduce them for our target.
>>

Not sure what you mean by L1D holes. My proposal is to align all per-thread 
data structures to 128 bytes, not to grow anything.
[Nitin]: OK, I misunderstood your comment when you said "if we just leave that 
at 64".
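
For clarity, a rough sketch of that proposal as I understand it (the macro and 
structure names below are made up, not existing vppinfra definitions):

  /* Per-thread state aligned to 128 bytes even when the build assumes a
     64-byte cache line, so no two threads ever share a 128-byte block. */
  #define EXAMPLE_PER_THREAD_ALIGN 128

  typedef struct
  {
    unsigned long counters[8];          /* whatever this thread owns */
  } __attribute__ ((aligned (EXAMPLE_PER_THREAD_ALIGN))) example_per_thread_t;

  /* sizeof (example_per_thread_t) is rounded up to a multiple of 128, so an
     array indexed by thread index keeps each element in its own 128B block. */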

>> [Honnappa] Currently, ThunderX1 and Octeon TX have a 128B cache line. What 
>> I have heard from Marvell folks is that the 64B cache line setting in DPDK 
>> does not work. I have not gone into the details of what exactly does not 
>> work. Maybe Nitin can elaborate.
>>
> I’m curious to hear details…
>>
>> 1. sometimes (and not very frequently) we will issue 2 prefetch 
>> instructions for the same cacheline, but I hope the hardware is smart 
>> enough to just ignore the 2nd one
>>
>> 2. we may face false sharing issues if the first 64 bytes are touched by 
>> one thread and the other 64 bytes are touched by another one
>>
>> The second one sounds to me like a real problem, but it can be solved by 
>> aligning all per-thread data structures to 2 x cacheline size.
>>
>> [Honnappa] Sorry, I don’t understand you here. Even if the data structure is 
>> aligned on 128B (2 X 64B), 2 contiguous blocks of 64B data would be on a 
>> single cache line.
>>
> I wanted to say that we can align all per-thread data structures to 128 
> bytes, even on systems which have a 64-byte cacheline.
>>
>> Actually, if I remember correctly, even on x86 some of the hardware 
>> prefetchers deal with blocks of 2 cachelines.
>>
>> So unless I missed something, my proposal here is: instead of maintaining 
>> special 128-byte images for some ARM64 machines, let's just align all 
>> per-thread data structures to 128 and have just one ARM image.
>>
>> [Honnappa] When we run VPP compiled with 128B cache line size on platforms 
>> with 64B cache line size, there is a performance degradation.
>>
> Yeah, sure, what I’m suggesting here is how to address that perf degradation.
> [Nitin]: Is this proposal for Intel as well? If yes, then I am fine with the 
> proposal, but I think it will decrease performance on 64B architectures with 
> the existing code.

I'm curious to hear why you think so…
[Nitin]: So if all per-thread data structures have a single cache line MARKER 
and they are always prefetched from this marker (with CLIB_CACHE_LINE_BYTES as 
the size argument rather than the size of the structure), then yes, it should 
be fine.
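
Something along these lines (the structure is illustrative; CLIB_PREFETCH and 
the cacheline0 marker pattern are existing vppinfra constructs):

  #include <vppinfra/cache.h>  /* CLIB_PREFETCH, CLIB_CACHE_LINE_ALIGN_MARK */

  typedef struct
  {
    CLIB_CACHE_LINE_ALIGN_MARK (cacheline0);
    unsigned long counters[8];
  } example_per_thread_t;

  static inline void
  example_prefetch (example_per_thread_t * t)
  {
    /* Prefetch one hardware cache line starting at the marker.  Passing
       CLIB_CACHE_LINE_BYTES rather than sizeof (t[0]) keeps this to a single
       prefetch instruction per line on both 64B and 128B builds. */
    CLIB_PREFETCH (t->cacheline0, CLIB_CACHE_LINE_BYTES, LOAD);
  }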

I'm proposing that all per-thread data structures be aligned to 128 bytes 
independently of the cache line size. Code should prefetch what it is 
interested in; in the worst case it will issue two prefetches for the same 
cacheline, but I expect your hardware is smart enough to just ignore the 2nd 
prefetch.
[Nitin]: OK. Theoretically I think it should work and the second prefetch (at 
the 64th byte) will be ignored, but I will check internally as well. Thanks.
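
In other words, the worst case looks something like this (plain GCC builtins, 
just to illustrate):

  /* A build that assumes 64-byte cache lines prefetches both halves of a
     128-byte-aligned structure. */
  static inline void
  example_prefetch_both_halves (void *p)
  {
    __builtin_prefetch ((char *) p);       /* bytes 0..63 */
    __builtin_prefetch ((char *) p + 64);  /* bytes 64..127: a separate line on
                                              a 64B CPU, the same line (and a
                                              redundant prefetch) on a 128B CPU */
  }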
