> On 29 Apr 2020, at 00:55, Christian Hopps <cho...@chopps.org> wrote:
> 
> I wrote this longish email with diagrams and what not and accidentally 
> deleted it, so this one's shorter.. sorry :)
> 
> So I think this brings into question the value of doing more than a single 
> buffer worth of prefetch in a loop which prefetches multiple parts of the 
> buffer, right?
> 
> I.e., it's suggesting that the best way to do this (if possible) is
> 
> prefetch B-part1
> compute on B-part0
> prefetch B-part2
> compute on B-part1
> ...
> prefetch B-part(n)
> compute on B-part(n-1)
> compute on B-part(n)
> 
> And the only time it works to do more than 1 vlib_buffer_t prefetch would be 
> if the number of parts is 1 or something very close to 1.
> 
> You mentioned seeing better performance, did you make any changes and measure 
> positive results?



> It would be interesting to look at those changes to help illuminate the 
> guidance. Looking for cache misses in a loop is probably useful in tuning 
> this. I haven't done that yet, but it'll be interesting when I get to that 
> point. :)
> 
> Thanks,
> Chris.


I’m not aware of all the micro-architectural issues around clustering 
prefetches, but what we do know is that current Intel CPUs have 10 fill 
buffers [1]. As I understand it, a fill buffer stays occupied for as long as a 
memory access is pending, whether that access comes from a LOAD, a PREFETCH, 
or an outstanding request from the hardware L1 prefetchers.

That basically means that if you are processing 4 packets in parallel and 
prefetch the metadata and one cacheline of data for each packet in the 
cluster, you will consume 8 fill buffers. If you then try to read data from 
those 4 packets immediately after the prefetches, in the best case you will 
have 2 free slots, and the 3rd load will stall until one of the used fill 
buffers becomes available.
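
As a rough illustration of that arithmetic, here is a small C sketch. The 
function name and the model are hypothetical; the figure of 10 fill buffers is 
the one quoted above, and the model deliberately ignores hardware prefetchers 
and the fact that entries are freed as fills complete.

```c
/* Assumption: 10 fill buffers per core, as quoted from [1]. */
enum { FILL_BUFFERS = 10 };

/* Hypothetical helper: how many fill buffers remain free right after a
 * cluster of prefetches has been issued and before any of them completes.
 * Returns 0 when the cluster alone already uses every entry. */
int
free_fill_buffers (int n_packets, int prefetches_per_packet)
{
  int used = n_packets * prefetches_per_packet;
  return used >= FILL_BUFFERS ? 0 : FILL_BUFFERS - used;
}
```

With 4 packets and 2 prefetches each (metadata plus one data cacheline), 
free_fill_buffers (4, 2) gives 2, which matches the two free slots mentioned 
above.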

By interleaving prefetches with the rest of the code, you basically reduce 
the peak usage of fill buffers, which can avoid such situations.
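
A minimal sketch of what such interleaving could look like in C, following 
the "prefetch part n+1, compute on part n" pattern from the quoted mail. This 
is not VPP code: process_interleaved, compute_part and the 64-byte CACHE_LINE 
are assumptions made for illustration, and __builtin_prefetch is the GCC/Clang 
builtin rather than any VPP helper.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

/* Hypothetical per-part work: sum the bytes so the loads really happen. */
static uint64_t
compute_part (const uint8_t *p, size_t len)
{
  uint64_t sum = 0;
  for (size_t i = 0; i < len; i++)
    sum += p[i];
  return sum;
}

/* Walk the buffer one cache line at a time. Before computing on the
 * current part, issue a single prefetch for the next part, so at most one
 * software prefetch is outstanding while the compute step runs. */
uint64_t
process_interleaved (const uint8_t *buf, size_t len)
{
  uint64_t total = 0;
  for (size_t off = 0; off < len; off += CACHE_LINE)
    {
      size_t next = off + CACHE_LINE;
      if (next < len)
        __builtin_prefetch (buf + next, 0 /* read */, 3 /* high locality */);

      size_t n = len - off < CACHE_LINE ? len - off : CACHE_LINE;
      total += compute_part (buf + off, n);
    }
  return total;
}
```

Whether this actually beats a clustered prefetch depends on how much work 
each compute step does relative to memory latency, so it is worth measuring 
rather than assuming.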

There is a PMC counter called L1D_PEND_MISS.FB_FULL which counts such events. 
I can give you a “perf top” command line to monitor those events with their 
exact location in the code if you want to play with that…


— 
Damjan

[1] https://en.wikichip.org/wiki/File:skylake_server_block_diagram.svg
