> On 29 Apr 2020, at 00:55, Christian Hopps <cho...@chopps.org> wrote:
>
> I wrote this longish email with diagrams and what not and accidentally
> deleted it, so this one's shorter.. sorry :)
>
> So I think this brings into question the value of doing more than a single
> buffer worth of prefetch in a loop which prefetches multiple parts of the
> buffer, right?
>
> I.e., it's suggesting that the best way to do this (if possible) is
>
>     prefetch B-part1
>     compute on B-part0
>     prefetch B-part2
>     compute on B-part1
>     ...
>     prefetch B-part(n)
>     compute on B-part(n-1)
>     compute on B-part(n)
>
> And the only time it works to do more than 1 vlib_buffer_t prefetch would be
> if the number of parts is 1 or something very close to 1.
>
> You mentioned seeing better performance, did you make any changes and measure
> positive results?
> It would be interesting to look at those changes to help illuminate the
> guidance. Looking for cache misses in a loop is probably useful in tuning
> this. I haven't done that yet, but it'll be interesting when I get to that
> point. :)
>
> Thanks,
> Chris.

I’m not aware of all the micro-architectural issues around clustering prefetches, but what we do know is that current Intel CPUs have 10 fill buffers [1]. As I understand it, a fill buffer stays occupied for as long as a memory access is pending, whether it comes from a LOAD, a PREFETCH, or an outstanding request from the hardware L1 prefetchers.

That means that if you process 4 packets in parallel and prefetch the metadata plus one cacheline of data for each packet in the cluster, you consume 4 x 2 = 8 fill buffers. If you then try to read data from those 4 packets immediately after the prefetches, in the best case you have 2 free slots left, and the 3rd load will stall until one of the in-use fill buffers becomes available.

By interleaving prefetches with the rest of the code, you reduce the peak usage of fill buffers, which can avoid such stalls.

There is a PMC counter called L1D_PEND_MISS.FB_FULL which counts such events. I can give you a “perf top” command line to monitor such events with the exact location in the code if you want to play with that…

--
Damjan

[1] https://en.wikichip.org/wiki/File:skylake_server_block_diagram.svg
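
For reference, the "prefetch part i+1, compute on part i" ordering from the quoted message could be sketched roughly as below. This is only an illustration under stated assumptions, not VPP code: CACHE_LINE_BYTES, compute_on_part() and process_buffer_pipelined() are hypothetical names, the per-part work is a placeholder byte sum, and the GCC/Clang builtin __builtin_prefetch is used instead of VPP's own prefetch helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, one "part" per cache line. */
#define CACHE_LINE_BYTES 64

/* Hypothetical per-part work: sum the bytes of one part. */
static uint64_t
compute_on_part (const uint8_t *part, size_t len)
{
  uint64_t sum = 0;
  for (size_t i = 0; i < len; i++)
    sum += part[i];
  return sum;
}

/* Walk a buffer part by part, keeping at most one software prefetch in
   flight: issue the prefetch for part i+1, then compute on part i.
   Part 0 is not prefetched, matching the ordering in the quoted message. */
static uint64_t
process_buffer_pipelined (const uint8_t *buf, size_t len)
{
  assert (buf != NULL || len == 0);

  uint64_t total = 0;
  size_t n_parts = (len + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;

  for (size_t i = 0; i < n_parts; i++)
    {
      size_t off = i * CACHE_LINE_BYTES;
      size_t remaining = len - off;
      size_t part_len =
        remaining < CACHE_LINE_BYTES ? remaining : CACHE_LINE_BYTES;

      /* Prefetch the next part (read access, high temporal locality)
         before touching the current one. */
      if (i + 1 < n_parts)
        __builtin_prefetch (buf + off + CACHE_LINE_BYTES, 0, 3);

      total += compute_on_part (buf + off, part_len);
    }

  return total;
}
```

Calling process_buffer_pipelined (buf, len) on a buffer returns the same sum as a plain loop over the bytes; only the order of the memory requests changes. Whether this actually reduces fill-buffer pressure on a given CPU would still need to be confirmed with a counter such as L1D_PEND_MISS.FB_FULL.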
View/Reply Online (#16205): https://lists.fd.io/g/vpp-dev/message/16205