+1 Florin
> On Jul 14, 2021, at 10:47 AM, Damjan Marion via lists.fd.io <dmarion=me....@lists.fd.io> wrote:
>
> I spent a bit of time looking at this and came up with a reasonable solution.
>
> First, the 128-byte cacheline is not dead; the recently announced Marvell Octeon 10 has a 128-byte cacheline.
>
> In the current code, the cacheline size defines both the amount of data a prefetch instruction prefetches and the alignment of data in data structures needed to avoid false sharing.
>
> So I think ideally we should have the following:
>
> - on x86:
>   - number of bytes a prefetch instruction prefetches set to 64
>   - data structures aligned to 64 bytes
>   - due to the fact that there is an adjacent-cacheline prefetcher on x86, it may be worth investigating whether aligning to 128 brings some value
>
> - on AArch64:
>   - number of bytes a prefetch instruction prefetches set to 64 or 128, based on the multiarch variant running
>   - data structures aligned to 128 bytes, as that value prevents false sharing on both 64- and 128-byte cacheline systems
>
> The main problem is the abuse of the CLIB_PREFETCH() macro in our codebase. The original idea was good: somebody wanted to provide a macro which transparently emits 1-4 prefetch instructions based on data size, recognising that there may be systems with different cacheline sizes.
>
> Like:
>   CLIB_PREFETCH (p, sizeof (ip6_header_t), LOAD);
>
> But in reality, most of the time we have:
>   CLIB_PREFETCH (p, CLIB_CACHE_LINE_BYTES, LOAD);
>
> where it is assumed that the cacheline size is 64, and that just wastes resources on systems with a 128-byte cacheline.
>
> Also, most places in our codebase are perfectly fine with whatever the cacheline size is, so I'm thinking about the following:
>
> 1. set CLIB_CACHE_LINE_BYTES to 64 on x86 and 128 on ARM; that will make sure false sharing is not happening
>
> 2. introduce CLIB_CACHE_PREFETCH_BYTES, which can be set to a different value for each multiarch variant (64 for N1, 128 for ThunderX2)
>
> 3. modify the CLIB_PREFETCH macro to use CLIB_CACHE_PREFETCH_BYTES to emit the proper number of prefetch instructions for cases where the data size is specified
>
> 4. take a stab and replace all `CLIB_PREFETCH (p, CLIB_CACHE_LINE_BYTES, LOAD);` with `clib_prefetch_load (p);`. There may be exceptions, but those lines typically mean: "I want to prefetch a few (<=64) bytes at this address and I really don't care what the cache line size is."
>
> 5. analyse the remaining few cases where CLIB_PREFETCH() is used with a size specified in terms of CLIB_CACHE_LINE_BYTES.
>
> Thoughts?
>
> —
> Damjan
>
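For concreteness, a minimal sketch of what steps 1-4 could look like. The names CLIB_CACHE_PREFETCH_BYTES and clib_prefetch_load come from the proposal above, but the guards, values and exact macro shape here are assumptions, not the final patch:

'''
/* Sketch only -- details assumed, not taken from a merged patch. */

/* 1. alignment constant: 64 on x86, 128 on AArch64; prevents false
 *    sharing on both 64B and 128B cacheline parts */
#ifdef __aarch64__
#define CLIB_CACHE_LINE_BYTES 128
#else
#define CLIB_CACHE_LINE_BYTES 64
#endif

/* 2. prefetch granularity, overridable per multiarch variant
 *    (e.g. 64 for a Neoverse N1 variant, 128 for a ThunderX2 one) */
#ifndef CLIB_CACHE_PREFETCH_BYTES
#define CLIB_CACHE_PREFETCH_BYTES 64
#endif

/* 3. emit one prefetch per CLIB_CACHE_PREFETCH_BYTES chunk of 'size',
 *    instead of per CLIB_CACHE_LINE_BYTES as today */
#define _CLIB_PREFETCH(n, size, type)                                   \
  if ((size) > (n) * CLIB_CACHE_PREFETCH_BYTES)                         \
    __builtin_prefetch ((char *) _addr + (n) * CLIB_CACHE_PREFETCH_BYTES, \
                        CLIB_PREFETCH_##type, /* locality */ 3);

/* 4. helper for "prefetch a few bytes, any line size is fine" */
static inline void
clib_prefetch_load (void *p)
{
  __builtin_prefetch (p, /* read */ 0, /* locality */ 3);
}
'''

With this shape, a call like CLIB_PREFETCH (p, sizeof (ip6_header_t), LOAD) keeps working unchanged, while the common CLIB_CACHE_LINE_BYTES case collapses to a single clib_prefetch_load (p).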
>> On 06.07.2021., at 03:48, Lijian Zhang <lijian.zh...@arm.com> wrote:
>>
>> Thanks Damjan for your comments. Some replies inline.
>> <snip>
>> Hi Lijian,
>>
>> It would be good to know whether the 128-byte cacheline is something Arm platforms will be using in the future or whether it is just historical.
>> [Lijian] Existing ThunderX1 and OcteonTX2 CPUs have 128-byte cache lines. To my knowledge, there may be more CPUs with 128-byte cache lines in the future.
>>
>> The cacheline size problem is not just about prefetching; an even bigger issue is false sharing, so we need to address both.
>> [Lijian] Yes, there may be false-sharing issues when running a VPP image with the 64B definition on 128B cache-line CPUs. We will do some scalability testing for that case and check the multi-core performance.
>>
>> Probably the best solution is to have two VPP images, one for a 128- and one for a 64-byte cacheline size.
>> [Lijian] For a natively built image, that's fine, but I'm not sure it's possible for cloud binaries installed via "apt-get install".
>>
>> Going across the whole codebase and replacing prefetch macros is something we should definitely avoid.
>> [Lijian] I understand your concerns about a large-scope replacement. My concern is that when CLIB_PREFETCH() is used to prefetch packet content into cache, as in the example below, the cache line size (CLIB_CACHE_LINE_BYTES) seems to always be assumed to be 64 bytes:
>>   CLIB_PREFETCH (p2->data, 3 * CLIB_CACHE_LINE_BYTES, LOAD);
>>
>> —
>> Damjan
>>
>>
>> On 05.07.2021., at 07:28, Lijian Zhang <lijian.zh...@arm.com> wrote:
>>
>> Hi Damjan,
>> I committed several patches to address some issues around cache-line definitions in VPP.
>>
>> Patch [1.1] resolves the build error [2] on 64-byte cache line Arm CPUs, e.g., ThunderX2 and Neoverse N1, caused by the commit https://gerrit.fd.io/r/c/vpp/+/32996 (build: remove unused files and sections). It also supports building the Arm generic image (with "make build-release") with the 128-byte cache line definition, and building a native image with the 64-byte cache line definition on some Arm CPUs, e.g., ThunderX2 and Neoverse N1 (with "make build-release TARGET_PLATFORM=native").
>>
>> Patch [1.5] changes the default cache line definition in the Arm generic image from 128 bytes to 64 bytes. Setting the cache line definition to 128 bytes for the Arm generic image is required for ThunderX1 (with a 128-byte physical cache line), which is also the build machine in the FD.io lab. I'm thinking of setting the 64-byte cache line definition in VPP for the Arm image, which will affect ThunderX1 and OcteonTX2 CPUs, so it requires confirmation by Marvell.
>>
>> Arm architecture CPUs physically have either 128-byte or 64-byte cache lines, so whichever of the two values the VPP source code defines, the prefetch functions in a generic image will not work properly on all Arm CPUs. Patches [1.2], [1.3] and [1.4] resolve this issue.
>>
>> Consider, for example, running the Arm generic image (cache-line-size defined as 128B in the Makefile for all Arm architectures) on 64B cache-line-size CPUs, e.g., Neoverse N1, Ampere Altra, ThunderX2.
>>
>> [3] shows the prefetch macro definitions in VPP. Using CLIB_PREFETCH(), you can prefetch data residing in multiple cache lines. [4] shows some usage examples of the prefetch macros in VPP. When running the Arm generic image (128B cache-line-size definition) on 64B cache-line CPUs (N1SDP for example), 4.2, 4.3 and 4.4 have issues.
>> - For 4.2, the input for the size parameter is 68. On N1SDP with a 64B cache line, two prefetch instructions should be executed, but because 68 is less than CLIB_CACHE_LINE_BYTES (defined as 128 in VPP), only the first prefetch instruction is executed.
>> - For 4.3, if sizeof (ip0[0]) equals 68 or any other value larger than 64B, there is the same issue as in 4.2.
>> - For 4.4, the code is trying to prefetch the first 128B of packet content; it assumes CLIB_CACHE_LINE_BYTES is always 64B. In the Arm generic image the input for the size parameter is 256B, which executes prefetches on unexpected cache lines (expected prefetches on 64B-0 and 64B-1, but actually on 64B-0 and 64B-2).
>>   Packet content: [64B-0][64B-1][64B-2][64B-3]
>>
>> Our proposal is to introduce a macro CLIB_N_CACHELINE_BYTES via the VPP multi-arch feature (see patch [1.2]) to reflect the runtime CPU cache line size in the Arm generic image, so that the prefetch instructions are executed correctly. Then, for 4.4, we will need to change the size parameter from 2*CLIB_CACHE_LINE_BYTES to 128B, to reflect the actual intention.
>>
>> Some additional macros [1.3] can be added for users to prefetch based on a number of cache lines, in addition to a number of bytes.
>>
>> Could you please suggest on the issue and proposal?
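To make the intent of the proposal concrete, here is a hedged sketch. Only the macro name CLIB_N_CACHELINE_BYTES comes from patch [1.2]; the multiarch guard names and the wrapper function are purely hypothetical:

'''
#include <vppinfra/cache.h> /* CLIB_PREFETCH */

/* Hypothetical illustration of the patch [1.2] idea: each multiarch
 * variant is compiled with the line size of the CPUs it targets, so
 * the generic Arm image uses the right stride at runtime. The guard
 * macro names below are invented for illustration. */
#if defined(CLIB_MARCH_VARIANT_NEOVERSEN1)  /* assumed guard, 64B part */
#define CLIB_N_CACHELINE_BYTES 64
#elif defined(CLIB_MARCH_VARIANT_THUNDERX2) /* assumed guard, 128B part */
#define CLIB_N_CACHELINE_BYTES 128
#else
#define CLIB_N_CACHELINE_BYTES CLIB_CACHE_LINE_BYTES /* native build */
#endif

/* Case 4.4 rewritten as proposed: state the real intent -- "the first
 * 128 bytes of packet data" -- instead of 2*CLIB_CACHE_LINE_BYTES,
 * so the byte count no longer depends on the line-size definition. */
static inline void
prefetch_first_128B (void *packet_data)
{
  CLIB_PREFETCH (packet_data, 128, LOAD);
}
'''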
>>
>> [1]. Patches
>> [1.1] build: support 128B/64B cache line size in Arm image, https://gerrit.fd.io/r/c/vpp/+/32968/2
>> [1.2] vppinfra: refactor prefetch macro, https://gerrit.fd.io/r/c/vpp/+/32969/3
>> [1.3] vppinfra: fix functions to prefetch single line, https://gerrit.fd.io/r/c/vpp/+/32970/2
>> [1.4] misc: correct prefetch macro usage, https://gerrit.fd.io/r/c/vpp/+/32971/3
>> [1.5] build: set 64B cache line size in Arm image, https://gerrit.fd.io/r/c/vpp/+/32972/2
>>
>> [2]. Error message
>> src/plugins/dpdk/device/init.c:1916:3: error: static_assert failed due to requirement '128 == 1 << 6' "DPDK RTE CACHE LINE SIZE does not match with 1<<CLIB_LOG2_CACHE_LINE_BYTES"
>>   STATIC_ASSERT (RTE_CACHE_LINE_SIZE == 1 << CLIB_LOG2_CACHE_LINE_BYTES,
>>   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> /home/lijian/tasks/plsremove/src/vppinfra/error_bootstrap.h:111:34: note: expanded from macro 'STATIC_ASSERT'
>> #define STATIC_ASSERT(truth,...) _Static_assert(truth, __VA_ARGS__)
>>                                  ^              ~~~~~
>> 1 error generated.
>>
>> [3]. Prefetch macro definitions in VPP source code.
>> '''
>> #define _CLIB_PREFETCH(n,size,type)                              \
>>   if ((size) > (n)*CLIB_CACHE_LINE_BYTES)                        \
>>     __builtin_prefetch (_addr + (n)*CLIB_CACHE_LINE_BYTES,       \
>>                         CLIB_PREFETCH_##type,                    \
>>                         /* locality */ 3);
>>
>> #define CLIB_PREFETCH(addr,size,type)                \
>> do {                                                 \
>>   void * _addr = (addr);                             \
>>                                                      \
>>   ASSERT ((size) <= 4*CLIB_CACHE_LINE_BYTES);        \
>>   _CLIB_PREFETCH (0, size, type);                    \
>>   _CLIB_PREFETCH (1, size, type);                    \
>>   _CLIB_PREFETCH (2, size, type);                    \
>>   _CLIB_PREFETCH (3, size, type);                    \
>> } while (0)
>> '''
>>
>> [4]
>> 4.1 CLIB_PREFETCH (p2->pre_data, 32, STORE);
>> 4.2 CLIB_PREFETCH (p2->pre_data, 68, STORE);
>> 4.3 CLIB_PREFETCH (b[4]->data, sizeof (ip0[0]), LOAD);
>> 4.4 CLIB_PREFETCH (p2->data, 2*CLIB_CACHE_LINE_BYTES, LOAD);
>>
>> Thanks.
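As a closing illustration of why case 4.4 misfires, the macros in [3] can be expanded by hand for size = 2*CLIB_CACHE_LINE_BYTES = 256 under the 128B definition. The wrapper function below is just scaffolding for the trace, with 'data' standing in for p2->data and 0 for CLIB_PREFETCH_LOAD:

'''
/* Hand expansion of CLIB_PREFETCH (p2->data, 256, LOAD) when
 * CLIB_CACHE_LINE_BYTES == 128, following the definitions in [3]. */
static inline void
expanded_4_4 (void *data)
{
  char *_addr = (char *) data;
  /* ASSERT (256 <= 4 * 128) passes */
  if (256 > 0 * 128) __builtin_prefetch (_addr + 0 * 128, 0, 3); /* emitted */
  if (256 > 1 * 128) __builtin_prefetch (_addr + 1 * 128, 0, 3); /* emitted */
  if (256 > 2 * 128) __builtin_prefetch (_addr + 2 * 128, 0, 3); /* never   */
  if (256 > 3 * 128) __builtin_prefetch (_addr + 3 * 128, 0, 3); /* never   */
  /* On a 64B-line CPU the two live prefetches hit offsets 0 and 128,
   * i.e. 64B-0 and 64B-2; 64B-1 is never prefetched, even though the
   * intent was the first 128 bytes (64B-0 and 64B-1). */
}
'''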