On 10/17/18 4:03 PM, Florian Weimer wrote:
> * Aaron Sawdey:
>
>> I've previously posted a patch to add vector/vsx inline expansion of
>> strcmp/strncmp for the power8/power9 processors. Here are some of the
>> other items I have in the pipeline that I hope to get into gcc9:
>>
>> * vector/vsx support for inline expansion of memcmp to non-loop code.
>>   This improves performance of small memcmp.
>> * vector/vsx support for inline expansion of memcmp to loop code. This
>>   will close the performance gap for lengths of about 128-512 bytes
>>   by making the loop code closer to the performance of the library
>>   memcmp.
>> * generate inline expansion to a loop for strcmp/strncmp. This closes
>>   another performance gap because the strcmp/strncmp vector/vsx code
>>   currently generated is lots faster than the library call, but we
>>   only generate comparison of 64 bytes to avoid exploding code size.
>>   Similar code in a loop would be compact and allow inline comparison
>>   of maybe the first 512 bytes before dumping to the library
>>   function.
>>
>> If anyone has any other input on the inline expansion work I've been
>> doing for the rs6000 target, please let me know.
>
> The inline expansion of strcmp is problematic for valgrind:
>
> <https://bugs.kde.org/show_bug.cgi?id=386945>
I'm aware of this. One thing that will help is that I believe the vsx
expansion for strcmp/strncmp does not have this problem, so with current
gcc9 trunk the problem should only be seen if one of the strings is known
at compile time to be less than 16 bytes, or if -mcpu=power7 is used, or
if vector/vsx is disabled. My position is that it is valgrind's problem
if it doesn't understand correct code, but I also want valgrind to be a
useful tool, so I'm going to take a look and see if I can find a GPR
sequence that is equally fast and that valgrind can understand.

> We currently see around 0.5 KiB of instructions for each call to
> strcmp.  I find it hard to believe that this improves general system
> performance except in micro-benchmarks.

The expansion of strcmp where both arguments are strings of unknown
length at compile time compares 64 bytes and then calls strcmp on the
remainder if no difference is found (a rough sketch of the shape is
appended at the end of this mail). If the GPR sequence is used (power7,
or vector/vsx disabled), the overhead is 91 instructions. If the power8
vsx sequence is used, the overhead is 59 instructions. If the power9 vsx
sequence is used, the overhead is 41 instructions.

Yes, this will increase the instruction footprint. However, the
processors this targets (power7, power8, power9) all have aggressive
instruction prefetch. Doing some of the comparison inline makes the
common cases (strings that are totally different, or strings that are
identical and <= 64 bytes in length) very much faster, and avoiding the
PLT call also means less pressure on the count cache and better branch
prediction elsewhere.

If you are aware of any real-world code that is faster when built with
-fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know so I
can look at avoiding those situations.

  Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
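
P.S. For reference, here is a rough C-level sketch of the control flow
the expansion produces for the unknown-length strcmp case. It is only an
illustration (the function name is made up, and the real expansion emits
straight-line code working on 16-byte vsx or 8-byte GPR chunks and has
to worry about alignment and page crossing, none of which is shown
here):

#include <string.h>

/* Illustrative sketch only: compare up to the first 64 bytes inline,
   and call the library strcmp on the remainder only if no difference
   and no terminating NUL was seen in the inline part.  */
static int
strcmp_inline_sketch (const char *a, const char *b)
{
  size_t i;
  for (i = 0; i < 64; i++)
    {
      unsigned char ca = (unsigned char) a[i];
      unsigned char cb = (unsigned char) b[i];
      if (ca != cb)
        return ca - cb;     /* Strings differ in the inline part.  */
      if (ca == '\0')
        return 0;           /* Both strings end in the inline part.  */
    }
  /* No difference in the first 64 bytes; the library handles the rest.  */
  return strcmp (a + 64, b + 64);
}

The common cases (an early difference, or a short identical string) never
leave the inline code; only the long-identical-prefix case pays for the
library call.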