On 10/17/18 4:03 PM, Florian Weimer wrote:
> * Aaron Sawdey:
> 
>> I've previously posted a patch to add vector/vsx inline expansion of
>> strcmp/strncmp for the power8/power9 processors. Here are some of the
>> other items I have in the pipeline that I hope to get into gcc9:
>>
>> * vector/vsx support for inline expansion of memcmp to non-loop code.
>>   This improves performance of small memcmp.
>> * vector/vsx support for inline expansion of memcmp to loop code. This
>>   will close the performance gap for lengths of about 128-512 bytes
>>   by making the loop code closer to the performance of the library
>>   memcmp.
>> * generate inline expansion to a loop for strcmp/strncmp. This closes
>>   another performance gap because strcmp/strncmp vector/vsx code
>>   currently generated is lots faster than the library call but we
>>   only generate comparison of 64 bytes to avoid exploding code size.
>>   Similar code in a loop would be compact and allow inline comparison
>>   of maybe the first 512 bytes inline before dumping to the library
>>   function.
>>
>> If anyone has any other input on the inline expansion work I've been
>> doing for the rs6000 target, please let me know.
> 
> The inline expansion of strcmp is problematic for valgrind:
> 
>   <https://bugs.kde.org/show_bug.cgi?id=386945>

I'm aware of this. One thing that will help: I believe the vsx
expansion for strcmp/strncmp does not have this problem, so with
current gcc9 trunk it should only show up if one of the strings is
known at compile time to be less than 16 bytes, if -mcpu=power7 is
used, or if vector/vsx is disabled. My position is that it is
valgrind's problem if it doesn't understand correct code, but I also
want valgrind to be a useful tool, so I'm going to take a look and see
if I can find a gpr sequence that is equally fast and that valgrind
can understand.

> We currently see around 0.5 KiB of instructions for each call to
> strcmp.  I find it hard to believe that this improves general system
> performance except in micro-benchmarks.

The expansion of strcmp where both arguments are strings of unknown
length at compile time compares up to 64 bytes inline, then calls the
library strcmp on the remainder if no difference is found. If the gpr
sequence is used (p7 or vec/vsx disabled), the overhead is 91
instructions. If the p8 vsx sequence is used, the overhead is 59
instructions. If the p9 vsx sequence is used, the overhead is 41
instructions.
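
To make the behavior described above concrete, here is a rough C sketch
of what the expansion computes. This is only an illustration of the
semantics, not the code gcc generates: the real expansion uses gpr or
vsx sequences that compare many bytes at a time. The function name and
the INLINE_LIMIT constant are hypothetical names chosen for this
example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical constant: number of bytes compared before falling back. */
enum { INLINE_LIMIT = 64 };

/* Semantic model only: compare the first INLINE_LIMIT bytes "inline",
   then hand the remainder to the library strcmp. The byte-at-a-time
   loop stands in for the wider gpr/vsx compares. */
static int strcmp_inline_then_lib(const char *a, const char *b)
{
    for (size_t i = 0; i < INLINE_LIMIT; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb)
            return (ca > cb) - (ca < cb);  /* difference found inline */
        if (ca == '\0')
            return 0;                      /* equal and within the limit */
    }
    /* No difference in the first INLINE_LIMIT bytes: library call. */
    return strcmp(a + INLINE_LIMIT, b + INLINE_LIMIT);
}
```

Strings that differ early, or that are identical and fit within the
limit, never reach the library call in this model.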

Yes, this will increase the instruction footprint. However, the
processors that this targets (p7, p8, p9) all have aggressive
iprefetch. Doing some of the comparison inline makes the common cases
(strings that are totally different, or identical and <= 64 bytes in
length) very much faster, and avoiding the plt call also means less
pressure on the count cache and better branch prediction elsewhere.

If you are aware of any real world code that is faster when built
with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
so I can look at avoiding those situations.
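
For anyone who wants to try this, a minimal timing sketch might look
like the following. It is only a starting point under assumptions: the
helper names are hypothetical, the string pairs should come from the
real workload, and a micro-benchmark like this will not show
instruction cache effects the way a full application would. Build the
same file twice, once normally and once with -fno-builtin-strcmp added,
and compare the times.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: counts equal pairs so that every strcmp result
   is actually used and the calls are not discarded. */
static unsigned long count_equal(const char *const *a,
                                 const char *const *b,
                                 size_t n, unsigned long reps)
{
    unsigned long equal = 0;
    for (unsigned long r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            if (strcmp(a[i], b[i]) == 0)
                equal++;
    return equal;
}

/* Hypothetical helper: returns CPU seconds spent in count_equal and
   stores the number of equal comparisons in *equal_out. */
static double time_count_equal(const char *const *a,
                               const char *const *b,
                               size_t n, unsigned long reps,
                               unsigned long *equal_out)
{
    clock_t start = clock();
    *equal_out = count_equal(a, b, n, reps);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The strings should be read at run time (from a file or argv) so the
compiler cannot know their lengths and fold the comparisons away.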

  Aaron

-- 
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain
