On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger <jo...@bec.de> wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same optimizations as with `__memcmpeq`, but there still needs to be some logic that tracks which comes first: the mismatch or the null terminator.

The saving is not quite as large as for `memcmp` vs `__memcmpeq`, but we can still save some work.

Using the x86_64 AVX2 optimized implementation as reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return method of checking equals + strlen from:

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm1
vpmovmskb %ymm1, %ecx
incl %ecx
jz L(keep_going)
tzcntl %ecx, %ecx
movzbl (%rdi, %rcx), %eax
movzbl (%rsi, %rcx), %ecx
subl %ecx, %eax
vzeroupper
ret
```

To:

```
VMOVU (%rdi), %ymm0
VPCMPEQ (%rsi), %ymm0, %ymm1
VPCMPEQ %ymm0, %ymmZERO, %ymm2
vpandn %ymm1, %ymm2, %ymm2
vpmovmskb %ymm2, %ecx
incl %ecx
jz L(keep_going)
vpmovmskb %ymm1, %eax
blsi %ecx, %ecx
andn %eax, %ecx, %eax
vzeroupper
ret
```

Testing this with comparisons where the mismatch or the end of the string falls within the first 32 bytes (the common case), throughput is about the same but latency drops by ~20%.

Another benefit is that we can reuse this exact return logic throughout, since the memory offset is no longer required. This simplifies the page cross logic a great deal and will net us some serious code size reduction for the common usage of strcmp.

I think, though, that I was a bit overoptimistic about the performance benefits, as I was using `memcmp` vs `__memcmpeq` as a reference.

I'll put together a patch for just `__strcmpeq` and post the results here. I think the wide-character versions have more expensive return value checks, so if the character versions show a benefit we can expect it to translate.

>
> Joerg
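
For anyone who wants to check the mask arithmetic without reading the assembly, here is a rough scalar C model of one 32-byte step of the second sequence above. It is only a sketch: the helper names (`eq_mask32`, `null_mask32`, `strcmpeq_block32`) are made up for illustration, the byte loops stand in for the vector compares, and it assumes the final `andn` is meant to produce `~eq_mask & lowest_stop_bit`, i.e. non-zero exactly when the first stopping position is a mismatch rather than a shared terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i is set when a[i] == b[i] (models VPCMPEQ + vpmovmskb). */
static uint32_t eq_mask32(const unsigned char *a, const unsigned char *b)
{
    uint32_t m = 0;
    for (int i = 0; i < 32; i++)
        if (a[i] == b[i])
            m |= (uint32_t)1 << i;
    return m;
}

/* Bit i is set when a[i] == 0 (models VPCMPEQ against a zero register). */
static uint32_t null_mask32(const unsigned char *a)
{
    uint32_t m = 0;
    for (int i = 0; i < 32; i++)
        if (a[i] == 0)
            m |= (uint32_t)1 << i;
    return m;
}

/*
 * Hypothetical model of one 32-byte step.
 * Returns 0 when all 32 bytes are equal and non-null (caller keeps going).
 * Returns 1 when this block decides the comparison; *result is then 0 for
 * "equal" and non-zero for "different". No byte is reloaded and no memory
 * offset is computed, unlike the tzcnt + movzbl + sub variant.
 */
static int strcmpeq_block32(const unsigned char *a, const unsigned char *b,
                            uint32_t *result)
{
    assert(a != NULL && b != NULL && result != NULL);

    uint32_t eq   = eq_mask32(a, b);
    uint32_t live = eq & ~null_mask32(a); /* vpandn: equal and not null */
    uint32_t stop = live + 1;             /* incl: wraps to 0 if all set */

    if (stop == 0)
        return 0;                         /* jz L(keep_going) */

    uint32_t first = stop & (0u - stop);  /* blsi: first mismatch or null */
    *result = first & ~eq;                /* andn (assumed intent) */
    return 1;
}
```

The point of the model is that the boolean result only needs the two masks that are already in registers, which is where the latency saving would come from.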