On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger <jo...@bec.de> wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same
optimizations as with `__memcmpeq`, but there still needs to be some
logic that tracks which comes first: the mismatch or the null terminator.

The savings aren't as large as for `memcmp` vs. `__memcmpeq`, but we can
still save some work.
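To make that remaining logic concrete, here is a minimal scalar sketch of what a boolean-only variant has to compute. The name `strcmpeq_ref` is hypothetical, and this is a reference model of the semantics, not an optimized implementation: it still has to stop at whichever comes first, a mismatch or the null terminator, but it never has to produce a signed difference.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar reference for a boolean-only strcmp variant.
   Returns 0 if s1 and s2 are equal, nonzero otherwise. Only the
   zero/nonzero distinction is meaningful, unlike strcmp's sign. */
static int strcmpeq_ref(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Advance while the bytes match and s1 has not ended. Since the
       bytes match, s2 has not ended either. */
    while (*p1 != '\0' && *p1 == *p2) {
        p1++;
        p2++;
    }
    /* We stopped at either a mismatch or a shared null terminator;
       one comparison tells the two cases apart. */
    return *p1 != *p2;
}

/* Checks that the boolean variant agrees with (strcmp(a, b) == 0). */
static void check_agrees(const char *a, const char *b)
{
    assert((strcmpeq_ref(a, b) == 0) == (strcmp(a, b) == 0));
}
```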

Using the x86_64 AVX2 optimized implementation as reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return path, which checks for equality and the null terminator at the same time, from:

```
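VMOVU (%rdi), %ymm0              /* load 32 bytes of s1 */
VPCMPEQ (%rsi), %ymm0, %ymm1     /* ymm1: bytes equal to s2 */
VPCMPEQ %ymm0, %ymmZERO, %ymm2   /* ymm2: null bytes in s1 */
vpandn %ymm1, %ymm2, %ymm1       /* ymm1: equal AND not null */
vpmovmskb %ymm1, %ecx
incl %ecx                        /* 0 iff all 32 bytes equal and non-null */
jz L(keep_going)
tzcntl %ecx, %ecx                /* index of first mismatch or null */
movzbl (%rdi, %rcx), %eax        /* reload both bytes at that offset */
movzbl (%rsi, %rcx), %ecx
subl %ecx, %eax                  /* signed byte difference */
vzeroupper
ret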
```

To

```
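VMOVU (%rdi), %ymm0              /* load 32 bytes of s1 */
VPCMPEQ (%rsi), %ymm0, %ymm1     /* ymm1: bytes equal to s2 (kept) */
VPCMPEQ %ymm0, %ymmZERO, %ymm2   /* ymm2: null bytes in s1 */
vpandn %ymm1, %ymm2, %ymm2       /* ymm2: equal AND not null */
vpmovmskb %ymm2, %ecx
incl %ecx                        /* 0 iff all 32 bytes equal and non-null */
jz L(keep_going)
vpmovmskb %ymm1, %eax            /* eax: equal-byte mask */
blsi %ecx, %ecx                  /* isolate first mismatch-or-null bit */
andn %ecx, %eax, %eax            /* eax = ecx & ~eax: nonzero iff bytes differ there */
vzeroupper
ret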
```
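For readers who don't speak AVX2, here is a rough scalar model of one 32-byte step of both return paths. All function names are hypothetical; the model assumes at least 32 readable bytes at each pointer (the real code handles page crossing separately) and uses the GCC/Clang builtin `__builtin_ctz` as a stand-in for `tzcnt`. The point it illustrates is that the ordered path needs the byte offset to reload and subtract, while the boolean path works from the two masks alone, keeping the isolated stop bit only where the bytes differ (`stop & ~eq`).

```c
#include <assert.h>
#include <stdint.h>

/* Build the equal-byte mask and the equal-and-non-null mask
   (models VPCMPEQ / vpandn / vpmovmskb). */
static void chunk_masks(const unsigned char *s1, const unsigned char *s2,
                        uint32_t *eq, uint32_t *eq_nonnull)
{
    *eq = 0;
    *eq_nonnull = 0;
    for (int i = 0; i < 32; i++) {
        if (s1[i] == s2[i]) {
            *eq |= UINT32_C(1) << i;
            if (s1[i] != 0)
                *eq_nonnull |= UINT32_C(1) << i;
        }
    }
}

/* Existing path: strcmp-style byte difference. Sets *keep_going when
   the whole chunk matched with no null byte. */
static int strcmp_chunk_model(const unsigned char *s1, const unsigned char *s2,
                              int *keep_going)
{
    uint32_t eq, eq_nonnull;
    chunk_masks(s1, s2, &eq, &eq_nonnull);
    uint32_t stop = eq_nonnull + 1;       /* incl: 0 iff nothing stopped us */
    *keep_going = (stop == 0);
    if (*keep_going)
        return 0;
    unsigned idx = (unsigned)__builtin_ctz(stop); /* tzcntl */
    /* Needs the offset to reload both bytes (movzbl x2, subl). */
    return (int)s1[idx] - (int)s2[idx];
}

/* Boolean path: 0 if equal, nonzero if different. Uses only the masks,
   so no offset and no reload of s1/s2. */
static uint32_t strcmpeq_chunk_model(const unsigned char *s1,
                                     const unsigned char *s2, int *keep_going)
{
    uint32_t eq, eq_nonnull;
    chunk_masks(s1, s2, &eq, &eq_nonnull);
    uint32_t stop = eq_nonnull + 1;       /* incl */
    *keep_going = (stop == 0);
    if (*keep_going)
        return 0;
    stop &= 0u - stop;                    /* blsi: first mismatch-or-null bit */
    return stop & ~eq;                    /* nonzero iff the bytes differ there */
}
```

Because the boolean result depends only on `eq` and `eq_nonnull`, the same few instructions can serve every exit point regardless of where the chunk was loaded from.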

Testing this with comparisons where the mismatch or null terminator falls within
the first 32 bytes (the common case), throughput is about the same, but latency
drops by ~20%.

Another benefit is that we can reuse this exact return logic throughout, since
it no longer needs the memory offset of the mismatch. This simplifies the
page-cross logic a great deal and will net us a significant code-size reduction
for the common usage of strcmp.

That said, I think I was a bit overoptimistic about the performance benefits,
since I was using `memcmp` vs. `__memcmpeq` as a reference. I'll put together
a patch for just `__strcmpeq` and post the results here. I think the
wide-character versions have more expensive return-value checks, so if the
character versions show a benefit, we can expect it to carry over.
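One plausible source of that extra cost, sketched in scalar C under the assumption of the usual x86_64 glibc ABI where `wchar_t` is a signed 32-bit type: the ordered result cannot simply be `c1 - c2`, because that difference can overflow `int`, so an explicit comparison is needed. A boolean variant only has to report "differ or not". Both function names are hypothetical and this is not glibc code; in a scalar model the gap is small, and the real savings would come in the vectorized return paths.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical ordered compare: stops at the first mismatch or shared
   null terminator, then needs a comparison because c1 - c2 can
   overflow int when wchar_t is 32 bits wide. */
static int wcscmp_sketch(const wchar_t *s1, const wchar_t *s2)
{
    while (*s1 != L'\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    if (*s1 == *s2)
        return 0;
    return *s1 < *s2 ? -1 : 1;
}

/* Hypothetical boolean compare: same scan, but no ordering work. */
static int wcscmpeq_sketch(const wchar_t *s1, const wchar_t *s2)
{
    while (*s1 != L'\0' && *s1 == *s2) {
        s1++;
        s2++;
    }
    return *s1 != *s2;
}
```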



>
> Joerg
