Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]

Scott Gibbons Sat, 23 Mar 2024 09:51:38 -0700

On Sat, 23 Mar 2024 09:57:49 GMT, Francesco Nigro <d...@openjdk.org> wrote:


>> Scott Gibbons has updated the pull request with a new target base due to a 
>> merge or a rebase. The pull request now contains 46 commits:
>> 
>>  - Merge branch 'openjdk:master' into indexof
>>  - Cleaned up, ready for review
>>  - Pre-cleanup code
>>  - Add JMH. Add 16-byte compares to arrays_equals
>>  - Better method for mask creation
>>  - Merge branch 'openjdk:master' into indexof
>>  - Most cleanup done.
>>  - Remove header dependency
>>  - Works - needs cleanup
>>  - Passes tests.
>>  - ... and 36 more: https://git.openjdk.org/jdk/compare/bc739639...e079fc12
>
> Hi, in Netty, we have our own AsciiString::indexOf based on SWAR techniques, 
> which is manually loop unrolling the head processing (first < 8 bytes) to 
> artificially make sure the branch predictor got different branches to care 
> AND JIT won't make it wrong. We have measured (I can provide a link of the 
> benchmark and results, If you are interested) that it delivers a much better 
> performance on tiny strings and makes a smoother  degradation vs perfectly 
> aligned string length as well. Clearly this tends to be much visible if the 
> input strings have shuffled delimiter positions, to make the branch 
> prediction misses more relevant.

Hi @franz1981.  I'd be interested in seeing the code.  I looked 
[here](https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/common/src/main/java/io/netty/util/AsciiString.java#L696),
 but that just looks like a naïve implementation, so I must be missing 
something.

This code uses vector compares to search for the first byte of the substring 
(needle) in 32-byte chunks of the input string (haystack).  However, once a 
character is found, it also checks for a match corresponding to the last byte 
of the needle within the haystack before doing the full needle comparison.  
This is also in 32-byte chunks.  That is, we load a vector register with 32 (or 
16 for wide chars) copies of the first byte of the needle, and another with 
copies of the last byte of the needle.  The first comparison is done at the 
start of the haystack, giving us indication of the presence and index of the 
first byte.  We then compare the last byte of the needle at the haystack 
indexed at needle length - 1 (i.e., the last byte).  This tells us if the last 
byte of the needle appears in the correct relative position within the 
haystack.  ANDing these results tells us whether or not we have a candidate 
needle within the haystack, as well as the position of the needle.  Only then 
do we
  do a full char-by-char comparison of the needle to the haystack (this is also 
done with vector instructions when possible).

A lot of this code is there to handle the cases of small-ish needles within 
small-ish haystacks, where 32-byte reads are not possible (due to reading past 
the end of the strings, possibly generating a page fault).  I handle less than 
32-byte haystacks by copying the haystack to the stack, where I can be assured 
that 32-byte reads will be possible.  So there are special cases for haystack < 
32 bytes with needle sizes <= 10 (arbitrary) and one for haystacks > 32 bytes 
and needle size <= 10.  I also added a section for haystacks <= 32 bytes and 
needle sizes < 5, which seem to be the most common cases.  This path copies the 
haystack to the stack (a single vector read & write) and up to 5 vector 
comparisons, one for each byte of the needle, with no branching or looping.

I'd be very interested in seeing the Netty SWAR implementation.  Thanks for the 
comment.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16753#issuecomment-2016544706

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v14]

Reply via email to