On Sat, 25 May 2024 22:19:41 GMT, Scott Gibbons <sgibb...@openjdk.org> wrote:

>> Re-write the IndexOf code without the use of the pcmpestri instruction, only 
>> using AVX2 instructions.  This change accelerates String.IndexOf on average 
>> 1.3x for AVX2.  The benchmark numbers:
>> 
>> 
>> Benchmark                                                   Score            
>> Latest          
>> StringIndexOf.advancedWithMediumSub   343.573                317.934         
>> 0.925375393x
>> StringIndexOf.advancedWithShortSub1    1039.081              1053.96         
>> 1.014319384x
>> StringIndexOf.advancedWithShortSub2        55.828            110.541         
>> 1.980027943x
>> StringIndexOf.constantPattern                        9.361           11.906  
>>         1.271872663x
>> StringIndexOf.searchCharLongSuccess          4.216           4.218           
>> 1.000474383x
>> StringIndexOf.searchCharMediumSuccess        3.133           3.216           
>> 1.02649218x
>> StringIndexOf.searchCharShortSuccess 3.76                    3.761           
>> 1.000265957x
>> StringIndexOf.success                                        9.186           
>> 9.713           1.057369911x
>> StringIndexOf.successBig                           14.341            46.343  
>>         3.231504079x
>> StringIndexOfChar.latin1_AVX2_String   6220.918              12154.52        
>>         1.953814533x
>> StringIndexOfChar.latin1_AVX2_char     5503.556              5540.044        
>>         1.006629895x
>> StringIndexOfChar.latin1_SSE4_String   6978.854              6818.689        
>>         0.977049957x
>> StringIndexOfChar.latin1_SSE4_char     5657.499              5474.624        
>>         0.967675646x
>> StringIndexOfChar.latin1_Short_String          7132.541              
>> 6863.359                0.962260014x
>> StringIndexOfChar.latin1_Short_char  16013.389             16162.437         
>> 1.009307711x
>> StringIndexOfChar.latin1_mixed_String          7386.123            14771.622 
>>         1.999915517x
>> StringIndexOfChar.latin1_mixed_char    9901.671              9782.245        
>>         0.987938803
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fix tests

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239:

> 237:   //  the needle size is less than 32 bytes, we default to a
> 238:   //  byte-by-byte comparison (this will be rare).
> 239:   //

Is this still true?

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278:

> 276:   __ bind(L_nextCheck);
> 277:   __ testq(haystack_len_p, haystack_len_p);
> 278:   __ je(L_zeroCheckFailed);

This check could be removed as the next check covers this one.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360:

> 358:   __ push(rcx);
> 359:   __ push(r8);
> 360:   __ push(r9);

No need to save/restore rcx/r8/r9 on windows platform as well.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379:

> 377: 
> 378:   // Assume failure
> 379:   __ movq(rbp, -1);

We are no more using rbp at return point so this is not needed now?

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488:

> 486:   __ cmpq(r11, nMinusK);
> 487:   __ ja_b(L_return);
> 488:   __ movq(rax, r11);

At places where we know that return value in r11 is correct, we dont need to 
checkRange so this could have its own label.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566:

> 564:     //  rbp: -1
> 565:     //  XMM_BYTE_0 - first element of needle broadcast
> 566:     //  XMM_BYTE_K - last element of needle broadcast

The only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578:

> 576:     // helper jumps to L_checkRangeAndReturn with a (-1) return value.
> 577:     big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, 
> mask, hsPtrRet, needleLen,
> 578:                          needle, haystack, hsLength, tmp1, tmp2, tmp3, 
> rScratch, ae, _masm);

If we run out of haystack instead of jumping to L_checkRangeAndReturn, we could 
directly jump to  L_retrunError.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597:

> 595: 
> 596:     // Need a lot of registers here to preserve state across 
> arrays_equals call
> 597: 

This comment is no longer valid, could be removed.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621:

> 619:     __ addq(hsPtrRet, index);
> 620:     __ movq(r11, hsPtrRet);
> 621:     __ jmp(L_checkRangeAndReturn);

Why do we have to checkRange here, would it not be always correct? It so we 
could return r11 directly (by moving into rax).

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 660:

> 658:   //  Haystack always copied to stack, so 32-byte reads OK
> 659:   //  Haystack length < 32
> 660:   //  10 < needle length < 32

Haystack length <= 32
10 < needle length <= 32

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 721:

> 719:                        false /* char */, knoreg);
> 720:     __ testl(rTmp3, rTmp3);
> 721:     __ jne(L_checkRangeAndReturn);

Why do we have to checkRange here, would it not be always correct? It so we 
could return r11 directly (by moving into rax).

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1333:

> 1331: 
> 1332:   __ cmpq(nMinusK, 32);
> 1333:   __ jae_b(L_greaterThan32);

Should this check be (n-k+1) >= 32? And so accordingly (n-k) >= 31
 __ cmpq(nMinusK, 31);
 __ jae_b(L_greaterThan32);

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1382:

> 1380: 
> 1381:   __ testl(eq_mask, eq_mask);
> 1382:   __ je(noMatch);

We are mixing operation width l and q here.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1750:

> 1748:     //  r15 = unused
> 1749:     //  XMM_BYTE_0 - first element of needle, broadcast
> 1750:     //  XMM_BYTE_K - last element of needle, broadcast

This comment is duplicated for both small haystack case and big haystack case, 
could be made a common comment. 
Also the only registers that are used as input in the switch case are:
r14 = needle
rbx = haystack
rsi = haystack length (n)
r12 = needle length (k)
r10 = n - k (where k is needle length)
XMM_BYTE_0 = first element of needle, broadcast
XMM_BYTE_K = last element of needle, broadcast
So we could only list these, making it easier to comprehend.

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1758:

> 1756:     //
> 1757:     // If a match is found, jump to L_checkRange, which ensures the
> 1758:     // matched needle is not past the end of the haystack.

Another comment here would be useful:
// The found index is returned in set_bit (r11).

src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1810:

> 1808:     //  XMM_BYTE_K - last element of needle, broadcast
> 1809:     //
> 1810:     //  The haystack is > 32 bytes

Good to mention some info about the return found index value in comment about 
how it is a combination of set_bit (r8), hs_ptr, and haystack.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617187600
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617193503
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617216424
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617218826
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617603927
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617318645
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617307443
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617536831
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617569308
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617575018
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617601913
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616424912
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616427773
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617263035
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617267415
PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617273352

Reply via email to