On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons <sgibb...@openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 >> 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 >> 1.000265957x >> StringIndexOf.success 9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 >> 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional > commit since the last revision: > > Rearrange; add lambdas for clarity src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 383: > 381: { > 382: Label L_short; > 383: A comment here: // Broadcast the beginning of needle into a vector register. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 390: > 388: __ vpbroadcastb(byte_0, Address(needle, 0), > Assembler::AVX_256bit); > 389: } > 390: A comment here: // Broadcast the end of needle into a vector register. This step is not needed for single element needle. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 418: > 416: __ cmpq(haystack_len, 0x10); > 417: __ ja_b(L_moreThan16); > 418: An assert here to check for header size >= 16 would be good. Also a comment here would he good, something like: // Copy 16 or 32 bytes prior to haystack end onto stack // This will possibly including some object header bytes when haystack length is less than 16 or 32 bytes // Set the new haystack address to beginning of copied haystack on stack adjusting for extra bytes copied src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 498: > 496: > 497: // big_case_loop_helper will fall through to this point if one or > more potential matches are found > 498: // The mask will have a bitmask indicating the position of the > potential matches within the haystack If no potential match, which label does the big_case_loop_helper jump to? src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 517: > 515: __C2 arrays_equals(false, haystackStart, firstNeedleCompare, > compLen, retval, rScratch, xmm_tmp3, xmm_tmp4, > 516: false /* char */, knoreg); > 517: __ testl(retval, retval); Since this is byte compare even for isU, the retval here could be a 64-bit quantity so the testl should be a testq. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 553: > 551: // Haystack always copied to stack, so 32-byte reads OK > 552: // Haystack length < 32 > 553: // 10 < needle length < 32 The comment below may need update as we come here for needle_len > OPT_NEEDLE_SIZE_MAX which is currently set as 5: // 10 < needle length < 32 src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 611: > 609: __C2 arrays_equals(false, rTmp, firstNeedleCompare, compLen, > rTmp3, rTmp2, xmm_tmp3, xmm_tmp4, false /* char */, > 610: knoreg); > 611: __ testl(rTmp3, rTmp3); Since this is byte compare even for isU, the rtmp3 here could be a 64-bit quantity so the testl should be a testq. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 629: > 627: > 628: __ bind(L_returnError); > 629: __ movq(rbp, -1); This could directly be rax instead of intermediate rbp and then moving from rbp to rax. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 633: > 631: > 632: __ bind(L_returnZero); > 633: __ xorl(rbp, rbp); This could directly be rax instead of intermediate rbp and then moving from rbp to rax. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592791718 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592792401 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592774634 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592866631 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592868501 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592880650 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592885514 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592892211 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592892329