On Sat, 25 May 2024 22:19:41 GMT, Scott Gibbons <sgibb...@openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 >> 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 >> 1.000265957x >> StringIndexOf.success 9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 >> 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional > commit since the last revision: > > Fix tests src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 239: > 237: // the needle size is less than 32 bytes, we default to a > 238: // byte-by-byte comparison (this will be rare). > 239: // Is this still true? src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 278: > 276: __ bind(L_nextCheck); > 277: __ testq(haystack_len_p, haystack_len_p); > 278: __ je(L_zeroCheckFailed); This check could be removed as the next check covers this one. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 360: > 358: __ push(rcx); > 359: __ push(r8); > 360: __ push(r9); No need to save/restore rcx/r8/r9 on windows platform as well. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 379: > 377: > 378: // Assume failure > 379: __ movq(rbp, -1); We are no more using rbp at return point so this is not needed now? src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 488: > 486: __ cmpq(r11, nMinusK); > 487: __ ja_b(L_return); > 488: __ movq(rax, r11); At places where we know that return value in r11 is correct, we dont need to checkRange so this could have its own label. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 566: > 564: // rbp: -1 > 565: // XMM_BYTE_0 - first element of needle broadcast > 566: // XMM_BYTE_K - last element of needle broadcast The only registers that are used as input in the switch case are: r14 = needle rbx = haystack rsi = haystack length (n) r12 = needle length (k) r10 = n - k (where k is needle length) XMM_BYTE_0 = first element of needle, broadcast XMM_BYTE_K = last element of needle, broadcast So we could only list these, making it easier to comprehend. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 578: > 576: // helper jumps to L_checkRangeAndReturn with a (-1) return value. > 577: big_case_loop_helper(false, 0, L_checkRangeAndReturn, L_loopTop, > mask, hsPtrRet, needleLen, > 578: needle, haystack, hsLength, tmp1, tmp2, tmp3, > rScratch, ae, _masm); If we run out of haystack instead of jumping to L_checkRangeAndReturn, we could directly jump to L_retrunError. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 597: > 595: > 596: // Need a lot of registers here to preserve state across > arrays_equals call > 597: This comment is no longer valid, could be removed. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 621: > 619: __ addq(hsPtrRet, index); > 620: __ movq(r11, hsPtrRet); > 621: __ jmp(L_checkRangeAndReturn); Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax). src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 660: > 658: // Haystack always copied to stack, so 32-byte reads OK > 659: // Haystack length < 32 > 660: // 10 < needle length < 32 Haystack length <= 32 10 < needle length <= 32 src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 721: > 719: false /* char */, knoreg); > 720: __ testl(rTmp3, rTmp3); > 721: __ jne(L_checkRangeAndReturn); Why do we have to checkRange here, would it not be always correct? It so we could return r11 directly (by moving into rax). src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1333: > 1331: > 1332: __ cmpq(nMinusK, 32); > 1333: __ jae_b(L_greaterThan32); Should this check be (n-k+1) >= 32? And so accordingly (n-k) >= 31 __ cmpq(nMinusK, 31); __ jae_b(L_greaterThan32); src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1382: > 1380: > 1381: __ testl(eq_mask, eq_mask); > 1382: __ je(noMatch); We are mixing operation width l and q here. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1750: > 1748: // r15 = unused > 1749: // XMM_BYTE_0 - first element of needle, broadcast > 1750: // XMM_BYTE_K - last element of needle, broadcast This comment is duplicated for both small haystack case and big haystack case, could be made a common comment. Also the only registers that are used as input in the switch case are: r14 = needle rbx = haystack rsi = haystack length (n) r12 = needle length (k) r10 = n - k (where k is needle length) XMM_BYTE_0 = first element of needle, broadcast XMM_BYTE_K = last element of needle, broadcast So we could only list these, making it easier to comprehend. src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1758: > 1756: // > 1757: // If a match is found, jump to L_checkRange, which ensures the > 1758: // matched needle is not past the end of the haystack. Another comment here would be useful: // The found index is returned in set_bit (r11). src/hotspot/cpu/x86/c2_stubGenerator_x86_64_string.cpp line 1810: > 1808: // XMM_BYTE_K - last element of needle, broadcast > 1809: // > 1810: // The haystack is > 32 bytes Good to mention some info about the return found index value in comment about how it is a combination of set_bit (r8), hs_ptr, and haystack. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617187600 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617193503 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617216424 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617218826 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617603927 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617318645 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617307443 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617536831 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617569308 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617575018 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617601913 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616424912 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1616427773 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617263035 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617267415 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1617273352