On Sat, 4 May 2024 19:35:21 GMT, Scott Gibbons <sgibb...@openjdk.org> wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark Score >> Latest >> StringIndexOf.advancedWithMediumSub 343.573 317.934 >> 0.925375393x >> StringIndexOf.advancedWithShortSub1 1039.081 1053.96 >> 1.014319384x >> StringIndexOf.advancedWithShortSub2 55.828 110.541 >> 1.980027943x >> StringIndexOf.constantPattern 9.361 11.906 >> 1.271872663x >> StringIndexOf.searchCharLongSuccess 4.216 4.218 >> 1.000474383x >> StringIndexOf.searchCharMediumSuccess 3.133 3.216 >> 1.02649218x >> StringIndexOf.searchCharShortSuccess 3.76 3.761 >> 1.000265957x >> StringIndexOf.success 9.186 >> 9.713 1.057369911x >> StringIndexOf.successBig 14.341 46.343 >> 3.231504079x >> StringIndexOfChar.latin1_AVX2_String 6220.918 12154.52 >> 1.953814533x >> StringIndexOfChar.latin1_AVX2_char 5503.556 5540.044 >> 1.006629895x >> StringIndexOfChar.latin1_SSE4_String 6978.854 6818.689 >> 0.977049957x >> StringIndexOfChar.latin1_SSE4_char 5657.499 5474.624 >> 0.967675646x >> StringIndexOfChar.latin1_Short_String 7132.541 >> 6863.359 0.962260014x >> StringIndexOfChar.latin1_Short_char 16013.389 16162.437 >> 1.009307711x >> StringIndexOfChar.latin1_mixed_String 7386.123 14771.622 >> 1.999915517x >> StringIndexOfChar.latin1_mixed_char 9901.671 9782.245 >> 0.987938803 > > Scott Gibbons has updated the pull request incrementally with one additional > commit since the last revision: > > Rearrange; add lambdas for clarity src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4492: > 4490: > 4491: // Compare char[] or byte[] arrays aligned to 4 bytes or substrings. > 4492: void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1, I liked the old style better, fewer longer lines.. same for rest of the changes in this file. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4594: > 4592: #endif //_LP64 > 4593: bind(COMPARE_WIDE_VECTORS); > 4594: vmovdqu(vec1, Address(ary1, limit, create a local scale variable instead of ternary operators. Used several times. src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 4250: > 4248: generate_chacha_stubs(); > 4249: > 4250: if ((UseAVX == 2) && EnableX86ECoreOpts && > VM_Version::supports_avx2()) { Just `if (EnableX86ECoreOpts)`? src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 391: > 389: } > 390: > 391: __ cmpq(needle_len, isU ? 2 : 1); Can we remove this comparison? i.e. - broadcast first and last character unconditionally (same character). Or - move broadcasts 'down' into individual cases.. There is already specialized code to handle needle of size 1.. This adds extra pathlength. (Will we actually call this intrinsic for needle_size==1? Assume length>=2?) src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1365: > 1363: // Compare first byte of needle to haystack > 1364: vpcmpeq(cmp_0, byte_0, Address(haystack, 0), > Assembler::AVX_256bit); > 1365: if (size != (isU ? 2 : 1)) { `if (size != scale)` Though in this case, `elem_size` might hold more meaning. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1372: > 1370: > 1371: if (bytesToCompare > 2) { > 1372: if (size > (isU ? 4 : 2)) { `if (size > 2*scale)`? src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1373: > 1371: if (bytesToCompare > 2) { > 1372: if (size > (isU ? 4 : 2)) { > 1373: if (doEarlyBailout) { Is there a big perf difference when `doEarlyBailout` is enabled? And/or just for this function? (i.e. removing `doEarlyBailout` in this function will mean less pathlength. Feels like a few extra vpands should be cheap enough.) src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1469: > 1467: > 1468: if (isU && (size & 1)) { > 1469: __ emit_int8(0xcc); This should also be an `assert()` to catch this at compile-time. src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1633: > 1631: if (isU) { > 1632: if ((size & 1) != 0) { > 1633: __ emit_int8(0xcc); Compile-time assert to ensure this code is never called instead? src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1889: > 1887: // r13 = (needle length - 1) > 1888: // r14 = &needle > 1889: // r15 = unused There is quite a bit of redundancy in register usage. Its not incorrect, but looks odd. Not clear if this duplication can easily be removed (or if/why needed). // rbx = &haystack // rdi = &haystack // rdx = &needle // r14 = &needle // rcx = haystack length // rsi = haystack length // r12 = needle length // r13 = (needle length - 1) // r10 = hs_len - needle len // rbp = -1 // rax = unused // r11 = unused // r8 = unused // r9 = unused // r15 = unused (Could this comment be out-of-sync with the code? Looks like only rbx, r14 and temps out of unused registers are used few lines down) src/hotspot/cpu/x86/stubGenerator_x86_64_string.cpp line 1950: > 1948: // r13 = (needle length - 1) > 1949: // r14 = &needle > 1950: // r15 = unused Same as for the small case ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592834449 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592838385 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1592831339 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599131482 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599146451 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599144855 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599143784 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599151000 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599204083 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599209564 PR Review Comment: https://git.openjdk.org/jdk/pull/16753#discussion_r1599213635