On Thu, 4 Jan 2024 05:39:01 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs in columnar database filter operation. >> >> Implementation uses a lookup table to record permute indices. Table index is >> computed using >> mask argument of compress/expand operation. >> >> Following are the performance number of JMH micro included with the patch. >> >> >> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) >> >> Baseline: >> Benchmark (size) Mode Cnt Score >> Error Units >> ColumnFilterBenchmark.filterDoubleColumn 1024 thrpt 2 142.767 >> ops/ms >> ColumnFilterBenchmark.filterDoubleColumn 2047 thrpt 2 71.436 >> ops/ms >> ColumnFilterBenchmark.filterDoubleColumn 4096 thrpt 2 35.992 >> ops/ms >> ColumnFilterBenchmark.filterFloatColumn 1024 thrpt 2 182.151 >> ops/ms >> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 91.096 >> ops/ms >> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 44.757 >> ops/ms >> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 184.099 >> ops/ms >> ColumnFilterBenchmark.filterIntColumn 2047 thrpt 2 91.981 >> ops/ms >> ColumnFilterBenchmark.filterIntColumn 4096 thrpt 2 45.170 >> ops/ms >> ColumnFilterBenchmark.filterLongColumn 1024 thrpt 2 148.017 >> ops/ms >> ColumnFilterBenchmark.filterLongColumn 2047 thrpt 2 73.516 >> ops/ms >> ColumnFilterBenchmark.filterLongColumn 4096 thrpt 2 36.844 >> ops/ms >> >> Withopt: >> Benchmark (size) Mode Cnt Score >> Error Units >> ColumnFilterBenchmark.filterDoubleColumn 1024 thrpt 2 2051.707 >> ops/ms >> ColumnFilterBenchmark.filterDoubleColumn 2047 thrpt 2 914.072 >> ops/ms >> ColumnFilterBenchmark.filterDoubleColumn 4096 thrpt 2 489.898 >> ops/ms >> ColumnFilterBenchmark.filterFloatColumn 1024 thrpt 2 5324.195 >> ops/ms >> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 >> ops/ms >> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 >> ops/ms >> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 >> ops/ms >> ColumnFilterBenchmark.filterIntColumn 2047 thrpt ... > > Jatin Bhateja has updated the pull request incrementally with one additional > commit since the last revision: > > Updating copyright year of modified files. @jatin-bhateja this looks like a great improvement! I have a few questions and requests below. FYI, this feels very inspiring. I'm dreaming of a day where we could do this filtering in the auto-vectorizer directly. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5303: > 5301: // Blend the results with zero vector using permute vector as mask, > its > 5302: // non-participating lanes holds a -1 value. > 5303: vblendvps(dst, dst, xtmp, permv, vec_enc); would you mind adding a few more comments to explain what happens here? I would also really appreciate more expressive register/variable names. I think you are basically converting the `mask` to a permutation `permv`, by a lookup in the table. Then you permute the `src` and blend it with a -1 vector, so that the unused (high) lanes are -1. xtmp -> min_one rtmp -> table_index rscratch -> table_adr src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5307: > 5305: assert(bt == T_LONG || bt == T_DOUBLE, ""); > 5306: vmovmskpd(rtmp, mask, vec_enc); > 5307: shlq(rtmp, 5); Might this need to be 6? If I understand right, then you want to have a 64bit stride, hence 2^6, right? If that is correct, then this did not show in your tests, and you need a regression test anyway. src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 488: > 486: KRegister ktmp1, int vec_enc); > 487: > 488: Remove useless empty line src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 957: > 955: __ align(CodeEntryAlignment); > 956: StubCodeMark mark(this, "StubRoutines", stub_name); > 957: address start = __ pc(); Could you please add some comments here why you are filling the data like this? Presumably, you are emitting 32 bits and 64 bits respectively, right? So the cells have different size, correct? test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 76: > 74: longinCol = new long[size]; > 75: longoutCol = new long[size]; > 76: lpivot = size / 2; I'd be interested to see what happens if you move up or down the "density" of elements that you accept. Would the simple branch prediction be faster if the density is low enough, i.e. we almost take no element. Though maybe that is not compiler problem but a user-problem? test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 94: > 92: IntVector vec = IntVector.fromArray(ispecies, intinCol, i); > 93: VectorMask<Integer> pred = vec.compare(VectorOperators.GT, > ipivot); > 94: vec.compress(pred).intoArray(intoutCol, j); Could there be equivalent `expand` tests? ------------- Changes requested by epeter (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1804121213 PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441749005 PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441761312 PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441724949 PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441759984 PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441753158 PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441729256