On Fri, 19 Jan 2024 19:03:31 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi,
>>
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2
>> only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
>> instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>>
>> Implementation uses a lookup table to record permute indices. Table index is
>> computed using
>> mask argument of compress/expand operation.
>>
>> Following are the performance number of JMH micro included with the patch.
>>
>>
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>>
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844          ops/ms
>>
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Modified code comment for clarity.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 985:

> 983:       for (int j = 0; j < 4; j++) {
> 984:         if (mask & (1 << j)) {
> 985:           __ emit_data64(j, relocInfo::none);

This could be something like

    __ emit_data(2*j, relocInfo::none);
    __ emit_data(2*j+1, relocInfo::none)

to have the double word masks in the table to begin with. Then we don't need
the extra instructions in vector_compress_expand_avx2() to generate double
word permute masks from long masks.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1460113427
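
To make the suggestion above concrete, here is a rough standalone model of the two table layouts for a 4-lane (64-bit element) compress. It is not HotSpot code: build_qword_table, build_dword_table, widen_to_dword_indices and kPad are hypothetical names, and the -1 filler for unused slots is an assumption, since the quoted snippet does not show how the stub pads a row. The point is only that storing the pair (2*j, 2*j+1) up front yields the same double word indices that would otherwise have to be derived from the long indices at run time.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed filler for unused slots; the real stub may pad differently.
constexpr int32_t kPad = -1;

using QwordRow = std::array<int64_t, 4>;  // one index per 64-bit lane
using DwordRow = std::array<int32_t, 8>;  // one index per 32-bit lane

// Models the table as quoted: one 64-bit lane index j per set mask bit.
std::array<QwordRow, 16> build_qword_table() {
  std::array<QwordRow, 16> table{};
  for (int mask = 0; mask < 16; mask++) {
    int out = 0;
    for (int j = 0; j < 4; j++) {
      if (mask & (1 << j)) table[mask][out++] = j;
    }
    while (out < 4) table[mask][out++] = kPad;
  }
  return table;
}

// Models the suggested table: the dword pair (2*j, 2*j+1) per set mask bit.
std::array<DwordRow, 16> build_dword_table() {
  std::array<DwordRow, 16> table{};
  for (int mask = 0; mask < 16; mask++) {
    int out = 0;
    for (int j = 0; j < 4; j++) {
      if (mask & (1 << j)) {
        table[mask][out++] = 2 * j;
        table[mask][out++] = 2 * j + 1;
      }
    }
    while (out < 8) table[mask][out++] = kPad;
  }
  return table;
}

// The conversion that is needed at run time when the table holds long indices.
DwordRow widen_to_dword_indices(const QwordRow& row) {
  DwordRow out{};
  for (int i = 0; i < 4; i++) {
    assert(row[i] < 4);  // only lane indices 0..3 or the filler are expected
    if (row[i] < 0) {
      out[2 * i] = kPad;
      out[2 * i + 1] = kPad;
    } else {
      out[2 * i] = static_cast<int32_t>(2 * row[i]);
      out[2 * i + 1] = static_cast<int32_t>(2 * row[i] + 1);
    }
  }
  return out;
}
```

In this model, widening any row of the long table gives exactly the corresponding row of the double word table, which is why the extra conversion step becomes unnecessary. Note that the row size in bytes is the same either way (4 x 8 versus 8 x 4), so the table footprint does not change.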