On Fri, 19 Jan 2024 19:03:31 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> Hi,
>>
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2
>> only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
>> instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>>
>> Implementation uses a lookup table to record permute indices. Table index is
>> computed using mask argument of compress/expand operation.
>>
>> Following are the performance number of JMH micro included with the patch.
>>
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>>
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844          ops/ms
>>
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   Modified code comment for clarity.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5305:

> 5303:   // value, this can potentially be used as a blending mask after
> 5304:   // compressing/expanding the source vector lanes.
> 5305:   vblendvps(dst, dst, xtmp, permv, vec_enc, false, xtmp1);

If I am not wrong, the last argument to vblendvps can be the same as permv. That way we won't need xtmp1, i.e. the vblendvps call can be modified as follows:

    vblendvps(dst, dst, xtmp, permv, vec_enc, false, permv);

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1460080650
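
For readers following the quoted PR description, the lookup-table idea (permute indices recorded per mask value, table indexed by the mask) can be illustrated with a minimal scalar sketch in Java. This is not the HotSpot implementation: the class and method names (`CompressTableSketch`, `buildCompressTable`, `compress`) are hypothetical, 8 int lanes are assumed (one 256-bit AVX2 vector), and the use of -1 as a "no source lane" marker that doubles as a blend selector is an assumption made here for illustration.

```java
import java.util.Arrays;

public class CompressTableSketch {
    // Assumption: 8 x 32-bit lanes, as in a 256-bit AVX2 vector.
    static final int LANES = 8;

    // Row `mask` lists the indices of the set bits of `mask` in ascending
    // order, padded with -1 for the remaining output lanes.
    static int[][] buildCompressTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < (1 << LANES); mask++) {
            Arrays.fill(table[mask], -1);
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][out++] = lane;
                }
            }
        }
        return table;
    }

    // Scalar stand-in for "permute by table row, then blend with zero".
    static int[] compress(int[] src, int mask, int[][] table) {
        int[] perm = table[mask];          // table index is the mask itself
        int[] dst = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            // A -1 entry selects zero instead of a source lane.
            dst[i] = perm[i] >= 0 ? src[perm[i]] : 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        int[][] table = buildCompressTable();
        int[] src = {10, 11, 12, 13, 14, 15, 16, 17};
        // Lanes 0, 2, 5 and 7 are selected by this mask.
        System.out.println(Arrays.toString(compress(src, 0b10100101, table)));
    }
}
```

In this sketch the same table row supplies both the permute indices and the information about which output lanes to zero, which is the scalar analogue of reusing the permute vector as the blend mask.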