Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

Emanuel Peter Thu, 04 Jan 2024 06:32:21 -0800

On Thu, 4 Jan 2024 05:39:01 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> Hi,
>> 
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 
>> only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 
>> instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>> 
>> Implementation uses a lookup table to record permute indices. Table index is 
>> computed using
>> mask argument of compress/expand operation.
>> 
>> Following are the performance number of JMH micro included with the patch.
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>> 
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt    Score   
>> Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  142.767        
>>   ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   71.436        
>>   ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   35.992        
>>   ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  182.151        
>>   ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2   91.096        
>>   ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2   44.757        
>>   ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  184.099        
>>   ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2   91.981        
>>   ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2   45.170        
>>   ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2  148.017        
>>   ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2   73.516        
>>   ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2   36.844        
>>   ops/ms
>> 
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score   
>> Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707       
>>    ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072       
>>    ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898       
>>    ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195       
>>    ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229       
>>    ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665       
>>    ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384       
>>    ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt  ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Updating copyright year of modified files.

@jatin-bhateja this looks like a great improvement!
I have a few questions and requests below.

FYI, this feels very inspiring. I'm dreaming of a day where we could do this 
filtering in the auto-vectorizer directly.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5303:

> 5301:     // Blend the results with zero vector using permute vector as mask, 
> its
> 5302:     // non-participating lanes holds a -1 value.
> 5303:     vblendvps(dst, dst, xtmp, permv, vec_enc);

would you mind adding a few more comments to explain what happens here? I would 
also really appreciate more expressive register/variable names.

I think you are basically converting the `mask` to a permutation `permv`, by a 
lookup in the table.
Then you permute the `src` and blend it with a -1 vector, so that the unused 
(high) lanes are -1.


xtmp -> min_one
rtmp -> table_index
rscratch -> table_adr

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5307:

> 5305:     assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5306:     vmovmskpd(rtmp, mask, vec_enc);
> 5307:     shlq(rtmp, 5);

Might this need to be 6? If I understand right, then you want to have a 64bit 
stride, hence 2^6, right?
If that is correct, then this did not show in your tests, and you need a 
regression test anyway.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 488:

> 486:                           KRegister ktmp1, int vec_enc);
> 487: 
> 488: 

Remove useless empty line

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 957:

> 955:   __ align(CodeEntryAlignment);
> 956:   StubCodeMark mark(this, "StubRoutines", stub_name);
> 957:   address start = __ pc();

Could you please add some comments here why you are filling the data like this?
Presumably, you are emitting 32 bits and 64 bits respectively, right? So the 
cells have different size, correct?

test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java 
line 76:

> 74:         longinCol = new long[size];
> 75:         longoutCol = new long[size];
> 76:         lpivot = size / 2;

I'd be interested to see what happens if you move up or down the "density" of 
elements that you accept. Would the simple branch prediction be faster if the 
density is low enough, i.e. we almost take no element.

Though maybe that is not compiler problem but a user-problem?

test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java 
line 94:

> 92:            IntVector vec = IntVector.fromArray(ispecies, intinCol, i);
> 93:            VectorMask<Integer> pred = vec.compare(VectorOperators.GT, 
> ipivot);
> 94:            vec.compress(pred).intoArray(intoutCol, j);

Could there be equivalent `expand` tests?

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1804121213
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441749005
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441761312
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441724949
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441759984
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441753158
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441729256

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

Reply via email to