On Mon, 10 Feb 2025 09:26:32 GMT, Galder Zamarreño <gal...@openjdk.org> wrote:

>> @eastig is helping with the results on aarch64, so I will verify those numbers 
>> in the same way as done below for x86_64 once he provides me with the results.
>> 
>> Here is a summary of the benchmarking results I'm seeing on x86_64 (I will 
>> push an update that just merges the latest master shortly).
>> 
>> First I will go through the results of `MinMaxVector`. This benchmark 
>> computes throughput by default, so the higher the number, the better.
>> 
>> # MinMaxVector AVX-512
>> 
>> Following are results with AVX-512 instructions:
>> 
>> Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt   Baseline     Patch   Units
>> MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    4    834.127  3688.961  ops/ms
>> MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    4   1147.010  3687.721  ops/ms
>> MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    4   1126.718  1072.812  ops/ms
>> MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    4   1070.921  1070.538  ops/ms
>> MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    4    510.483  1073.081  ops/ms
>> MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    4    935.658  1016.910  ops/ms
>> MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    4   1007.410   933.774  ops/ms
>> MinMaxVector.longLoopMin                  100      N/A     N/A    2048  thrpt    4    536.582  1017.337  ops/ms
>> MinMaxVector.longReductionMax              50      N/A     N/A    2048  thrpt    4    967.288   966.945  ops/ms
>> MinMaxVector.longReductionMax              80      N/A     N/A    2048  thrpt    4    967.327   967.382  ops/ms
>> MinMaxVector.longReductionMax             100      N/A     N/A    2048  thrpt    4    849.689   967.327  ops/ms
>> MinMaxVector.longReductionMin              50      N/A     N/A    2048  thrpt    4    966.323   967.275  ops/ms
>> MinMaxVector.longReductionMin              80      N/A     N/A    2048  thrpt    4    967.340   967.228  ops/ms
>> MinMaxVector.longReductionMin             100      N/A     N/A    2048  thrpt    4    880.921   967.233  ops/ms
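>> 
>> To make the kernel shapes concrete: `longClippingRange` presumably clamps 
>> each element into a `[low, high]` window, something like this sketch (method 
>> and variable names are placeholders, not the actual benchmark source):
>> 
>> ```java
>> // Sketch: clamp each element into [low, high]; with the patch this
>> // can become vector min/max (e.g. vpmaxsq/vpminsq on AVX-512).
>> static void clipSketch(long[] a, long[] r, long low, long high) {
>>     for (int i = 0; i < a.length; i++) {
>>         r[i] = Math.min(Math.max(a[i], low), high);
>>     }
>> }
>> ```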
>> 
>> 
>> ### `longReduction[Min|Max]` performance improves slightly when probability is 100
>> 
>> Without the patch the code uses compare instructions:
>> 
>> 
>>    7.83%  ││││ │││↗  │           0x00007f4f700fb305:   imulq         $0xb, 0x20(%r14, %r8, 8), %rdi
>>           ││││ │││...
>
>> At 100% probability, the baseline fails to vectorize because it observes 
>> control flow. This control flow is not the one you see in the min/max 
>> implementations, but one added by HotSpot as a result of JIT profiling: 
>> it observes that one branch is always taken, so it optimizes for that and 
>> adds a branch for the uncommon case where the branch is not taken.
> 
> I've dug further into this to try to understand how the baseline HotSpot code 
> works, and the explanation above is not entirely correct. Let's look at the 
> IR differences between, say, the 100% and 80% branch situations.
> 
> At a branch probability of 80% you see:
> 
>  1115  CountedLoop  === 1115 598 463  [[ 1101 1115 1116 1118 451 594 ]] inner stride: 2 main of N1115 strip mined !orig=[599],[590],[307] !jvms: MinMaxVector::longLoopMax @ bci:10 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   692  LoadL  === 1083 1101 393  [[ 747 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[395] !jvms: MinMaxVector::longLoopMax @ bci:26 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
>   651  LoadL  === 1095 1101 355  [[ 747 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9; #long (does not depend only on test, unknown control) !orig=[357] !jvms: MinMaxVector::longLoopMax @ bci:20 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
>   747  MaxL  === _ 651 692  [[ 451 ]]  !orig=[608],[416] !jvms: Math::max @ bci:11 (line 2037) MinMaxVector::longLoopMax @ bci:27 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   451  StoreL  === 1115 1101 449 747  [[ 1116 454 911 ]]  @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=9;  Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=9; !orig=1124 !jvms: MinMaxVector::longLoopMax @ bci:30 (line 236) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
>   594  CountedLoopEnd  === 1115 593  [[ 1123 463 ]] [lt] P=0.999731, C=780799.000000 !orig=[462] !jvms: MinMaxVector::longLoopMax @ bci:7 (line 235) MinMaxVector_longLoopMax_jmhTest::longLoopMax_thrpt_jmhStub @ bci:19 (line 124)
> 
> 
> You see the counted loop with the two `LoadL`s for the array loads and the 
> `MaxL` consuming them. The `StoreL` is the array assignment (I think).
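> 
> For reference, the Java shape that produces this IR is roughly the following 
> (reconstructed from the `!jvms` lines above; array and method names are 
> placeholders, not the actual benchmark fields):
> 
> ```java
> // Line 236 in the dump: two LoadL feed the MaxL, and the StoreL
> // writes the result back to the array.
> static void loopMaxSketch(long[] a, long[] b, long[] c) {
>     for (int i = 0; i < c.length; i++) {
>         c[i] = Math.max(a[i], b[i]);
>     }
> }
> ```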
> 
> At a branch probability of 100% you see:
> 
> 
>  ...

@galderz Thanks for all the explanations, that's really helpful 😊 

**Discussion**

- AVX-512: only improvements.
  - Especially with probability 100, where before we used the bytecode, which 
would then create an `unstable_if` with an uncommon trap. That meant we could not 
re-discover the CMove / Max later in the IR. Now that we never inline the 
bytecode and just intrinsify directly, we can use `vpmax`, and that is faster.
  - Ah, maybe that was all incorrect, though it sounded reasonable. You seem to 
suggest that we actually did inline both branches, but that the issue was 
that `PhaseIdealLoop::conditional_move` does not like extreme probabilities, 
so it did not convert the 100% cases to CMove, and hence they did not 
vectorize. Right. Getting the probability cutoff just right is a little 
tricky there, and the precise number can seem strange. But that's a discussion 
for another day.
  - The reduction case only improves slightly, at least for now. Maybe we can 
further improve the throughput with 
[this](https://bugs.openjdk.org/browse/JDK-8345245) later on; a sketch of that 
idea follows below.
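To make that concrete, here is my reading of the 
[JDK-8345245](https://bugs.openjdk.org/browse/JDK-8345245) idea as a sketch 
(method and variable names are mine, not from the issue):

```java
// Sketch: reassociating a max-reduction into independent chains.
// A single accumulator serializes on the latency of each max; two
// (or more) accumulators let those latencies overlap.
static long reduceMaxReassociated(long[] a) {
    long m0 = Long.MIN_VALUE;
    long m1 = Long.MIN_VALUE;
    int i = 0;
    for (; i + 1 < a.length; i += 2) {
        m0 = Math.max(m0, a[i]);     // chain 0
        m1 = Math.max(m1, a[i + 1]); // chain 1, independent of chain 0
    }
    long m = Math.max(m0, m1);       // combine the chains
    for (; i < a.length; i++) {      // scalar tail
        m = Math.max(m, a[i]);
    }
    return m;
}
```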

- AVX2: mixed results.
  - `longReductionMax/Min`: vector max/min is not implemented. We should 
investigate why.
  - It seems like `MaxVL` and `MinVL` (e.g. `vpmaxsq`) are only directly 
available with AVX-512, see 
[this](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4669,2611&text=max_epi64).
  - As you suggested, @galderz, we could consider implementing it via `cmove` in 
the backend for AVX2 and maybe lower; see the sketch after this list. Maybe we 
can talk with @jatin-bhateja about this. That would probably already be worth it 
on its own, in a separate RFE, because I suspect it could give a speedup in the 
non-100% cases as well. Maybe this would even have to be an RFE that goes in 
first, so we don't have regressions here?
  - But even so: just intrinsifying should not give us a regression, because 
there will always be cases where the auto-vectorizer fails, and so the scalar 
code should not be slower with your patch than on master, right? So we need to 
investigate this scalar issue as well.
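The compare+blend fallback I have in mind looks roughly like this, written with 
the Vector API purely as an illustration; the real change would be in the C2 
backend, emitting something like `vpcmpgtq` + `vpblendvb` where `vpmaxsq` is 
unavailable, so this is not the proposed implementation:

```java
// Needs --add-modules jdk.incubator.vector
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MaxViaBlend {
    static final VectorSpecies<Long> S = LongVector.SPECIES_256;

    // Lane-wise max of two long arrays using compare + blend, the same
    // shape an AVX2 backend could use since it lacks vpmaxsq.
    static void maxInto(long[] a, long[] b, long[] r) {
        int i = 0;
        for (; i < S.loopBound(a.length); i += S.length()) {
            LongVector va = LongVector.fromArray(S, a, i);
            LongVector vb = LongVector.fromArray(S, b, i);
            VectorMask<Long> lt = va.compare(VectorOperators.LT, vb); // ~vpcmpgtq
            va.blend(vb, lt).intoArray(r, i); // ~vpblendvb: take b where a < b
        }
        for (; i < a.length; i++) {           // scalar tail
            r[i] = Math.max(a[i], b[i]);
        }
    }
}
```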

- VectorReduction2.WithSuperword on AVX-512
  - `long[Min|Max]Simple` performance drops considerably. Yes, this case is not 
yet supposed to vectorize; I'm working on that. It is the issue with "simple" 
reductions, i.e. those that do no work other than the reduction itself (see the 
sketch below). Our current reduction heuristic thinks these are not profitable 
to vectorize, but that is wrong in almost all cases. You even filed an issue 
for that a while back ;) see 
https://bugs.openjdk.org/browse/JDK-8345044 and related issues. We could bite 
the bullet on this, knowing that I'm working on it and it will probably fix 
that issue, or we just wait a little here. Let's discuss.
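For clarity, this is the distinction between the two kinds of reductions (a 
sketch; method names are mine, and the multiply is just a stand-in for "other 
work" per element):

```java
// "Simple" reduction: nothing but the reduce itself. The current
// SuperWord heuristic deems this unprofitable and keeps it scalar
// (the JDK-8345044 issue), even though vectorizing usually pays off.
static long maxSimple(long[] a) {
    long m = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        m = Math.max(m, a[i]);
    }
    return m;
}

// Reduction with extra per-element work: here the loop carries enough
// vectorizable work that the heuristic does allow vectorization.
static long maxWithWork(long[] a) {
    long m = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        m = Math.max(m, a[i] * 11L);
    }
    return m;
}
```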

- VectorReduction2.NoSuperword on AVX-512 machine
  - Hmm, ok. So we seem to realize that the scalar case is slower with your 
patch in some cases, because now we have a `cmove` on the critical path, whereas 
previously we could just predict the branches, which was faster (see the sketch 
below). It is interesting that the mix of other instructions has an effect here 
as well: you seem to see a speedup with the "big" benchmarks, but the "small" 
and "dot" benchmarks are slower. That is surprising. It would be great if we 
understood why it behaves this way.
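To spell out the critical-path point, these are the two scalar shapes being 
compared (a sketch; method names are mine):

```java
// Baseline: Math.max inlined from bytecode becomes a branch. At extreme
// probabilities the branch predicts perfectly, so the accumulator update
// rarely adds latency to the loop-carried chain.
static long maxBranchy(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        if (a[i] > acc) acc = a[i]; // predictable branch, nearly free
    }
    return acc;
}

// Patch: the intrinsic keeps Math.max as a MaxL node, which materializes
// as a cmove/max-style instruction. Its latency now sits on the
// loop-carried dependence chain through acc in every iteration.
static long maxCmove(long[] a) {
    long acc = Long.MIN_VALUE;
    for (int i = 0; i < a.length; i++) {
        acc = Math.max(acc, a[i]);  // data dependence, no prediction
    }
    return acc;
}
```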

**Summary**

Wow, things are more complicated than I would have thought; I hope you are not 
too discouraged 🙈 

We seem to have these issues, maybe there are more:
- AVX2 does not have long vector min/max implemented. That can be done in a 
separate RFE.
- Simple reductions do not vectorize, a known issue (see 
https://bugs.openjdk.org/browse/JDK-8345044); I'm working on that.
- Scalar reductions are slower with your patch at extreme probabilities. 
Before, they were done with branches, and branch prediction was fast. Now, with 
cmove or max instructions, the critical path is longer, and that makes things 
slow. Maybe this could be alleviated by reordering / reassociating the 
reduction path, see [JDK-8345245](https://bugs.openjdk.org/browse/JDK-8345245) 
and the sketch above. Alternatively, we could convert the `cmove` back to a 
branch, but for that we would probably need to know the branch probability, 
which we no longer have, right? Tricky. This seems to be the real issue we need 
to address and discuss.

@galderz What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2656328729
