On Tue, 18 Feb 2025 08:43:38 GMT, Emanuel Peter <epe...@openjdk.org> wrote:
>> To make it more explicit: implementing long min/max in ad files as cmp will
>> likely remove all the 100% regressions that are observed here. I'm going to
>> repeat the same MinMaxVector int min/max reduction test above with the ad
>> changes @rwestrel suggested to see what effect they have.
>
> @galderz I think we will have the same issue with both `int` and `long`: as
> far as I know, it is a genuinely difficult problem to decide at compile time
> whether a `cmove` or a `branch` is the better choice. I'm not sure there is
> any heuristic for which you will not find a micro-benchmark where the
> heuristic makes the wrong choice.
>
> To my understanding, these are the factors that impact the performance:
> - `cmove` requires all inputs to complete before it can execute, and it has
>   an inherent latency of a cycle or so itself. But you cannot have any
>   branch mispredictions, and hence no branch misprediction penalties (i.e.
>   when the CPU has to flush out the ops from the wrong branch and restart at
>   the branch).
> - `branch` can hide some latencies, because we can already continue with the
>   branch that is speculated on. We do not need to wait for the inputs of the
>   comparison to arrive, and we can already continue with the speculated
>   resulting value. But if the speculation is ever wrong, we have to pay the
>   misprediction penalty.
>
> In my understanding, there are roughly three scenarios:
> - The branch probability is so extreme that the branch predictor would be
>   correct almost always, and so it is profitable to emit branching code.
> - The branch probability is somewhere in the middle, and the branch is not
>   predictable. Branch mispredictions are very expensive, and so it is better
>   to use `cmove`.
> - The branch probability is somewhere in the middle, but the branch is
>   predictable (e.g. it swaps back and forth). The branch predictor will have
>   almost no mispredictions, and it is faster to use branching code.
>
> Modeling this precisely is actually a little complex. You would have to know
> the cost of both the `cmove` and the `branching` version of the code. That
> depends on the latency of the inputs and the outputs: does the `cmove`
> dramatically increase the latency on the critical path, and could
> `branching` hide some of that latency? And you would have to know how good
> the branch predictor is, which you cannot derive from the branching
> probability in our profiling (at least not when the probabilities are in the
> middle and you don't know whether the pattern is random or predictable).
>
> If we can find a perfect heuristic - that would be fantastic ;)
>
> If we cannot find a perfect heuristic, then we should think about which
> scenarios are the most "common" or "relevant", I think.
>
> But let's discuss all of this in a call / offline :)

FYI @eme64 @chhagedorn @rwestrel

Since we know that vectorization does not always kick in, there was a worry
that the scalar fallbacks would suffer heavily with the work in this PR to
add a long min/max intrinsic. Looking at the same scenarios with `int` (see
my comments https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851
and https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644), it
looks clear that the same kind of regressions are also present there. So, if
those `int` scalar regressions were not a problem when the `int` min/max
intrinsic was added, I would expect the same to apply to `long`.
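To make the three scenarios above concrete, here is a minimal standalone
sketch (not the actual MinMaxVector benchmark; the class name, method names
and sizes are all mine) that feeds the same 50/50 branch probability to a
scalar `Math.min(long, long)` loop with both an unpredictable and a
predictable pattern:

```java
import java.util.Random;

// Sketch: the same branch probability can be cheap or expensive depending on
// whether the outcome pattern is predictable, which the profiled probability
// alone cannot tell C2.
public class LongMinPatterns {

    static final int N = 1_000_000;

    // Scalar elementwise min. Depending on JDK version and heuristics, C2
    // may compile Math.min(long, long) to either a compare-and-branch or a
    // conditional move.
    static long minSum(long[] a, long[] b) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += Math.min(a[i], b[i]);
        }
        return s;
    }

    static void measure(String name, long[] a, long[] b) {
        long best = Long.MAX_VALUE, sink = 0;
        for (int rep = 0; rep < 20; rep++) {   // crude warmup + repeat
            long t0 = System.nanoTime();
            sink += minSum(a, b);
            best = Math.min(best, System.nanoTime() - t0);
        }
        System.out.printf("%-12s best %9d ns (sink=%d)%n", name, best, sink);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        long[] lo = new long[N], hi = new long[N];      // a < b always
        long[] rndA = new long[N], rndB = new long[N];  // 50/50, unpredictable
        long[] altA = new long[N], altB = new long[N];  // 50/50, predictable
        for (int i = 0; i < N; i++) {
            lo[i] = 0; hi[i] = 1;                       // extreme probability
            rndA[i] = rnd.nextLong(); rndB[i] = rnd.nextLong();
            altA[i] = i & 1; altB[i] = 1 - (i & 1);     // swaps back and forth
        }
        measure("extreme", lo, hi);
        measure("random", rndA, rndB);
        measure("alternating", altA, altB);
    }
}
```

Numbers from a sketch like this should of course be taken with a grain of
salt: whether the loop stays scalar and whether the min ends up as a branch
or a `cmove` depends on the JDK build, flags and heuristics. The point is
only that the `random` and `alternating` cases have the same branch
probability yet can behave very differently, which is why the profile alone
cannot drive the heuristic.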
Re: https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 - I was
trying to think about what could be causing this. I thought maybe it is due
to the `int` min/max backend being implemented in a platform-specific way,
versus the `long` min/max backend relying on platform-independent macro
expansion. But if that theory were true, I would expect the same behaviour
for `int` max vs `long` max, and that is not the case. It seems odd to only
see this difference with min.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2671163220