On Tue, 18 Feb 2025 08:43:38 GMT, Emanuel Peter <epe...@openjdk.org> wrote:
>> To make it more explicit: implementing long min/max in ad files as cmp will
>> likely remove all the 100% regressions that are observed here. I'm going to
>> repeat the same MinMaxVector int min/max reduction test above with the ad
>> changes @rwestrel suggested to see what effect they have.
>
> @galderz I think we will have the same issue with both `int` and `long`: as
> far as I know, it is a genuinely difficult problem to decide at compile time
> whether a `cmove` or a `branch` is the better choice. I'm not sure there is
> any heuristic for which you will not find a micro-benchmark where the
> heuristic makes the wrong choice.
>
> To my understanding, these are the factors that impact the performance:
> - `cmove` requires all inputs to complete before it can execute, and it has
>   an inherent latency of a cycle or so itself. But you cannot have any
>   branch mispredictions, and hence no branch misprediction penalties (i.e.
>   when the CPU has to flush out the ops from the wrong branch and restart at
>   the branch).
> - `branch` can hide some latencies, because we can already continue with the
>   branch that is speculated on. We do not need to wait for the inputs of the
>   comparison to arrive, and we can already continue with the speculated
>   resulting value. But if the speculation is ever wrong, we have to pay the
>   misprediction penalty.
>
> In my understanding, there are roughly three scenarios:
> - The branch probability is so extreme that the branch predictor would be
>   correct almost always, and so it is profitable to emit branching code.
> - The branch probability is somewhere in the middle, and the branch is not
>   predictable. Branch mispredictions are very expensive, and so it is better
>   to use `cmove`.
> - The branch probability is somewhere in the middle, but the branch is
>   predictable (e.g. it swaps back and forth). The branch predictor will have
>   almost no mispredictions, and it is faster to use branching code.
>
> Modeling this precisely is actually a little complex. You would have to know
> the cost of both the `cmove` and the `branching` version of the code. That
> depends on the latency of the inputs and the outputs: does the `cmove`
> dramatically increase the latency on the critical path, and could
> `branching` hide some of that latency? And you would have to know how good
> the branch predictor is, which you cannot derive from the branching
> probability in our profiling (at least not when the probabilities are in the
> middle and you don't know whether the pattern is random or predictable).
>
> If we can find a perfect heuristic - that would be fantastic ;)
>
> If we cannot find a perfect heuristic, then we should think about which
> scenarios are the most "common" or "relevant", I think.
>
> But let's discuss all of this in a call / offline :)

FYI @eme64 @chhagedorn @rwestrel

Since we know that vectorization does not always kick in, there was a worry
that the scalar fallbacks would suffer heavily with the work in this PR to
add a long min/max intrinsic. Looking at the same scenarios with `int` (see
my comments https://github.com/openjdk/jdk/pull/20098#issuecomment-2669329851
and https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644), it
looks clear that the same kind of regressions are also present there. So, if
those `int` scalar regressions were not a problem when the `int` min/max
intrinsic was added, I would expect the same to apply to `long`.
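To make the three scenarios above concrete, here is a minimal standalone
sketch (not the actual MinMaxVector benchmark; the class name, method names
and sizes are all mine) that feeds the same 50/50 branch probability to a
scalar `Math.min(long, long)` loop with both an unpredictable and a
predictable pattern:

```java
import java.util.Random;

// Sketch: the same branch probability can be cheap or expensive depending on
// whether the outcome pattern is predictable, which the profiled probability
// alone cannot tell C2.
public class LongMinPatterns {

    static final int N = 1_000_000;

    // Scalar elementwise min. Depending on JDK version and heuristics, C2
    // may compile Math.min(long, long) to either a compare-and-branch or a
    // conditional move.
    static long minSum(long[] a, long[] b) {
        long s = 0;
        for (int i = 0; i < a.length; i++) {
            s += Math.min(a[i], b[i]);
        }
        return s;
    }

    static void measure(String name, long[] a, long[] b) {
        long best = Long.MAX_VALUE, sink = 0;
        for (int rep = 0; rep < 20; rep++) {   // crude warmup + repeat
            long t0 = System.nanoTime();
            sink += minSum(a, b);
            best = Math.min(best, System.nanoTime() - t0);
        }
        System.out.printf("%-12s best %9d ns (sink=%d)%n", name, best, sink);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        long[] lo = new long[N], hi = new long[N];      // a < b always
        long[] rndA = new long[N], rndB = new long[N];  // 50/50, unpredictable
        long[] altA = new long[N], altB = new long[N];  // 50/50, predictable
        for (int i = 0; i < N; i++) {
            lo[i] = 0; hi[i] = 1;                       // extreme probability
            rndA[i] = rnd.nextLong(); rndB[i] = rnd.nextLong();
            altA[i] = i & 1; altB[i] = 1 - (i & 1);     // swaps back and forth
        }
        measure("extreme", lo, hi);
        measure("random", rndA, rndB);
        measure("alternating", altA, altB);
    }
}
```

Numbers from a sketch like this should of course be taken with a grain of
salt: whether the loop stays scalar and whether the min ends up as a branch
or a `cmove` depends on the JDK build, flags and heuristics. The point is
only that the `random` and `alternating` cases have the same branch
probability yet can behave very differently, which is why the profile alone
cannot drive the heuristic.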
Re: https://github.com/openjdk/jdk/pull/20098#issuecomment-2671144644 - I was
trying to think about what could be causing this. I thought maybe it is due
to the `int` min/max backend being implemented in a platform-specific way,
versus the `long` min/max backend relying on platform-independent macro
expansion. But if that theory were true, I would expect the same behaviour
for `int` max vs `long` max, and that is not the case. It seems odd to only
see this difference with min.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2671163220