On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

> Hi All,
> 
> As per the discussion on panama-dev mailing list[1], patch adds the support 
> for following new two vector permutation APIs.
> 
> 
> Declaration:-
>     Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
> 
> 
> Semantics:-
>     Using index values stored in the lanes of "this" vector, assemble the 
> values stored in first (v1) and second (v2) vector arguments. Thus, first and 
> second vector serves as a table, whose elements are selected based on index 
> value vector. API is applicable to all integral and floating-point types.  
> The result of this operation is semantically equivalent to expression 
> v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must 
> lie within valid two vector index range [0, 2*VLEN) else an 
> IndexOutOfBoundException is thrown.  
> 
> Summary of changes:
> -  Java side implementation of new selectFrom API.
> -  C2 compiler IR and inline expander changes.
> -  In absence of direct two vector permutation instruction in target ISA, a 
> lowering transformation dismantles new IR into constituent IR supported by 
> target platforms. 
> -  Optimized x86 backend implementation for AVX512 and legacy target.
> -  Function tests covering new API.
> 
> JMH micro included with this patch shows around 10-15x gain over existing 
> rearrange API :-
> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
> 
> 
>   Benchmark                                     (size)   Mode  Cnt      Score 
>   Error   Units
> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762   
>        ops/ms
> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550   
>        ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605   
>        ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004   
>        ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758   
>        ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192   
>        ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459   
>        ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556   
>        ops/ms
> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830   
>        ops/ms
> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174   
>        ops/ms
> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804   
>        ops/ms
> SelectFromBenchmark.selectFromIntVector         2048  thrpt    2   5398.2...

> _Mailing list message from [John Rose](mailto:john.r.r...@oracle.com) on 
> [hotspot-compiler-dev](mailto:hotspot-compiler-...@mail.openjdk.org):_
> 
> (Better late than never, although I wish I?d been more explicit about this on 
> panama-dev.)
> 
> I think we should be moving away from throwing exceptions on all 
> reorder/shuffle/permute vector ops, and moving toward wrapping. These ops all 
> operate on vectors (small arrays) of vector lane indexes (small array indexes 
> in a fixed domain, always a power of two). The throwing behavior checks an 
> input for bad indexes and throws a (scalar) exception if there are any at 
> all. The wrapping behavior reduces bad indexes to good ones by an unsigned 
> modulo operation (which is at worst a mask for powers of two).
> 
> If I?m right, then new API points should start out with wrap semantics, not 
> throw semantics. And old API points should be migrated ASAP.
> 
> There?s no loss of functionality in such a move. Instead the defaults are 
> moved around. Before, throwing was the default and wrapping was an explicit 
> operation. After, wrapping would be the default and throwing would be 
> explicit. Both wrapping and throwing checks are available through explicit 
> calls to VectorShuffle methods checkIndexes and wrapIndexes.
> 
> OK, so why is wrapping better than throwing? And first, why did we start with 
> throwing as the default? Well, we chose throwing as the default to make the 
> vector operations more Java-like. Java scalar operations don?t try to reduce 
> bad array indexes into the array domain; they throw. Since a shuffle op is 
> like an array reference, it makes sense to emulate the checks built into Java 
> array references.
> 
> Or it did make sense. I think there is a technical debt here which is turning 
> out to be hard to pay off. The tech debt is to suppress or hoist or 
> strength-reduce the vector instructions that perform the check for invalid 
> indexes (in parallel), then ask ?did any of those checks fail?? (a mask 
> reduction), then do a conditional branch to failure code. I think I was 
> over-confident that our scalar tactics for reducing array range checks would 
> apply to vectors as well. On second thought, vectorizing our key 
> optimization, of loop range splitting (pre/main/post loops) is kind of a 
> nightmare.
> 
> Instead, consider the alternative of wrapping. First, you use vpand or the 
> like to mask the indexes down to the valid range. Then you run the 
> shuffle/permute instruction. That?s it. There is no scalar query or branch. 
> And, there are probably some circumstances where you can omit the vpand 
> operation: Perhaps the hardware already masks the inputs (as with shift 
> instructions). Or, perhaps C2 can do bitwise inference of the vectors and 
> figure out that the vpand is a nop. (I am agitating for bitwise types in C2; 
> this is a use case for them.) In the worst case, the vpand op is fast and 
> pipelines well.
> 
> This is why I think we should switch, ASAP, to masking instead of throwing, 
> on bad indexes.
> 
> I think some of our reports from customers have shown that the extra checks 
> necessary for throwing on bad indexes are giving their code surprising 
> slowdowns, relative to C-based vector code.
> 
> Did I miss a point?
> 
> ? John
> 
> On 14 Aug 2024, at 3:43, Jatin Bhateja wrote:


Hi @rose00,

I agree that wrapping should be the default behaviour if indices are passed 
through shuffles, idea was to pick exception throwing semantics for out of 
bounds indexes *only* for selectFrom flavour of APIs which accept indexes 
through vector interface, this will save redundant partial wrapping and 
un-wrapping for cross vector permutation API which has a direct mappings in x86 
and AARCH64 ISA.

As @PaulSandoz 
[suggested](https://github.com/openjdk/jdk/pull/20508#pullrequestreview-2234095541)
 we can also tune existing single 'selectFrom' API to adopt default exception 
throwing semantics if any of the indices lies beyond valid index range.

While we will continue keeping default wrapping semantics for APIs accepting 
shuffles, this little deviation of semantics for selectFrom family of APIs will 
enable generating efficient code and will enable users to chooses between the 
rearrange and selectFrom APIs based on convenience vs efficient code trade-off.

Since, API interfaces were crafted keeping in view long term flexibility, 
having multiple permutation interfaces (selectFrom / rearrange) accepting 
indexes though vector or shuffle enables compiler to emit efficient code.

Best Regards,
Jatin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2295785781

Reply via email to