On Thu, 8 Aug 2024 06:57:28 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
> Hi All,
>
> As per the discussion on the panama-dev mailing list[1], this patch adds support
> for the following new two vector permutation APIs.
>
> Declaration:-
>     Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
>
> Semantics:-
>     Using index values stored in the lanes of "this" vector, assemble the
>     values stored in the first (v1) and second (v2) vector arguments. Thus, the
>     first and second vectors serve as a table whose elements are selected based
>     on the index vector. The API is applicable to all integral and
>     floating-point types. The result of this operation is semantically
>     equivalent to the expression v1.rearrange(this.toShuffle(), v2). Values held
>     in index vector lanes must lie within the valid two-vector index range
>     [0, 2*VLEN), else an IndexOutOfBoundsException is thrown.
>
> Summary of changes:
> - Java side implementation of the new selectFrom API.
> - C2 compiler IR and inline expander changes.
> - In the absence of a direct two-vector permutation instruction in the target
>   ISA, a lowering transformation dismantles the new IR into constituent IR
>   supported by target platforms.
> - Optimized x86 backend implementation for AVX512 and legacy targets.
> - Function tests covering the new API.
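A minimal scalar sketch of the semantics quoted above may help readers who have not used the API. It models vectors as plain `int` arrays and does not call `jdk.incubator.vector`. The class name `SelectFromModel` and the method shape are hypothetical, and the mapping of indexes in [0, VLEN) to v1 and [VLEN, 2*VLEN) to v2 is an assumption based on the "table" wording in the description.

```java
import java.util.Arrays;

// Scalar model of the two-vector selectFrom semantics described in the quoted text.
// Plain int arrays stand in for vectors; this is NOT the jdk.incubator.vector API.
final class SelectFromModel {

    // Assumption: an index in [0, VLEN) selects from v1,
    // an index in [VLEN, 2*VLEN) selects from v2.
    static int[] selectFrom(int[] index, int[] v1, int[] v2) {
        int vlen = index.length;
        if (v1.length != vlen || v2.length != vlen) {
            throw new IllegalArgumentException("all three vectors must have the same length");
        }
        int[] result = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            int idx = index[i];
            // Throwing semantics, as in the quoted description.
            if (idx < 0 || idx >= 2 * vlen) {
                throw new IndexOutOfBoundsException(
                        "lane " + i + ": index " + idx + " is outside [0, " + (2 * vlen) + ")");
            }
            result[i] = idx < vlen ? v1[idx] : v2[idx - vlen];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] index = {0, 5, 2, 7};
        int[] v1 = {10, 11, 12, 13};
        int[] v2 = {20, 21, 22, 23};
        // Lanes 0 and 2 come from v1, lanes 1 and 3 from v2.
        System.out.println(Arrays.toString(selectFrom(index, v1, v2))); // [10, 21, 12, 23]
    }
}
```

The loop is only a reference model for reasoning about results; the point of the patch is that a target with a direct two-vector permute instruction can do this without a per-lane loop.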
>
> JMH micro included with this patch shows around 10-15x gain over existing
> rearrange API :-
> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
>
> Benchmark                                     (size)   Mode  Cnt      Score  Error  Units
> SelectFromBenchmark.rearrangeFromByteVector     1024  thrpt    2   2041.762         ops/ms
> SelectFromBenchmark.rearrangeFromByteVector     2048  thrpt    2   1028.550         ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      1024  thrpt    2    962.605         ops/ms
> SelectFromBenchmark.rearrangeFromIntVector      2048  thrpt    2    479.004         ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     1024  thrpt    2    359.758         ops/ms
> SelectFromBenchmark.rearrangeFromLongVector     2048  thrpt    2    178.192         ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    1024  thrpt    2   1463.459         ops/ms
> SelectFromBenchmark.rearrangeFromShortVector    2048  thrpt    2    727.556         ops/ms
> SelectFromBenchmark.selectFromByteVector        1024  thrpt    2  33254.830         ops/ms
> SelectFromBenchmark.selectFromByteVector        2048  thrpt    2  17313.174         ops/ms
> SelectFromBenchmark.selectFromIntVector         1024  thrpt    2  10756.804         ops/ms
> SelectFromBenchmark.selectFromIntVector         2048  thrpt    2  5398.2...

> _Mailing list message from [John Rose](mailto:john.r.r...@oracle.com) on
> [hotspot-compiler-dev](mailto:hotspot-compiler-...@mail.openjdk.org):_
>
> (Better late than never, although I wish I'd been more explicit about this on
> panama-dev.)
>
> I think we should be moving away from throwing exceptions on all
> reorder/shuffle/permute vector ops, and moving toward wrapping. These ops all
> operate on vectors (small arrays) of vector lane indexes (small array indexes
> in a fixed domain, always a power of two). The throwing behavior checks an
> input for bad indexes and throws a (scalar) exception if there are any at
> all. The wrapping behavior reduces bad indexes to good ones by an unsigned
> modulo operation (which is at worst a mask for powers of two).
>
> If I'm right, then new API points should start out with wrap semantics, not
> throw semantics.
> And old API points should be migrated ASAP.
>
> There's no loss of functionality in such a move. Instead the defaults are
> moved around. Before, throwing was the default and wrapping was an explicit
> operation. After, wrapping would be the default and throwing would be
> explicit. Both wrapping and throwing checks are available through explicit
> calls to VectorShuffle methods checkIndexes and wrapIndexes.
>
> OK, so why is wrapping better than throwing? And first, why did we start with
> throwing as the default? Well, we chose throwing as the default to make the
> vector operations more Java-like. Java scalar operations don't try to reduce
> bad array indexes into the array domain; they throw. Since a shuffle op is
> like an array reference, it makes sense to emulate the checks built into Java
> array references.
>
> Or it did make sense. I think there is a technical debt here which is turning
> out to be hard to pay off. The tech debt is to suppress or hoist or
> strength-reduce the vector instructions that perform the check for invalid
> indexes (in parallel), then ask "did any of those checks fail?" (a mask
> reduction), then do a conditional branch to failure code. I think I was
> over-confident that our scalar tactics for reducing array range checks would
> apply to vectors as well. On second thought, vectorizing our key
> optimization, of loop range splitting (pre/main/post loops) is kind of a
> nightmare.
>
> Instead, consider the alternative of wrapping. First, you use vpand or the
> like to mask the indexes down to the valid range. Then you run the
> shuffle/permute instruction. That's it. There is no scalar query or branch.
> And, there are probably some circumstances where you can omit the vpand
> operation: Perhaps the hardware already masks the inputs (as with shift
> instructions). Or, perhaps C2 can do bitwise inference of the vectors and
> figure out that the vpand is a nop. (I am agitating for bitwise types in C2;
> this is a use case for them.)
> In the worst case, the vpand op is fast and
> pipelines well.
>
> This is why I think we should switch, ASAP, to masking instead of throwing,
> on bad indexes.
>
> I think some of our reports from customers have shown that the extra checks
> necessary for throwing on bad indexes are giving their code surprising
> slowdowns, relative to C-based vector code.
>
> Did I miss a point?
>
> -- John
>
> On 14 Aug 2024, at 3:43, Jatin Bhateja wrote:

Hi @rose00,

I agree that wrapping should be the default behaviour when indices are passed through shuffles. The idea was to pick exception-throwing semantics for out-of-bounds indexes *only* for the selectFrom flavour of APIs, which accept indexes through the vector interface. This saves redundant partial wrapping and unwrapping for the cross-vector permutation API, which has direct mappings in the x86 and AArch64 ISAs.

As @PaulSandoz [suggested](https://github.com/openjdk/jdk/pull/20508#pullrequestreview-2234095541), we can also tune the existing single-vector 'selectFrom' API to adopt default exception-throwing semantics if any of the indices lies beyond the valid index range.

We will keep default wrapping semantics for APIs accepting shuffles. This small deviation in semantics for the selectFrom family of APIs will enable generating efficient code, and will let users choose between the rearrange and selectFrom APIs based on a convenience versus efficiency trade-off. Since the API interfaces were crafted with long-term flexibility in mind, having multiple permutation interfaces (selectFrom / rearrange) that accept indexes through a vector or a shuffle enables the compiler to emit efficient code.

Best Regards,
Jatin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20508#issuecomment-2295785781
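
For readers comparing the two index policies discussed in this thread, here is a minimal scalar sketch, assuming a power-of-two lane count. The class `IndexPolicyModel` is hypothetical; its `checkIndexes` and `wrapIndexes` methods only borrow the names of the VectorShuffle methods mentioned above and operate on plain `int` arrays, not on the real Vector API types.

```java
import java.util.Arrays;

// Scalar stand-ins for the two policies on out-of-range lane indexes.
// Hypothetical helper class; it does not use jdk.incubator.vector.
final class IndexPolicyModel {

    // Throwing policy: a single bad index fails the whole operation.
    // In vector code this is a parallel compare, a mask reduction, and a branch.
    static int[] checkIndexes(int[] indexes, int vlen) {
        for (int i = 0; i < indexes.length; i++) {
            if (indexes[i] < 0 || indexes[i] >= vlen) {
                throw new IndexOutOfBoundsException(
                        "lane " + i + ": index " + indexes[i] + " is outside [0, " + vlen + ")");
            }
        }
        return indexes;
    }

    // Wrapping policy: vlen is assumed to be a power of two, so the unsigned
    // modulo reduces to a bitwise AND (the scalar analogue of a vpand).
    static int[] wrapIndexes(int[] indexes, int vlen) {
        int[] wrapped = new int[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            wrapped[i] = indexes[i] & (vlen - 1);
        }
        return wrapped;
    }

    public static void main(String[] args) {
        int[] indexes = {0, 5, -1, 3};
        // 5 & 3 == 1 and -1 & 3 == 3, so no branch or exception is needed.
        System.out.println(Arrays.toString(wrapIndexes(indexes, 4))); // [0, 1, 3, 3]
        try {
            checkIndexes(indexes, 4);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("checkIndexes threw: " + e.getMessage());
        }
    }
}
```

The wrapping path has no data-dependent control flow, which is the property John's message argues makes it cheaper to compile well than the check-reduce-branch sequence.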