Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v8]
> Hi, > > Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only > targets. > Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 > instruction set. > These are very frequently used APIs in columnar database filter operation. > > Implementation uses a lookup table to record permute indices. Table index is > computed using > mask argument of compress/expand operation. > > Following are the performance number of JMH micro included with the patch. > > > System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) > > Baseline: > Benchmark (size) Mode CntScore Error > Units > ColumnFilterBenchmark.filterDoubleColumn1024 thrpt2 142.767 > ops/ms > ColumnFilterBenchmark.filterDoubleColumn2047 thrpt2 71.436 > ops/ms > ColumnFilterBenchmark.filterDoubleColumn4096 thrpt2 35.992 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 1024 thrpt2 182.151 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt2 91.096 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt2 44.757 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt2 184.099 > ops/ms > ColumnFilterBenchmark.filterIntColumn 2047 thrpt2 91.981 > ops/ms > ColumnFilterBenchmark.filterIntColumn 4096 thrpt2 45.170 > ops/ms > ColumnFilterBenchmark.filterLongColumn 1024 thrpt2 148.017 > ops/ms > ColumnFilterBenchmark.filterLongColumn 2047 thrpt2 73.516 > ops/ms > ColumnFilterBenchmark.filterLongColumn 4096 thrpt2 36.844 > ops/ms > > Withopt: > Benchmark (size) Mode Cnt Score > Error Units > ColumnFilterBenchmark.filterDoubleColumn1024 thrpt2 2051.707 > ops/ms > ColumnFilterBenchmark.filterDoubleColumn2047 thrpt2 914.072 > ops/ms > ColumnFilterBenchmark.filterDoubleColumn4096 thrpt2 489.898 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 1024 thrpt2 5324.195 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt2 4149.384 > ops/ms > ColumnFilterBenchmark.filterIntColumn 2047 thrpt2 1791.170 > ops/ms > ColumnFilterBenchmark.filterIntColumn 4096... Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolution - Changes: - all: https://git.openjdk.org/jdk/pull/17261/files - new: https://git.openjdk.org/jdk/pull/17261/files/b2190fc7..cd912308 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=06-07 Stats: 89 lines in 4 files changed: 20 ins; 50 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/17261.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/17261/head:pull/17261 PR: https://git.openjdk.org/jdk/pull/17261
Re: Gatherer: spliterator characteristics are not propagated
> From: "Viktor Klang" > To: "Remi Forax" > Cc: "core-libs-dev" , "Paul Sandoz" > > Sent: Thursday, January 18, 2024 5:14:38 PM > Subject: Re: Gatherer: spliterator characteristics are not propagated >> And for A.andThen(B), A.flags & B.flags should work, the stream is sorted if >> both gatherers keep it sorted. > That is unfortunately not the case. That would presume that you can implement > the composition such that it can preserve all the common flags. Some flags > could be "dominant" and some "recessive" to use genetics nomenclature. Some flags of the stream pipeline are "recessive", mainly SHORT_CIRCUIT, but not the characteristics of the Gatherer which can have the corresponding "dominant" flag, GREEDY, in this case. And the same for sequential, the flag should be PARALELIZABLE and not SEQUENTIAL. The idea is that the Gatherer characteristics can have the same bit set at the same position as the stream op flags (as defined by StreamOpFlag). So KEEP_DISTINCT is in position 0, KEEP_SORTED in in position 2 and KEEP_SIZED is in position 3. For GREEDY, we use the same position as SHORT_CIRCUIT and we will flip the bit (using XOR) when we want to extract the stream op flags from the characteristics All the factory methods call the generic of() with a combination of PARALELIZABLE and STATELESS and the user can adds the characteristics GREEDY, KEEP_DISTINCT, KEEP_SORTED and KEEP_SIZED (otherwise an exception should be raised). In StreamOpFlag, there are two unused positions (14 and 15), that's perfect for our two new states PARALELIZABLE and STATELESS, so no problem here (technically we can also reuse positions of the Spliterator characteristic given that those flags are masked before being sent to the GathererOp super constructor). The way to transform a Gatherer characteristics op to a stream flags op is to flip the bits corresponding to SHORT_CIRCUIT, add the highter bit of all flags but SHORT-CIRCUIT (because stream op flags are encoded using 2 bits) and mask to only retain SHORT_CIRCUIT, KEEP_DISTINCT, KEEP_SORTED and KEEP_SIZED. public static int toOpFlags ( int characteristics ) { return (( characteristics ^ SHORT_CIRCUIT_MASK ) | HIGHER_BITS ) & STREAM_OP_MASK ; } see below for a full script. >> I suppose that if those flags already exist, it's because they have a purpose >> and i do not understand how it can make the other operations slower. > Extra invocations, extra storage, and extra composition overhead is not free. > Since Stream is one-shot you need to include the construction cost with the > execution cost. For something like an empty Stream construction cost scan be > 90+% of the total costs. If you create a Gatherer, the characteristics is a constant (so the validity check is removed, it's just a mask and a test) so the result of calling toOpFlags() is a constant too. If the factory method is not inlined, the cost is 3 bitwise operations which is I believe faster than the "instanceof Greedy" used in the current code. > Cheers, > √ regards, Rémi --- public class GathererFlag { // cut and paste from StreamOpFlag /** * The bit pattern for setting/injecting a flag. */ private static final int SET_BITS = 0b01 ; /** * The bit pattern for clearing a flag. */ private static final int CLEAR_BITS = 0b10 ; /** * The bit pattern for preserving a flag. */ private static final int PRESERVE_BITS = 0b11 ; private static int position ( int opFlagSet ) { return Integer . numberOfTrailingZeros ( opFlagSet ) >> 1 ; } private static final int DISTINCT_POSITION = position ( StreamOpFlag . IS_DISTINCT ); private static final int SORTED_POSITION = position ( StreamOpFlag . IS_SORTED ); private static final int SIZED_POSITION = position ( StreamOpFlag . IS_SIZED ); private static final int SHORT_CIRCUIT_POSITION = position ( StreamOpFlag . IS_SHORT_CIRCUIT ); private static final int STATELESS_POSITION = 14 ; private static final int PARELLIZABLE_POSITION = 15 ; public static final int PARELLIZABLE = SET_BITS << ( PARELLIZABLE_POSITION << 1 ); public static final int STATELESS = SET_BITS << ( STATELESS_POSITION << 1 ); public static final int GREEDY = SET_BITS << ( SHORT_CIRCUIT_POSITION << 1 ); public static final int KEEP_DISTINCT = SET_BITS << ( DISTINCT_POSITION << 1 ); public static final int KEEP_SORTED = SET_BITS << ( SORTED_POSITION << 1 ); public static final int KEEP_SIZED = SET_BITS << ( SIZED_POSITION << 1 ); private static final int SHORT_CIRCUIT_MASK = SET_BITS << ( SHORT_CIRCUIT_POSITION << 1 ); // no GREEDY here private static final int HIGHER_BITS = ( PARELLIZABLE | STATELESS | KEEP_DISTINCT | KEEP_SORTED | KEEP_SIZED ) << 1 ; private static final int STREAM_OP_MASK = (( GREEDY | KEEP_DISTINCT | KEEP_SORTED | KEEP_SIZED ) << 1 ) | GREEDY | KEEP_DISTINCT | KEEP_SORTED | KEEP_SIZED ; public static String toString ( int characteristics ) { return Stream . of ( characteristics ) .< String >mapM
Re: RFR: 8322292: Rearrange comparison of fields in Record.equals() [v6]
On Fri, 22 Dec 2023 13:00:11 GMT, Sergey Tsypanov wrote: >> Currently if we create a record it's fields are compared in their >> declaration order. This might be ineffective in cases when two objects have >> "heavy" fields equals to each other, but different "lightweight" fields >> (heavy and lightweight in terms of comparison) e.g. primitives, enums, >> nullable/non-nulls etc. >> >> If we have declared a record like >> >> public record MyRecord(String field1, int field2) {} >> >> It's equals() looks like: >> >> public final equals(Ljava/lang/Object;)Z >>L0 >> LINENUMBER 3 L0 >> ALOAD 0 >> ALOAD 1 >> INVOKEDYNAMIC >> equals(Lcom/caspianone/openbanking/productservice/controller/MyRecord;Ljava/lang/Object;)Z >> [ >> // handle kind 0x6 : INVOKESTATIC >> >> java/lang/runtime/ObjectMethods.bootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/TypeDescriptor;Ljava/lang/Class;Ljava/lang/String;[Ljava/lang/invoke/MethodHandle;)Ljava/lang/Object; >> // arguments: >> com.caspianone.openbanking.productservice.controller.MyRecord.class, >> "field1;field2", >> // handle kind 0x1 : GETFIELD >> >> com/caspianone/openbanking/productservice/controller/MyRecord.field1(Ljava/lang/String;), >> // handle kind 0x1 : GETFIELD >> com/caspianone/openbanking/productservice/controller/MyRecord.field2(I) >> ] >> IRETURN >>L1 >> LOCALVARIABLE this >> Lcom/caspianone/openbanking/productservice/controller/MyRecord; L0 L1 0 >> LOCALVARIABLE o Ljava/lang/Object; L0 L1 1 >> MAXSTACK = 2 >> MAXLOCALS = 2 >> >> This can be improved by rearranging the comparison order of the fields >> moving enums and primitives upper, and collections/arrays lower. > > Sergey Tsypanov has updated the pull request incrementally with two > additional commits since the last revision: > > - Merge remote-tracking branch 'origin/record-equals' into record-equals > - 8322292: Add test case Anything else I can do about it? - PR Comment: https://git.openjdk.org/jdk/pull/17143#issuecomment-1902184686
Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]
> Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and > AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather using hybrid algorithm which initially > partially unrolls scalar loop to accumulates values from gather indices into > a quadword(64bit) slice followed by vector permutation to place the slice > into appropriate vector lanes, it prevents code bloating and generates > compact JIT sequence. This coupled with savings from expansive array > allocation in existing java implementation translates into significant > performance of 1.5-10x gains with included micro on Intel Atom family CPUs > and with JVM option UseAVX=2. > >  > > > 2) For AVX512 targets algorithm uses integral gather instructions to load > values from normalized indices which are multiple of integer size, followed > by shuffling and packing exact sub-word values from integral lanes. > > 3) Patch was also compared against modified java fallback implementation by > replacing temporary array allocation with zero initialized vector and a > scalar loops which inserts gathered values into vector. But, vector insert > operation in higher vector lanes is a three step process which first extracts > the upper vector 128 bit lane, updates it with gather subword value and then > inserts the lane back to its original position. This makes inserts into > higher order lanes costly w.r.t to proposed solution. In addition generated > JIT code for modified fallback implementation was very bulky. This may impact > in-lining decisions into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions. - Changes: - all: https://git.openjdk.org/jdk/pull/16354/files - new: https://git.openjdk.org/jdk/pull/16354/files/de47076e..9ed6b502 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=10 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=09-10 Stats: 58 lines in 1 file changed: 18 ins; 15 del; 25 mod Patch: https://git.openjdk.org/jdk/pull/16354.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/16354/head:pull/16354 PR: https://git.openjdk.org/jdk/pull/16354