date:20240120

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v8]

2024-01-20 Thread Jatin Bhateja

> Hi,
> 
> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only 
> targets.
> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 
> instruction set.
> These are very frequently used APIs in columnar database filter operation.
> 
> Implementation uses a lookup table to record permute indices. Table index is 
> computed using
> mask argument of compress/expand operation.
> 
> Following are the performance number of JMH micro included with the patch.
> 
> 
> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
> 
> Baseline:
> Benchmark (size)   Mode  CntScore   Error 
>   Units
> ColumnFilterBenchmark.filterDoubleColumn1024  thrpt2  142.767 
>  ops/ms
> ColumnFilterBenchmark.filterDoubleColumn2047  thrpt2   71.436 
>  ops/ms
> ColumnFilterBenchmark.filterDoubleColumn4096  thrpt2   35.992 
>  ops/ms
> ColumnFilterBenchmark.filterFloatColumn 1024  thrpt2  182.151 
>  ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047  thrpt2   91.096 
>  ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096  thrpt2   44.757 
>  ops/ms
> ColumnFilterBenchmark.filterIntColumn   1024  thrpt2  184.099 
>  ops/ms
> ColumnFilterBenchmark.filterIntColumn   2047  thrpt2   91.981 
>  ops/ms
> ColumnFilterBenchmark.filterIntColumn   4096  thrpt2   45.170 
>  ops/ms
> ColumnFilterBenchmark.filterLongColumn  1024  thrpt2  148.017 
>  ops/ms
> ColumnFilterBenchmark.filterLongColumn  2047  thrpt2   73.516 
>  ops/ms
> ColumnFilterBenchmark.filterLongColumn  4096  thrpt2   36.844 
>  ops/ms
> 
> Withopt:
> Benchmark (size)   Mode  Cnt Score   
> Error   Units
> ColumnFilterBenchmark.filterDoubleColumn1024  thrpt2  2051.707
>   ops/ms
> ColumnFilterBenchmark.filterDoubleColumn2047  thrpt2   914.072
>   ops/ms
> ColumnFilterBenchmark.filterDoubleColumn4096  thrpt2   489.898
>   ops/ms
> ColumnFilterBenchmark.filterFloatColumn 1024  thrpt2  5324.195
>   ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047  thrpt2  2587.229
>   ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096  thrpt2  1278.665
>   ops/ms
> ColumnFilterBenchmark.filterIntColumn   1024  thrpt2  4149.384
>   ops/ms
> ColumnFilterBenchmark.filterIntColumn   2047  thrpt2  1791.170
>   ops/ms
> ColumnFilterBenchmark.filterIntColumn   4096...

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  Review comments resolution

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/17261/files
  - new: https://git.openjdk.org/jdk/pull/17261/files/b2190fc7..cd912308

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=07
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=17261&range=06-07

  Stats: 89 lines in 4 files changed: 20 ins; 50 del; 19 mod
  Patch: https://git.openjdk.org/jdk/pull/17261.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/17261/head:pull/17261

PR: https://git.openjdk.org/jdk/pull/17261

Re: Gatherer: spliterator characteristics are not propagated

2024-01-20 Thread forax

> From: "Viktor Klang" 
> To: "Remi Forax" 
> Cc: "core-libs-dev" , "Paul Sandoz"
> 
> Sent: Thursday, January 18, 2024 5:14:38 PM
> Subject: Re: Gatherer: spliterator characteristics are not propagated

>> And for A.andThen(B), A.flags & B.flags should work, the stream is sorted if
>> both gatherers keep it sorted.

> That is unfortunately not the case. That would presume that you can implement
> the composition such that it can preserve all the common flags. Some flags
> could be "dominant" and some "recessive" to use genetics nomenclature.

Some flags of the stream pipeline are "recessive", mainly SHORT_CIRCUIT, but 
not the characteristics of the Gatherer which can have the corresponding 
"dominant" flag, GREEDY, in this case. 
And the same for sequential, the flag should be PARALELIZABLE and not 
SEQUENTIAL. 

The idea is that the Gatherer characteristics can have the same bit set at the 
same position as the stream op flags (as defined by StreamOpFlag). 
So KEEP_DISTINCT is in position 0, KEEP_SORTED in in position 2 and KEEP_SIZED 
is in position 3. 
For GREEDY, we use the same position as SHORT_CIRCUIT and we will flip the bit 
(using XOR) when we want to extract the stream op flags from the 
characteristics 

All the factory methods call the generic of() with a combination of 
PARALELIZABLE and STATELESS and the user can adds the characteristics GREEDY, 
KEEP_DISTINCT, KEEP_SORTED and KEEP_SIZED (otherwise an exception should be 
raised). 

In StreamOpFlag, there are two unused positions (14 and 15), that's perfect for 
our two new states PARALELIZABLE and STATELESS, so no problem here (technically 
we can also reuse positions of the Spliterator characteristic given that those 
flags are masked before being sent to the GathererOp super constructor). 

The way to transform a Gatherer characteristics op to a stream flags op is to 
flip the bits corresponding to SHORT_CIRCUIT, add the highter bit of all flags 
but SHORT-CIRCUIT (because stream op flags are encoded using 2 bits) and mask 
to only retain SHORT_CIRCUIT, KEEP_DISTINCT, KEEP_SORTED and KEEP_SIZED. 

public static int toOpFlags ( int characteristics ) { 
return (( characteristics ^ SHORT_CIRCUIT_MASK ) | HIGHER_BITS ) & 
STREAM_OP_MASK ; 
} 

see below for a full script. 

>> I suppose that if those flags already exist, it's because they have a purpose
>> and i do not understand how it can make the other operations slower.

> Extra invocations, extra storage, and extra composition overhead is not free.
> Since Stream is one-shot you need to include the construction cost with the
> execution cost. For something like an empty Stream construction cost scan be
> 90+% of the total costs.

If you create a Gatherer, the characteristics is a constant (so the validity 
check is removed, it's just a mask and a test) so the result of calling 
toOpFlags() is a constant too. 

If the factory method is not inlined, the cost is 3 bitwise operations which is 
I believe faster than the "instanceof Greedy" used in the current code. 

> Cheers,
> √

regards, 
Rémi 

--- 
public class GathererFlag { 
// cut and paste from StreamOpFlag 
/** 
* The bit pattern for setting/injecting a flag. 
*/ 
private static final int SET_BITS = 0b01 ; 

/** 
* The bit pattern for clearing a flag. 
*/ 
private static final int CLEAR_BITS = 0b10 ; 

/** 
* The bit pattern for preserving a flag. 
*/ 
private static final int PRESERVE_BITS = 0b11 ; 

private static int position ( int opFlagSet ) { 
return Integer . numberOfTrailingZeros ( opFlagSet ) >> 1 ; 
} 

private static final int DISTINCT_POSITION = position ( StreamOpFlag . 
IS_DISTINCT ); 
private static final int SORTED_POSITION = position ( StreamOpFlag . IS_SORTED 
); 
private static final int SIZED_POSITION = position ( StreamOpFlag . IS_SIZED ); 

private static final int SHORT_CIRCUIT_POSITION = position ( StreamOpFlag . 
IS_SHORT_CIRCUIT ); 
private static final int STATELESS_POSITION = 14 ; 
private static final int PARELLIZABLE_POSITION = 15 ; 

public static final int PARELLIZABLE = SET_BITS << ( PARELLIZABLE_POSITION << 1 
); 
public static final int STATELESS = SET_BITS << ( STATELESS_POSITION << 1 ); 
public static final int GREEDY = SET_BITS << ( SHORT_CIRCUIT_POSITION << 1 ); 

public static final int KEEP_DISTINCT = SET_BITS << ( DISTINCT_POSITION << 1 ); 
public static final int KEEP_SORTED = SET_BITS << ( SORTED_POSITION << 1 ); 
public static final int KEEP_SIZED = SET_BITS << ( SIZED_POSITION << 1 ); 

private static final int SHORT_CIRCUIT_MASK = SET_BITS << ( 
SHORT_CIRCUIT_POSITION << 1 ); 

// no GREEDY here 
private static final int HIGHER_BITS = ( PARELLIZABLE | STATELESS | 
KEEP_DISTINCT | KEEP_SORTED | KEEP_SIZED ) << 1 ; 

private static final int STREAM_OP_MASK = 
(( GREEDY | KEEP_DISTINCT | KEEP_SORTED | KEEP_SIZED ) << 1 ) | 
GREEDY | KEEP_DISTINCT | KEEP_SORTED | KEEP_SIZED ; 

public static String toString ( int characteristics ) { 
return Stream . of ( characteristics ) 
.< String >mapM

Re: RFR: 8322292: Rearrange comparison of fields in Record.equals() [v6]

2024-01-20 Thread Sergey Tsypanov

On Fri, 22 Dec 2023 13:00:11 GMT, Sergey Tsypanov  wrote:

>> Currently if we create a record it's fields are compared in their 
>> declaration order. This might be ineffective in cases when two objects have 
>> "heavy" fields equals to each other, but different "lightweight" fields 
>> (heavy and lightweight in terms of comparison) e.g. primitives, enums, 
>> nullable/non-nulls etc.
>> 
>> If we have declared a record like
>> 
>> public record MyRecord(String field1, int field2) {}
>> 
>> It's equals() looks like:
>> 
>>   public final equals(Ljava/lang/Object;)Z
>>L0
>> LINENUMBER 3 L0
>> ALOAD 0
>> ALOAD 1
>> INVOKEDYNAMIC 
>> equals(Lcom/caspianone/openbanking/productservice/controller/MyRecord;Ljava/lang/Object;)Z
>>  [
>>   // handle kind 0x6 : INVOKESTATIC
>>   
>> java/lang/runtime/ObjectMethods.bootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/TypeDescriptor;Ljava/lang/Class;Ljava/lang/String;[Ljava/lang/invoke/MethodHandle;)Ljava/lang/Object;
>>   // arguments:
>>   com.caspianone.openbanking.productservice.controller.MyRecord.class,
>>   "field1;field2",
>>   // handle kind 0x1 : GETFIELD
>>   
>> com/caspianone/openbanking/productservice/controller/MyRecord.field1(Ljava/lang/String;),
>>   // handle kind 0x1 : GETFIELD
>>   com/caspianone/openbanking/productservice/controller/MyRecord.field2(I)
>> ]
>> IRETURN
>>L1
>> LOCALVARIABLE this 
>> Lcom/caspianone/openbanking/productservice/controller/MyRecord; L0 L1 0
>> LOCALVARIABLE o Ljava/lang/Object; L0 L1 1
>> MAXSTACK = 2
>> MAXLOCALS = 2
>> 
>> This can be improved by rearranging the comparison order of the fields 
>> moving enums and primitives upper, and collections/arrays lower.
>
> Sergey Tsypanov has updated the pull request incrementally with two 
> additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'origin/record-equals' into record-equals
>  - 8322292: Add test case

Anything else I can do about it?

-

PR Comment: https://git.openjdk.org/jdk/pull/17143#issuecomment-1902184686

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

2024-01-20 Thread Jatin Bhateja

> Hi All,
> 
> This patch optimizes sub-word gather operation for x86 targets with AVX2 and 
> AVX512 features.
> 
> Following is the summary of changes:-
> 
> 1) Intrinsify sub-word gather using hybrid algorithm which initially 
> partially unrolls scalar loop to accumulates values from gather indices into 
> a quadword(64bit) slice followed by vector permutation to place the slice 
> into appropriate vector lanes, it prevents code bloating and generates 
> compact JIT sequence. This coupled with savings from expansive array 
> allocation in existing java implementation translates into significant 
> performance of 1.5-10x gains with included micro on Intel Atom family CPUs 
> and with JVM option UseAVX=2.
> 
> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
> 
> 
> 2) For AVX512 targets algorithm uses integral gather instructions to load 
> values from normalized indices which are multiple of integer size, followed 
> by shuffling and packing exact sub-word values from integral lanes. 
> 
> 3) Patch was also compared against modified java fallback implementation by 
> replacing temporary array allocation with zero initialized vector and a 
> scalar loops which inserts gathered values into vector. But, vector insert 
> operation in higher vector lanes is a three step process which first extracts 
> the upper vector 128 bit lane, updates it with gather subword value and then 
> inserts the lane back to its original position. This makes inserts into 
> higher order lanes costly w.r.t to proposed solution. In addition generated 
> JIT code for modified fallback implementation was very bulky. This may impact 
> in-lining decisions into caller contexts.
> 
> Kindly review and share your feedback.
> 
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request incrementally with one additional 
commit since the last revision:

  Review comments resolutions.

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/16354/files
  - new: https://git.openjdk.org/jdk/pull/16354/files/de47076e..9ed6b502

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=16354&range=09-10

  Stats: 58 lines in 1 file changed: 18 ins; 15 del; 25 mod
  Patch: https://git.openjdk.org/jdk/pull/16354.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/16354/head:pull/16354

PR: https://git.openjdk.org/jdk/pull/16354

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v8]

Re: Gatherer: spliterator characteristics are not propagated

Re: RFR: 8322292: Rearrange comparison of fields in Record.equals() [v6]

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

4 matches

Site Navigation

Mail list logo

Footer information