Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Quan Anh Mai Wed, 06 Nov 2024 09:40:37 -0800

On Mon, 14 Oct 2024 12:12:58 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>>> I am having a similar idea that is to group those transformations together 
>>> into a `Phase` called `PhaseLowering`
>> 
>> I think such a phase could be quite useful in general. Recently I was trying 
>> to implement the BMI1 instruction `bextr` for better performance with bit 
>> masks, but ran into a problem where it doesn't have an immediate encoding so 
>> we'd need to manifest a constant into a temporary register every time. With 
>> an (x86-specific) ideal node, we could simply let the register allocator 
>> handle placing the constant. It would also be nice to avoid needing to put 
>> similar backend-specific lowerings (such as `MacroLogicV`) in shared code.
>
> Hey @jaskarth, @merykitty, we already have infrastructure where, during 
> parsing, we create macro nodes that can be lowered / expanded to multiple IR 
> nodes during macro expansion. What we need in this case is a target-specific 
> IR pattern check, since not all targets may support 32x32 multiplication with 
> quadword saturation. The idea is to avoid creating a new IR node and to 
> piggyback the needed information on the existing MulVL IR; we already use 
> such tricks for relaxed unsafe reductions. Going forward, infusion of 
> KnownBits into our data flow analysis infrastructure will streamline such 
> optimizations. This patch performs a point optimization for a specific set 
> of constrained multiplication patterns.

@jatin-bhateja That is machine-independent lowering; we are talking about 
machine-dependent lowering, to which the `MacroLogicV` transformation belongs. 
With a `phaselowering_x86` you would not have to add another method to 
`Matcher`, nor add default implementations to the various architecture files. 
You can reuse the `MulVL` node for that, but I believe these transformations 
should be done as late as possible.
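
To make the shape of that idea concrete, here is a rough sketch on a toy IR. 
It is not HotSpot code: `Node`, `Op`, `MulVL_Lo32U`, `PhaseLowering`, 
`PhaseLoweringX86` and the `inputs_fit_in_32_bits` flag are all hypothetical 
names made up for illustration. The point is only that shared code carries a 
do-nothing default, and the x86-only part holds the target-specific rewrite.

```cpp
#include <vector>

// Toy IR for illustration only; none of these names exist in HotSpot.
enum class Op { MulVL, MulVL_Lo32U, Other };

struct Node {
    Op op;
    // Hypothetical stand-in for a real value-range / known-bits query:
    // true when both multiplicands have their upper 32 bits clear.
    bool inputs_fit_in_32_bits;
};

// Shared code: the default lowering leaves every node untouched, so other
// architectures need no extra Matcher method or per-arch default.
struct PhaseLowering {
    virtual ~PhaseLowering() = default;
    virtual void lower(Node&) const {}

    // Visit each node once and let the platform hook rewrite it in place.
    void run(std::vector<Node>& graph) const {
        for (Node& n : graph) {
            lower(n);
        }
    }
};

// x86-only code: rewrite a constrained MulVL into an x86-specific node
// that a backend could later match to a 32x32->64 multiply.
struct PhaseLoweringX86 : PhaseLowering {
    void lower(Node& n) const override {
        if (n.op == Op::MulVL && n.inputs_fit_in_32_bits) {
            n.op = Op::MulVL_Lo32U;
        }
    }
};
```

Running `PhaseLoweringX86().run(graph)` late in compilation would rewrite only 
the constrained `MulVL` nodes, while the base `PhaseLowering` leaves the graph 
unchanged on every other target.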

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2411389030
