On Sun, 27 Oct 2024 01:55:41 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>> Thanks for looking at the build changes, @magicus! I've pushed a commit that 
>> removes the extra newline in the makefiles and adds newlines to the ends of 
>> files that were missing them.
>> 
>> Thanks for taking a look as well, @merykitty and @jatin-bhateja! I've pushed 
>> a commit that should address the code suggestions and left some comments as 
>> well.
>
>> @jaskarth thanks for exploring platform-specific lowering!
>> 
>> I briefly looked through the changes, but I didn't get a good understanding 
>> of its design goals. It's hard to see what use cases it is targeted for when 
>> only skeleton code is present. It would really help if some cases were 
>> ported on top of it for illustration purposes and to solidify the design.
>> 
>> Currently, there are multiple places in the code where IR lowering happens. 
>> In particular:
>> 
>> * IGVN during post loop opts phase (guarded by 
>> `Compile::post_loop_opts_phase()`) (Ideal -> Ideal);
>> * macro expansion (Ideal -> Ideal);
>> * ad-hoc passes (GC barrier expansion, `MacroLogicV`) (Ideal -> Ideal);
>> * final graph reshaping (Ideal -> Ideal);
>> * matcher (Ideal -> Mach).
>> 
>> I'd like to understand how the new pass is intended to interact with 
>> existing cases.
>> 
>> Only the last one is truly platform-specific, but there are some 
>> platform-specific cases exposed in other places (e.g., the MacroLogicV 
>> pass, DivMod combining) guarded by predicates on `Matcher`.
>> 
>> As `PhaseLowering` is implemented now, it looks like a platform-specific 
>> macro expansion pass (1->many rewriting). But the `MacroLogicV` case 
>> doesn't fit such a model well.
>> 
> 
>> I see changes to enable platform-specific node classes. As of now, only the 
>> Matcher introduces platform-specific nodes, and all of them are Mach nodes. 
>> Platform-specific Ideal nodes are declared in shared code, but then their 
>> usages are guarded by `Matcher::has_match_rule()`, thus ensuring there's 
>> enough support on the back-end side.
>> 
>> Some random observations:
>> 
>> * the pass is performed unconditionally and iterates over all live nodes; 
>> in contrast, macro nodes and nodes for post-loop opts IGVN are explicitly 
>> listed on the side (the MacroLogicV pass is also guarded, but by a 
>> coarse-grained check);
> 
> MacroLogicV is currently a standalone pass and was not implemented as 
> discrete Idealizations split across the various logic IR nodes. Our 
> intention was to create a standalone pass, limit the exposure of such 
> complex logic packing to the rest of the code, and guard it with a 
> target-specific `match_rule_supported` check. 
> To fit this into the new lowering interface, we may need to split root 
> detection across the various logic IR nodes accepted by 
> `PhaseLowering::lower_to(Node*)`.
> 
> @jaskarth, @merykitty, can you share more use cases apart from VPMUL[U]DQ 
> and MacroLogicV detection in support of the new lowering pass? Please be 
> aware that even after introducing the lowering pass we stil...

@jatin-bhateja @iwanowww The applicability of lowering is very broad, as it 
can help us perform arbitrary transformations as well as take advantage of 
GVN in the ideal world:

1, Any expansion that can benefit from GVN can be done in this pass. The first 
example is `ExtractXNode`s. Currently, they are expanded during code emission. 
An `int` extraction at index 5 is currently expanded to:

    vextracti128 xmm1, ymm0, 1
    vpextrd eax, xmm1, 1

If we try to extract multiple elements, then `vextracti128` would be needlessly 
emitted multiple times. By moving the expansion from code emission to lowering, 
we can apply GVN and eliminate the redundant operations. For vector insertions 
the situation is even worse, as each insertion is expanded into multiple 
instructions. For example, to construct a vector from 4 `long` values, we 
would have to:

    vpxor xmm0, xmm0, xmm0

    vmovdqu xmm1, xmm0
    vpinsrq xmm1, xmm1, rax, 0
    vinserti128 ymm0, ymm0, xmm1, 0

    vmovdqu xmm1, xmm0
    vpinsrq xmm1, xmm1, rcx, 1
    vinserti128 ymm0, ymm0, xmm1, 0
    
    vextracti128 xmm1, ymm0, 1
    vpinsrq xmm1, xmm1, rdx, 0
    vinserti128 ymm0, ymm0, xmm1, 1

    vextracti128 xmm1, ymm0, 1
    vpinsrq xmm1, xmm1, rbx, 1
    vinserti128 ymm0, ymm0, xmm1, 1

By moving the expansion to lowering, we can emit a much more efficient sequence:

    vmovq xmm0, rax
    vpinsrq xmm0, xmm0, rcx, 1
    vmovq xmm1, rdx
    vpinsrq xmm1, xmm1, rbx, 1
    vinserti128 ymm0, ymm0, xmm1, 1

Other examples include vector mask operations, which can be expanded into a 
`VectorMaskToLong` operation followed by a bit-counting one, and unsigned 
comparisons on AVX2, which are expanded into two `xor`s with the minimum 
signed value followed by a signed comparison.
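To make the unsigned-comparison trick concrete, here is its scalar analogue in plain Java (the class and method names are purely illustrative, not from the patch): xor-ing both operands with the minimum signed value flips the sign bit, which turns an unsigned comparison into a signed one the hardware supports directly.

```java
// Sketch of the unsigned-comparison trick: flip the sign bit of both
// operands (xor with the minimum signed value), then compare signed.
public class UnsignedCompare {
    static boolean unsignedLess(int a, int b) {
        return (a ^ Integer.MIN_VALUE) < (b ^ Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF, i.e. the largest unsigned 32-bit value
        System.out.println(unsignedLess(1, -1));                 // prints "true"
        System.out.println(Integer.compareUnsigned(1, -1) < 0);  // prints "true"
    }
}
```

This is the same identity the AVX2 expansion relies on, just applied lane-wise with vector `xor` and a signed vector compare.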

2, @jaskarth mentioned their experiments with the `bextr` instruction. It 
allows quick bit extraction, but the downside is that it does not accept an 
immediate operand. As a result, an immediate operand needs to be materialized 
each time. By doing the expansion during lowering, we can allow that 
materialization to be moved outside of loops. Furthermore, this can be done 
for variable `bextr`, too: if the start and len parameters are loop-invariant, 
then the operand for `bextr` can be computed outside the loop.
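For reference, the pattern `bextr` implements corresponds to the following plain-Java bit extraction (names here are illustrative only). Since the instruction takes its start/length control from a register rather than an immediate, expanding it during lowering lets GVN hoist the materialized control word out of a loop whenever it is loop-invariant.

```java
// Sketch of the bit-field extraction BEXTR performs: take `len` bits
// of `x` starting at bit `start` (assuming 0 <= len < 64). BEXTR reads
// start/len from a register operand, so a constant control word must
// be materialized into a register before each use.
public class BextrPattern {
    static long extractBits(long x, int start, int len) {
        return (x >>> start) & ((1L << len) - 1);
    }

    public static void main(String[] args) {
        // Extract bits [8, 16) of 0xABCD12
        System.out.println(Long.toHexString(extractBits(0xABCD12L, 8, 8))); // prints "cd"
    }
}
```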

3, x86 has several atomic instructions; the downside is that most of them do 
not return the old value, a property only `xadd` has. As a result, 
`VarHandle::getAndBitwiseAnd` is implemented as a compare-exchange loop. This 
phase may recognise the pattern and transform it into a `lock and` if the old 
value is not used.
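As a concrete shape of that pattern, here is a hedged Java sketch (the class and field names are made up for illustration) of a `getAndBitwiseAnd` whose return value is discarded, which is exactly the case such a lowering could turn into a single `lock and`:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// A getAndBitwiseAnd whose old value is ignored. C2 currently compiles
// this as a compare-exchange loop; a lowering pass that sees the result
// is dead could emit a single `lock and` instead.
public class AtomicAndExample {
    static volatile int flags = 0b1111;
    static final VarHandle FLAGS;
    static {
        try {
            FLAGS = MethodHandles.lookup()
                    .findStaticVarHandle(AtomicAndExample.class, "flags", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        // Return value discarded -> candidate for `lock and`
        FLAGS.getAndBitwiseAnd(0b1010);
        System.out.println(Integer.toBinaryString(flags)); // prints "1010"
    }
}
```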

4, etc

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2440021718
