On Mon, 28 Oct 2024 22:46:55 GMT, Vladimir Ivanov <vliva...@openjdk.org> wrote:

>>> @jatin-bhateja @iwanowww The application of lowering is very broad as it 
>>> can help us perform arbitrary transformations as well as take advantage of 
>>> GVN in the ideal world:
>>> 
>>> 1. Any expansion that can benefit from GVN can be done in this pass. The 
>>> first example is `ExtractXNode`s. Currently, they are expanded during code 
>>> emission. An `int` extraction at index 5 is currently expanded to:
>>> 
>>> ```
>>> vextracti128 xmm1, ymm0, 1
>>> vpextrd eax, xmm1, 1
>>> ```
>>> 
>>> If we try to extract multiple elements then `vextracti128` would be 
>>> needlessly emitted multiple times. By moving the expansion from code 
>>> emission to lowering, we can do GVN and eliminate the redundant operations. 
>>> For vector insertions, the situation is even worse, as it would be expanded 
>>> into multiple instructions. For example, to construct a vector from 4 long 
>>> values, we would have to:
>>> 
>>> ```
>>> vpxor xmm0, xmm0, xmm0
>>> 
>>> vmovdqu xmm1, xmm0
>>> vpinsrq xmm1, xmm1, rax, 0
>>> vinserti128 ymm0, ymm0, xmm1, 0
>>> 
>>> vmovdqu xmm1, xmm0
>>> vpinsrq xmm1, xmm1, rcx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 0
>>> 
>>> vextracti128 xmm1, ymm0, 1
>>> vpinsrq xmm1, xmm1, rdx, 0
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> 
>>> vextracti128 xmm1, ymm0, 1
>>> vpinsrq xmm1, xmm1, rbx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> ```
>>> 
>>> By moving the expansion to lowering we can have a much more efficient 
>>> sequence:
>>> 
>>> ```
>>> vmovq xmm0, rax
>>> vpinsrq xmm0, xmm0, rcx, 1
>>> vmovq xmm1, rdx
>>> vpinsrq xmm1, xmm1, rbx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> ```
>>> 
>> 
>> Hi @jaskarth 
>> Target-specific IR complements the lowering pass, and the example above 
>> very appropriately showcases the usefulness of a lowering pass. For 
>> completeness, we should extend this patch and add target-specific 
>> extensions to "opto/classes.hpp" and a new `<target>Node.hpp` to record new 
>> target-specific IR definitions.
>> 
>> Hi @merykitty ,
>> Lowering will also reduce register pressure, since we may be able to save 
>> additional temporary machine operands by splitting monolithic instruction 
>> encoding blocks across multiple lowered IR nodes. This, together with 
>> GVN-promoted sharing, should be very powerful.
>
>> The application of lowering is very broad as it can help us perform 
>> arbitrary transformations as well as take advantage of GVN 
> 
> @merykitty thanks for the examples. The idea of gradual IR lowering is not 
> new in C2. There are precedents in the code base, so I'd like to better 
> understand how the new pass improves the overall situation. Introducing a way 
> to perform arbitrary platform-specific transformations on Ideal does sound 
> very powerful, but it also turns Ideal IR into platform-specific dialects 
> which don't have to work with existing transformations (IGVN, in particular).
> 
> Do the use cases mentioned so far justify a platform-specific lowering pass 
> on Ideal IR which is intended to produce platform-specific Ideal IR shapes? I 
> don't know yet.
> 
> Also, there are alternative places where platform-specific transformations 
> can take place (macro expansion, final graph reshaping, custom matching 
> logic). Worth considering them as well.

@iwanowww I hope to address some of your concerns:

> It looks attractive at first, but the downside is subsequent passes may start 
> to require platform-specific code as well (e.g., think of final graph 
> reshaping which operates on Node opcodes). Also, total number of 
> platform-specific Ideal nodes was low (especially, when compared to Mach 
> nodes generated from AD files). So, keeping relevant code shared and guarding 
> its usages with `Matcher::match_rule_supported()` seems appropriate.

That would not be possible without a stretch. Consider my `ExtractINode` 
example above: since `Matcher::match_rule_supported(Op_ExtractI)` will surely 
return `true`, we would need another `Matcher` method to decide when and how 
to expand such a node. It is a rather peculiar property of x86 that element 
extraction/insertion operations are only available on 128-bit vectors, so to 
access an element at a higher index we need to extract the corresponding 
128-bit lane first. What do you think about keeping the node declarations in 
shared code but putting the lowering transformations in the backend-specific 
source files? We can then use prefixes to denote that a node is available on a 
specific backend only.
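
To make the extraction case concrete, here is a minimal toy sketch of the 
idea. It is not C2 code: `Graph`, `Op`, `Node` and `lower_extract_int` are 
hypothetical names, and the value-numbering table is a plain map rather than 
C2's node hash table. It assumes a 256-bit vector of `int`s and, for 
simplicity, always emits a lane extraction, even for lane 0.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical opcodes for the toy IR (not C2 opcodes).
enum class Op { VectorInput, ExtractLane128, ExtractIntInLane };

// A node has an opcode, one input (by id, -1 if none) and an immediate.
struct Node {
  Op op;
  int in;
  int imm;
};

// Toy graph with value numbering: asking for a node that is structurally
// identical to an existing one returns the existing node's id.
class Graph {
 public:
  int make(Op op, int in, int imm) {
    Key key{op, in, imm};
    auto it = table_.find(key);
    if (it != table_.end()) {
      return it->second;  // value-numbering hit: share the node
    }
    int id = static_cast<int>(nodes_.size());
    nodes_.push_back(Node{op, in, imm});
    table_.emplace(key, id);
    return id;
  }
  int size() const { return static_cast<int>(nodes_.size()); }
  const Node& at(int id) const { return nodes_.at(id); }

 private:
  using Key = std::tuple<Op, int, int>;
  std::map<Key, int> table_;
  std::vector<Node> nodes_;
};

// Lower "extract int element `index` of a 256-bit vector" into a 128-bit
// lane extraction followed by an extraction within that lane.
int lower_extract_int(Graph& g, int vec, int index) {
  assert(index >= 0 && index < 8);  // 8 ints in a 256-bit vector
  const int ints_per_lane = 4;      // 128 bits / 32 bits
  int lane = g.make(Op::ExtractLane128, vec, index / ints_per_lane);
  return g.make(Op::ExtractIntInLane, lane, index % ints_per_lane);
}
```

Lowering the extractions at indices 5 and 6 from the same vector this way 
creates only one `ExtractLane128` node, which both element extractions share. 
That sharing is what is lost when the expansion happens at code emission.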

> `MacroLogicV` pass is guarded by `C->max_vector_size() > 0` and 
> `Matcher::match_rule_supported(Op_MacroLogicV)` which (1) limits it to 
> AVX512-capable hardware; and (2) ensures that some vector nodes were produced 
> during compilation. It is a coarser-grained check than strictly required, but 
> very effective at detecting when there are no optimization opportunities 
> present.

I don't think this is a concern: enumerating all live nodes once without doing 
anything is not expensive.

> The idea of gradual IR lowering is not new in C2. There are precedents in the 
> code base, so I'd like to better understand how the new pass improves the 
> overall situation. Introducing a way to perform arbitrary platform-specific 
> transformations on Ideal does sound very powerful, but it also turns Ideal IR 
> into platform-specific dialects which don't have to work with existing 
> transformations (IGVN, in particular).

That's why it is intended to be executed only after general `igvn`.

> Do the use cases mentioned so far justify a platform-specific lowering pass 
> on Ideal IR which is intended to produce platform-specific Ideal IR shapes? I 
> don't know yet.

As you have mentioned, we already have platform-specific transformations; the 
issue is that they are fragmented across shared code. Introducing lowering 
allows us to consolidate them in one place, with platform-specific 
transformations living in platform-specific code. In addition, it allows us to 
perform more platform-specific transformations in a scalable manner, such as 
#21244.

> Also, there are alternative places where platform-specific transformations 
> can take place (macro expansion, final graph reshaping, custom matching 
> logic). Worth considering them as well.

Macro expansion would be too early, as we still do platform-independent `igvn` 
there, while final graph reshaping and custom matching logic would be too late, 
as we have destroyed the node hash table already.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2442875876
