On Mon, 28 Oct 2024 22:46:55 GMT, Vladimir Ivanov <vliva...@openjdk.org> wrote:
>>> @jatin-bhateja @iwanowww The application of lowering is very broad as it can help us perform arbitrary transformations as well as take advantage of GVN in the ideal world:
>>>
>>> 1. Any expansion that can benefit from GVN can be done in this pass. The first example is `ExtractXNode`s. Currently, it is expanded during code emission. An `int` extraction at index 5 is currently expanded to:
>>>
>>> ```
>>> vextracti128 xmm1, ymm0, 1
>>> vpextrd eax, xmm1, 1
>>> ```
>>>
>>> If we try to extract multiple elements then `vextracti128` would be needlessly emitted multiple times. By moving the expansion from code emission to lowering, we can do GVN and eliminate the redundant operations. For vector insertions, the situation is even worse, as each one is expanded into multiple instructions. For example, to construct a vector from 4 `long` values, we would have to:
>>>
>>> ```
>>> vpxor xmm0, xmm0, xmm0
>>>
>>> vmovdqu xmm1, xmm0
>>> vpinsrq xmm1, xmm1, rax, 0
>>> vinserti128 ymm0, ymm0, xmm1, 0
>>>
>>> vmovdqu xmm1, xmm0
>>> vpinsrq xmm1, xmm1, rcx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 0
>>>
>>> vextracti128 xmm1, ymm0, 1
>>> vpinsrq xmm1, xmm1, rdx, 0
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>>
>>> vextracti128 xmm1, ymm0, 1
>>> vpinsrq xmm1, xmm1, rbx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> ```
>>>
>>> By moving the expansion to lowering we can get a much more efficient sequence:
>>>
>>> ```
>>> vmovq xmm0, rax
>>> vpinsrq xmm0, xmm0, rcx, 1
>>> vmovq xmm1, rdx
>>> vpinsrq xmm1, xmm1, rbx, 1
>>> vinserti128 ymm0, ymm0, xmm1, 1
>>> ```
>>
>> Hi @jaskarth
>> Target-specific IR complements the lowering pass; the example above very appropriately showcases its usefulness. For completeness we should extend this patch and add target-specific extensions to "opto/classes.hpp" and a new "<target>Node.hpp" to record new target-specific IR definitions.
>>
>> Hi @merykitty,
>> Lowering will also reduce register pressure, since we may be able to save additional temporary machine operands by splitting monolithic instruction encoding blocks across multiple lowered IR nodes; this, together with GVN-promoted sharing, should be very powerful.
>
>> The application of lowering is very broad as it can help us perform arbitrary transformations as well as take advantage of GVN
>
> @merykitty thanks for the examples. The idea of gradual IR lowering is not new in C2. There are precedents in the code base, so I'd like to better understand how the new pass improves the overall situation. Introducing a way to perform arbitrary platform-specific transformations on Ideal does sound very powerful, but it also turns Ideal IR into platform-specific dialects which don't have to work with existing transformations (IGVN, in particular).
>
> Do the use cases mentioned so far justify a platform-specific lowering pass on Ideal IR which is intended to produce platform-specific Ideal IR shapes? I don't know yet.
>
> Also, there are alternative places where platform-specific transformations can take place (macro expansion, final graph reshaping, custom matching logic). Worth considering them as well.

@iwanowww I hope to address some of your concerns:

> It looks attractive at first, but the downside is subsequent passes may start to require platform-specific code as well (e.g., think of final graph reshaping which operates on Node opcodes). Also, total number of platform-specific Ideal nodes was low (especially, when compared to Mach nodes generated from AD files). So, keeping relevant code shared and guarding its usages with `Matcher::match_rule_supported()` seems appropriate.

Keeping this in shared code guarded by `Matcher::match_rule_supported()` would not work without a stretch: consider my example regarding `ExtractINode` above. Since `Matcher::match_rule_supported(ExtractINode)` will surely return `true`, we would need yet another `Matcher` method to decide when and how to expand such a node, as it is a peculiarity of x86 that element extraction/insertion operations are only available on 128-bit vectors, and to reach the higher elements we need to extract the corresponding 128-bit lane first. What do you think about keeping the node declarations in shared code but putting the lowering transformations in the backend-specific source files? We can then use prefixes to denote a node being available on a specific backend only.
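To make the GVN argument above concrete, here is a small standalone C++ toy (not HotSpot code; the node kinds and the hash-consing table are invented purely for illustration) showing how a GVN-style hash table commonizes the shared 128-bit lane extraction once the `int` extraction is expanded into two smaller nodes before value numbering instead of at code emission:

```c++
#include <cstdio>
#include <map>
#include <tuple>

// Standalone toy, not HotSpot code: models how a GVN hash table
// commonizes the 128-bit lane extraction once an int extraction is
// lowered into two smaller nodes before value numbering.
enum Op { Vec, Extract128, Extract32 };

struct GVN {
  std::map<std::tuple<Op, int, int>, int> table;  // structural key -> node id
  int next_id = 0;

  // Return the existing node with this structure, or create a new one.
  int make(Op op, int input, int idx) {
    auto key = std::make_tuple(op, input, idx);
    auto it = table.find(key);
    if (it != table.end()) return it->second;      // commonized by GVN
    int id = next_id++;
    table.emplace(key, id);
    std::printf("new node %d: op=%d input=%d idx=%d\n", id, (int)op, input, idx);
    return id;
  }

  // Lowered form of "extract int lane `lane` from a 256-bit vector":
  // first extract the containing 128-bit half, then the element in it.
  int extract_int(int vec, int lane) {
    int half = make(Extract128, vec, lane / 4);    // shared for lanes 4..7
    return make(Extract32, half, lane % 4);
  }
};

int main() {
  GVN gvn;
  int vec = gvn.make(Vec, -1, 0);
  gvn.extract_int(vec, 5);  // creates Extract128(vec, 1) and Extract32(half, 1)
  gvn.extract_int(vec, 6);  // reuses Extract128(vec, 1); only Extract32 is new
}
```

With this shape, extracting lanes 5 and 6 shares the single `Extract128(vec, 1)` node, which is exactly the redundancy that the current expansion at code emission cannot remove.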
> `MacroLogicV` pass is guarded by `C->max_vector_size() > 0` and `Matcher::match_rule_supported(Op_MacroLogicV)` which (1) limits it to AVX512-capable hardware; and (2) ensures that some vector nodes were produced during compilation. It is a coarser-grained check than strictly required, but very effective at detecting when there are no optimization opportunities present.

I don't think this is a concern: enumerating all live nodes once without doing anything is not expensive.

> The idea of gradual IR lowering is not new in C2. There are precedents in the code base, so I'd like to better understand how the new pass improves the overall situation. Introducing a way to perform arbitrary platform-specific transformations on Ideal does sound very powerful, but it also turns Ideal IR into platform-specific dialects which don't have to work with existing transformations (IGVN, in particular).

That's why the pass is intended to be executed only after the general `igvn` phases.

> Do the use cases mentioned so far justify a platform-specific lowering pass on Ideal IR which is intended to produce platform-specific Ideal IR shapes? I don't know yet.

As you have mentioned, we already have platform-specific transformations; the issue is that they are fragmented across shared code. Introducing lowering allows us to consolidate them in one place, with the platform-specific transformations living nicely in platform-specific code. In addition, it allows us to perform more platform-specific transformations in a scalable manner, such as #21244.

> Also, there are alternative places where platform-specific transformations can take place (macro expansion, final graph reshaping, custom matching logic). Worth considering them as well.

Macro expansion would be too early, as we still run platform-independent `igvn` afterwards, while final graph reshaping and custom matching logic would be too late, as the node hash table has already been destroyed by then.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21599#issuecomment-2442875876