On Fri, Feb 3, 2023 at 10:16 PM Michael Meissner via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> These patches were originally posted on November 10th. Segher has asked
> that I repost them. These patches are somewhat changed since the original
> posting to address some of the comments.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-November/605581.html
>
> In the first patch (adding -mcpu=future), I have taken out the code that
> made -mtune=future act as -mtune=power10. Instead I went through all of
> the places that look at the tuning (mostly in power10.md and rs6000.cc),
> and added future as an option. Obviously at a later time, we will provide
> a separate tuning file for future (or whatever the new name will be if the
> instructions are added officially). But for now, it will suffice.
>
> In patch #3, I fixed the opcode for clearing a dense math register that
> Peter had noticed. I was using the name based on the existing clear
> instruction, instead of the new instruction.
>
> In patch #6, I fixed the code, relying on the changes for setting the
> precision field to 16 bits. Since that patch will not be able to go into
> GCC 13 at present, we might skip that support for now. The important
> thing for existing users of the MMA code is that the support for
> accumulators living in the separate dense math registers, rather than
> overlapping the FPRs, does need to go in; we can probably delay the
> 1,024-bit register support, or implement it in a different fashion.
>
> In the insn names, I tried to switch to using _vsx instead of _fpr for the
> existing MMA support instructions. I also tried to clear up the comments
> to specify ISA 3.1 instead of power10 when talking about the existing MMA
> support.
>
> The following is from the original posting (slightly modified):
>
> This patch is very preliminary support for a potential new feature to the
> PowerPC that extends the current power10 MMA architecture.
> This feature may or may not be present in any specific future PowerPC
> processor.
>
> In the current MMA subsystem for Power10, there are 8 512-bit accumulator
> registers. These accumulators are each tied to sets of 4 FPR registers.
> When you issue a prime instruction, it makes sure the accumulator is a
> copy of the 4 FPR registers the accumulator is tied to. When you issue a
> deprime instruction, it makes sure that the accumulator data content is
> logically copied to the matching FPR registers.
>
> In the potential dense math system, the accumulators are moved to separate
> registers called dense math registers (DM registers or DMR). The DMRs are
> then extended to 1,024 bits and new instructions will be added to deal
> with all 1,024 bits of the DMRs.
>
> If you take existing MMA code, it will work as long as you don't do
> anything with accumulators, and you follow the rules in the ISA 3.1
> documentation for using the MMA subsystem.
>
> These patches add support for the 512-bit accumulators within the dense
> math system, and for allocation of the 1,024-bit DMRs. At this time, no
> additional built-in functions will be added to support any dense math
> features other than data movement between the DMRs and the VSX registers.
> Before we can look at adding any new dense math support other than data
> movement, we need the GCC compiler to be able to allocate and use these
> DMRs.
>
> There are 8 patches in this patch set:
>
> 1) The first patch just adds -mcpu=future as an option to add new support.
> This is similar to the -mcpu=future that we did before power10 was
> announced.
>
> 2) The second patch enables GCC to use the load and store vector pair
> instructions to optimize memory copy operations in the compiler. For
> power10, we needed to just stay with normal vector load/stores for memory
> copy operations.
>
> 3) The third patch enables 512-bit accumulators stored in DMRs.
> This patch enables the register allocation, but it does not move the
> existing MMA to use these registers.
>
> 4) The fourth patch switches the MMA subsystem to use 512-bit accumulators
> within DMRs if you use -mcpu=future.
>
> 5) The fifth patch switches the names of the MMA instructions to use the
> dense math equivalent name if -mcpu=future.
>
> 6) The sixth patch enables using the full 1,024-bit DMRs. Right now, all
> you can do with DMRs is move a VSX register to a DMR register, or move a
> DMR register to a VSX register. [As I mentioned above, at the moment, this
> patch is problematic as is.]
>
> 7) The seventh patch is not DMR related. It adds support for variants of
> the load/store vector with length instruction that may be added in future
> PowerPC processors. These variants eliminate having to shift the byte
> length left by 56 bits.
>
> 8) The eighth patch is also not DMR related. It adds support for a
> saturating subtract operation that may be added to future PowerPC
> processors.
>
> In terms of changes, we now use the wD constraint for accumulators. If you
> compile with -mcpu=power10, the wD constraint will match the equivalent
> VSX register (0..31) that overlaps with the accumulator. If you compile
> with -mcpu=future, the wD constraint will match the DMR register and not
> the FPR register.
>
> This patch also modifies the print_operand %A output modifier to print out
> DMR register numbers if -mcpu=future, and continue to print out the FPR
> register number divided by 4 for -mcpu=power10.
>
> In general, if you only use the built-in functions, things work between
> the two systems. If you use extended asm, you will likely need to modify
> the code. Going forward, hopefully if you modify your code to use the wD
> constraint and %A output modifier, you can write code that switches more
> easily between the two systems.
>
> Again, these are preliminary patches for a potential future machine.
> Things will likely change in terms of implementation and usage over time.

May I ask you to consider delaying this to stage1, exactly because of this
last reason?

Richard.

> --
> Michael Meissner, IBM
> PO Box 98, Ayer, Massachusetts, USA, 01432
> email: meiss...@linux.ibm.com
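
A side note on two pieces of arithmetic in the quoted series. Patch 7 says
the new load/store vector with length variants remove the need to shift the
byte length left by 56 bits, and the %A discussion says that for
-mcpu=power10 the printed number is the FPR register number divided by 4.
The C sketch below only illustrates that arithmetic. The helper names
lxvl_length_operand and acc_from_fpr are made up for illustration and are
not part of GCC or of the patches. The first helper assumes the existing
ISA 3.0 lxvl/stxvl convention of reading the byte count from the most
significant byte of a 64-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build the length operand expected by the existing
   lxvl/stxvl instructions.  They read the byte count from the top byte
   of the 64-bit register (bits 0:7 in IBM bit numbering), so the count
   has to be shifted left by 56 bits first.  */
static inline uint64_t
lxvl_length_operand (uint64_t nbytes)
{
  assert (nbytes <= 0xff);	/* Only 8 bits are available for the count.  */
  return nbytes << 56;
}

/* Hypothetical helper: the accumulator number that the quoted text says
   %A prints for -mcpu=power10, given the first FPR of the group of four
   registers the accumulator is tied to (0, 4, ..., 28).  */
static inline unsigned
acc_from_fpr (unsigned fpr)
{
  assert (fpr < 32 && fpr % 4 == 0);
  return fpr / 4;
}
```

For example, lxvl_length_operand (16) yields 0x1000000000000000 and
acc_from_fpr (28) yields 7, the last of the 8 accumulators.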