In integer-dominated code, it is often useful to do block copies through the floating point registers. If suitable alignment is available, 64 bit loads / stores do the copy with half as many memory operations as 32 bit integer loads / stores. If the source is loop invariant, the loads can be hoisted out of the loop; register pressure usually makes this infeasible for integer registers. The destination and, if it is not loop invariant, the source need to be at least 32 bit aligned for this to be profitable, or at least there must be a known constant offset to such an alignment. (At -O3, preconditioning could be used to cover all possible offsets and select the code at run time.) Also, a minimum size is required. The total size need not be aligned, as smaller pieces can be copied in integer registers.
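To make the shape of such a copy concrete, here is a minimal sketch in portable C. It is not GCC code and not the proposed expander: copy_block_64 and copy_mem_ops are hypothetical names, and for simplicity the sketch assumes both pointers are 64 bit aligned rather than handling a known offset. Each 64 bit piece stands in for what would be one floating point load and one floating point store on the target; the tail is finished in smaller pieces, as it would be in integer registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only.  Copies N bytes in 64 bit
   pieces and finishes the unaligned tail in 32 / 16 / 8 bit pieces.
   Assumption: DST and SRC are both 64 bit aligned.  */
void
copy_block_64 (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;

  assert (((uintptr_t) d & 7) == 0 && ((uintptr_t) s & 7) == 0);

  /* Main part: one 64 bit load and one 64 bit store per piece.  */
  for (; n >= 8; n -= 8, d += 8, s += 8)
    {
      uint64_t v;
      memcpy (&v, s, 8);
      memcpy (d, &v, 8);
    }

  /* Tail: the total size need not be a multiple of 64 bits.  */
  if (n >= 4)
    {
      uint32_t v;
      memcpy (&v, s, 4);
      memcpy (d, &v, 4);
      d += 4, s += 4, n -= 4;
    }
  if (n >= 2)
    {
      uint16_t v;
      memcpy (&v, s, 2);
      memcpy (d, &v, 2);
      d += 2, s += 2, n -= 2;
    }
  if (n >= 1)
    *d = *s;
}

/* Number of loads plus stores needed to copy BYTES bytes when every
   memory operation moves WIDTH bytes (hypothetical helper).  */
size_t
copy_mem_ops (size_t bytes, size_t width)
{
  return 2 * ((bytes + width - 1) / width);
}
```

For a 32 byte block, copy_mem_ops gives 16 operations at a width of 4 bytes and 8 at a width of 8 bytes, which is the halving mentioned above. If the source is loop invariant, the load half of those operations is what could be hoisted.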
A testcase for this is the main loop of dhrystone, where the two strings fit into four 64-bit values each (after padding), and cse allows them to fit into five 64-bit values together. Four of these fit into the call saved registers dr12, dr14, xd12 and xd14, thus their loads can be hoisted out of the loop.
The tree of the current function could be examined for heuristics to determine if using floating point registers for block copies makes sense: look for high integer register pressure and low floating point register pressure (for call saved registers, if a loop invariant crosses a call). If the block is relatively short, the heuristics might also take different integer / floating point memory latencies into account, by checking if there appear to be a sufficient number of other instructions to hide some of the latency. Alternatively or additionally, an option and/or parameters used in the heuristics can be used to control the behaviour.
To increase the incidence of suitably aligned copies, constant alignment and data alignment for block copy destinations of suitable size which are defined in the current compilation unit should be increased to 64 bit, and such data items should also be padded to 64 bits. This may be controlled by an invocation option. (If the last 64 bit item would contain no more than 32 bits, and the register pressure is too high to hoist out all loads, padding to fit 8 / 16 / 32 bit is sufficient. The latter padding is useful for integer copies in general.) When doing LTO, this might be expanded to items which are defined in other compilation units, and to special cases of indirect references.
The actual copy is best done exploiting post-increment for load and pre-decrement for store, and is thus highly machine specific. It therefore seems best to do this in sh.c:expand_block_move. Thus, STORE_BY_PIECES_P and MOVE_BY_PIECES_P will have to reject the size and alignment combinations of copies that we want to handle this way.
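The division of labour between the by-pieces macros and the target expander could look roughly like the following sketch. Everything in it is hypothetical: the function names, the FP_COPY_MIN_BYTES threshold (the text above only says that some minimum size is required) and the plain 32 bit alignment test are placeholders, and the register pressure heuristics are left out entirely. The real macros and sh.c are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: smallest block considered worth copying
   through floating point registers.  16 is an arbitrary placeholder.  */
#define FP_COPY_MIN_BYTES 16

/* Round the size of a data item up to a whole number of 64 bit units,
   as suggested for block copy destinations.  */
size_t
pad_to_64 (size_t bytes)
{
  assert (bytes <= SIZE_MAX - 7);
  return (bytes + 7) & ~(size_t) 7;
}

/* Hypothetical predicate: true if a copy of BYTES bytes whose operands
   are known to be ALIGN_BITS aligned is one the floating point
   expander wants to handle.  */
bool
fp_block_copy_candidate (size_t bytes, unsigned align_bits)
{
  return align_bits >= 32 && bytes >= FP_COPY_MIN_BYTES;
}

/* Hypothetical stand-in for the by-pieces decision: GENERIC_OK is what
   the generic rule would have answered; the combinations claimed by
   the floating point expander are rejected.  */
bool
by_pieces_ok (size_t bytes, unsigned align_bits, bool generic_ok)
{
  return generic_ok && !fp_block_copy_candidate (bytes, align_bits);
}
```

With these placeholder values, a 31 byte string is padded to 32 bytes, a 32 byte copy with 32 or 64 bit alignment is claimed by the floating point path, and an 8 byte copy is left to the by-pieces code.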
Due to a quirk in the SH4 specification, we need a third fp_mode value for 64 bit loads / stores (unless FMOVD_WORKS is true). This mode has FPSCR.PR cleared and FPSCR.SZ set. To get the full benefit for copies that are in a loop that does calls, we should fix rtl-optimization/29349 first. When using the -m4-single ABI, the new mode can be generated from the normal mode by issuing one fschg instruction; we can switch back with another fschg instruction. For -m4a or -m4-300, we need both an fpchg and an fschg; -m4 must load the new mode from a third value in fpscr_values.
The actual loads and stores must not look like ordinary SImode or DImode loads and stores, because that would give - via GO_IF_LEGITIMATE_ADDRESS - the wrong message to the optimizers about the available addressing modes. Moreover, POST_INC / PRE_DEC are currently not allowed at rtl generation time. A possible solution is to use patterns that pair the load / store with an explicit set of the address register. I'd prefer to use two match_dup to keep the address register in sync, since otherwise the optimizers can too easily hijack the pattern for something inappropriate. The MEMs should probably use SFmode / DFmode, but be wrapped in an SImode / DImode unspec; however, care must be taken to still get the right alias set for the MEM.
-- 
Summary: should use floating point registers for block copies
Product: gcc
Version: 4.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: amylaar at gcc dot gnu dot org
GCC target triplet: sh4*-*-*
BugsThisDependsOn: 29349
OtherBugsDependingOnThis: 29842
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29969