emit_no_conflict_block breaks some conditional moves
My port failed the DImode part of the rotate regression tests (gcc.c-torture/execute/20020508-[123].c). I found that emit_no_conflict_block() reordered insns gen'd by expand_doubleword_shift() in a way that violated the dependency between compares and associated conditional-move insns that had the target register as destination. AFAICT, any other port (arc, m32r, v850, xtensa) that emits a cmpsi followed by movsicc and has no native DImode shift insns will be subject to this bug also.

Any hints on the proper approach? My initial idea is to make emit_no_conflict_block() maintain pairing between cmpsi and movsicc, which will work as long as cmpsi's operands are never clobbered. Ultimately, I'll side-step the bug by defining expands or splits for DImode shifts & rotates, but I'd like to see emit_no_conflict_block() fixed.

Comments?

Greg
Re: emit_no_conflict_block breaks some conditional moves
James E Wilson <[EMAIL PROTECTED]> writes:

> Greg McGary wrote:
> > I found that emit_no_conflict_block() reordered insns gen'd by
> > expand_doubleword_shift() in a way that violated dependency between
> > compares and associated conditional-move insns that had the target
> > register as destination.
>
> You didn't say precisely what went wrong, but I'd guess you have
>
>     cmpsi ...
>     movsicc target, ...
>     cmpsi ...
>     movsicc target, ...
>
> which got reordered to
>
>     cmpsi ...
>     cmpsi ...
>     movsicc target, ...
>     movsicc target, ...
>
> which obviously does not work if your condition code register is a
> hard register.

Correct. FYI, the two "cmpsi" insns are identical and redundant, so they don't conflict with each other; however, all bit-logic and shift insns on this CPU clobber condition codes, and the CC-producing cmpsi insns are separated from their consumers by CC-clobbering logic & shift insns.

> Perhaps a check like
>
>     && GET_MODE_CLASS (GET_MODE (SET_DEST (set))) != MODE_CC
>
> or maybe check for any hard register
>
>     && (GET_CODE (SET_DEST (set)) != REG
>         || REGNO (SET_DEST (set)) >= FIRST_PSEUDO_REGISTER)
>
> Safer is probably to do both checks, so that we only reject CCmode
> hard regs here, e.g.
>
>     && (GET_MODE_CLASS (GET_MODE (SET_DEST (set))) != MODE_CC
>         || GET_CODE (SET_DEST (set)) != REG
>         || REGNO (SET_DEST (set)) >= FIRST_PSEUDO_REGISTER)
>
> which should handle the specific case you ran into.

That will do fine for ports that have conditional move, but without movsicc, you'll have this case:

    cmpsi ...
    bcc 1f
    movsi target, ...
    1:
    cmpsi ...
    bcc 2f
    movsi target, ...
    2:

which without the above fix will be reordered:

    cmpsi ...
    bcc 1f
    1:
    cmpsi ...
    bcc 2f
    2:
    movsi target, ...
    movsi target, ...

while with the above fix, will be reordered:

    bcc 1f
    1:
    bcc 2f
    2:
    cmpsi ...
    movsi target, ...
    cmpsi ...
    movsi target, ...

Here, the branches and labels also need to travel with the cmpsi and movsi.

Greg
How to use a fast scratchpad-RAM for fill/spill ?
I have a port for a multi-processor with high-latency memory accesses, even for cache hits. Each CPU core has a small private scratchpad RAM with 1-cycle access. I'd like to persuade GCC to use the scratchpad (I'll probably allocate somewhere between 8 and 32 words) for reload spills, rather than stack slots, which have much higher latency.

I have some ill-formed ideas about how to do this, which could involve describing the scratchpad words as another class of register, only movable in/out of general registers. I'm still trying to understand secondary reload well enough to determine if that's the mechanism I want.

Comments & suggestions are welcome! Pithy clues (e.g., "Look at the port for machine XYZ") are fine. I can dig out the details if given broad hints.

Greg
Re: How to use a fast scratchpad-RAM for fill/spill ?
Daniel Jacobowitz <[EMAIL PROTECTED]> writes:

> ... Or you could try telling the entire compiler to treat them as
> registers, instead of just reload. That's likely to work as well or
> better.

So, I define these as a separate register class, and only the movM insn patterns get constraints that match them, right? Anything else? Should I tack them onto the end of REG_ALLOC_ORDER, or leave them off?

Greg
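[Editor's sketch of what the register-class approach above might look like in a machine description. Every name here is invented for illustration (the class, the constraint letter, the mnemonics); a real port would also need matching REG_CLASS_CONTENTS / REG_CLASS_NAMES entries and register numbers for the scratchpad words.]

```lisp
;; Hypothetical: scratchpad words modeled as hard registers in a class
;; of their own, reachable only through the mov patterns.  Assumes a
;; SCRATCHPAD_REGS class and a constraint letter "h" mapped to it.

(define_register_constraint "h" "SCRATCHPAD_REGS"
  "A word of the 1-cycle scratchpad RAM.")

;; Only the move pattern mentions "h"; every other pattern keeps plain
;; "r" constraints, so the allocator can use scratchpad words only as
;; spill homes reached by register-register copies.
(define_insn "*movsi_internal"
  [(set (match_operand:SI 0 "register_operand" "=r,r,h")
        (match_operand:SI 1 "register_operand" "r,h,r"))]
  ""
  "@
   mov\t%0, %1
   ldpad\t%0, %1
   stpad\t%0, %1")
```

On the REG_ALLOC_ORDER question: listing the scratchpad regs at the end would presumably make them allocation targets of last resort, which is roughly the spill-preference behavior wanted here.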
Insn for direct increment of memory?
I'm working with a machine that has a memory-increment insn. It's a network-processor performance hack that allows no-latency accumulation of statistical counters. The insn sends the increment and address to the memory controller which does the add, avoiding the usual long-latency read-increment-write cycle. I would like to persuade GCC to emit this insn. Maybe it could be done in the combiner? Do any GCC ports have this feature? Greg
Re: Insn for direct increment of memory?
Paul Brook <[EMAIL PROTECTED]> writes:

> It should just work if you have the appropriate movsi pattern/alternative.
> m68k has a memory-increment instruction (aka add :-).

Touche. I've had my head in RISC-land too long... 8^)

G
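[Editor's note: to make Paul's point concrete, the CISC-style trick needs no special pattern at all, just an add whose destination alternative accepts memory; combine then forms the mem-destination add from the load/add/store chain. A hedged sketch with invented mnemonics and constraint letters:]

```lisp
;; Hypothetical addsi3 whose first alternative allows a memory
;; destination tied to operand 1, so "*counter += n" can match a single
;; insn that the assembler emits as the memory-increment operation.
(define_insn "addsi3"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=m,r")
        (plus:SI (match_operand:SI 1 "nonimmediate_operand" "0,0")
                 (match_operand:SI 2 "general_operand" "rI,rmI")))]
  ""
  "@
   meminc\t%0, %2
   add\t%0, %2")
```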
How to deal with 48-bit pointers and 32-bit integers
I'm doing a port for an unusual new machine which is 32-bit RISCy in every way, except that it has 48-bit pointers. Pointers have a high-order 16-bit segID and a low-order 32-bit segment offset. Most ALU instructions only work on 32 bits, zeroing the upper 16-bit segID in the result. A few ALU ops intended for pointers preserve the segID. Loads/stores through pointers with segID=0 cause an exception. The idea is to catch bugs where scalars are erroneously used as pointers. For the sake of efficiency, GCC can assume that segIDs of pointers are identical for pointer arithmetic: there won't be any data objects that span segments, and pointer comparisons will always be intra-segment.

I chose to define Pmode as PDImode, and write PDI patterns for pointer moves & arithmetic. POINTER_SIZE is 64 bits, UNITS_PER_WORD is 4. FUNCTION_ARG_ADVANCE arranges for both SImode and PDImode values to occupy a single register.

I have the port mostly working (it passes 90% of execution tests), but find myself painted into a corner in some cases. What currently vexes me is when GCC wants to promote a PDImode register (say r1) to DImode, then needs to truncate down to SImode for some kind of ALU op, say pointer subtraction. The desired quantity is the low-order 32 bits of r1, but GCC thinks the promotion to DImode implies a pair of 32-bit regs (r1, r2), and since this is a big-endian machine, it wants to deliver the low-order bits as the subreg r2.

I now wonder if I can salvage my overall approach, or if I need to do things an entirely different way. I fear I might be in uncharted territory, since after a cursory review I don't see any existing ports where pointers need special handling and are larger than the native int size.

Comments? Advice?

Greg
redundant divmodsi4 not optimized away
I have a port without div or mod machine instructions. I wrote divmodsi4 patterns that do the libcall directly, hoping that GCC would recognize the opportunity to use a single divmodsi4 to compute both quotient and remainder. Alas, GCC calls divmodsi4 twice with the same divisor and dividend operands. Is this supposed to work? Is there a special trick to help the optimizer recognize the redundant insn? I saw the 4yr-old thread regarding picochip's desire for the same effect and followed the same approach implemented in the current picochip.md (as well as my own approach) but no luck. G
Re: redundant divmodsi4 not optimized away
On 04/26/10 22:09, Ian Lance Taylor wrote:
> Greg McGary writes:
> > I have a port without div or mod machine instructions. I wrote
> > divmodsi4 patterns that do the libcall directly, hoping that GCC would
> > recognize the opportunity to use a single divmodsi4 to compute both
> > quotient and remainder. Alas, GCC calls divmodsi4 twice with the same
> > divisor and dividend operands. Is this supposed to work? Is there a
> > special trick to help the optimizer recognize the redundant insn? I
> > saw the 4yr-old thread regarding picochip's desire for the same effect
> > and followed the same approach implemented in the current picochip.md
> > (as well as my own approach) but no luck.
>
> Using a divmodsi4 insn instead of divsi3/modsi3 insns ought to work.
> You may need to give more information, such as the test case you are
> using, and what your divmodsi4 insn looks like.
>
> Ian

The test case is __udivmoddi4 from libgcc2.c, specifically the macro __udiv_qrnnd_c from longlong.h, which does this:

    __r1 = (n1) % __d1;
    __q1 = (n1) / __d1;

... and this ...

    __r0 = __r1 % __d1;
    __q0 = __r1 / __d1;

Below is my original insn set.
The __udivmodsi4 libcall accepts operands in r1/r2, then returns quotient in r4 and remainder in r1:

(define_insn_and_split "udivmodsi4"
  [(set (match_operand:SI 0 "gen_reg_operand" "=r")
        (udiv:SI (match_operand:SI 1 "gen_reg_operand" "r")
                 (match_operand:SI 2 "gen_reg_operand" "r")))
   (set (match_operand:SI 3 "gen_reg_operand" "=r")
        (umod:SI (match_dup 1) (match_dup 2)))
   (clobber (reg:SI 1))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:SI 4))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]
  ""
  "#"
  "reload_completed"
  [(set (reg:SI 1) (match_dup 1))
   (set (reg:SI 2) (match_dup 2))
   (parallel [(set (reg:SI 4) (udiv:SI (reg:SI 1) (reg:SI 2)))
              (set (reg:SI 1) (umod:SI (reg:SI 1) (reg:SI 2)))
              (clobber (reg:SI 2))
              (clobber (reg:SI 3))
              (clobber (reg:CC CC_REGNUM))
              (clobber (reg:SI RETURN_POINTER_REGNUM))])
   (set (match_dup 0) (reg:SI 4))
   (set (match_dup 3) (reg:SI 1))])

(define_insn "*udivmodsi4_libcall"
  [(set (reg:SI 4) (udiv:SI (reg:SI 1) (reg:SI 2)))
   (set (reg:SI 1) (umod:SI (reg:SI 1) (reg:SI 2)))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]
  ""
  "call\\t__udivmodsi4"
  [(set_attr "length" "4")])

Here is an alternative patterned after the approach in picochip.md. I had hoped that since the picochip guys reported the same trouble four years ago, the current picochip.md might have the magic bits.
(define_expand "udivmodsi4"
  [(parallel [(set (reg:SI 1) (match_operand:SI 1 "gen_reg_operand" "r"))
              (clobber (reg:CC CC_REGNUM))])
   (parallel [(set (reg:SI 2) (match_operand:SI 2 "gen_reg_operand" "r"))
              (clobber (reg:CC CC_REGNUM))])
   (parallel [(unspec_volatile [(const_int 0)] UNSPEC_UDIVMOD)
              (set (reg:SI 4) (udiv:SI (reg:SI 1) (reg:SI 2)))
              (set (reg:SI 1) (umod:SI (reg:SI 1) (reg:SI 2)))
              (clobber (reg:SI 2))
              (clobber (reg:SI 3))
              (clobber (reg:CC CC_REGNUM))
              (clobber (reg:SI RETURN_POINTER_REGNUM))])
   (set (match_operand:SI 0 "gen_reg_operand" "=r") (reg:SI 4))
   (set (match_operand:SI 3 "gen_reg_operand" "=r") (reg:SI 1))])

(define_insn "*udivmodsi4_libcall"
  [(unspec_volatile [(const_int 0)] UNSPEC_UDIVMOD)
   (set (reg:SI 4) (udiv:SI (reg:SI 1) (reg:SI 2)))
   (set (reg:SI 1) (umod:SI (reg:SI 1) (reg:SI 2)))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]
  ""
  "call\\t__udivmodsi4"
  [(set_attr "length" "4")])

Alas, neither of them eliminates the redundant libcall. If no clues are forthcoming, I'll begin debugging CSE.

G
Re: redundant divmodsi4 not optimized away
On 04/28/10 05:58, Michael Matz wrote:
> On Tue, 27 Apr 2010, Greg McGary wrote:
> > (define_insn "*udivmodsi4_libcall"
> >   [(set (reg:SI 4) (udiv:SI (reg:SI 1) (reg:SI 2)))
> >    (set (reg:SI 1) (umod:SI (reg:SI 1) (reg:SI 2)))
> >    (clobber (reg:SI 2))
> >    (clobber (reg:SI 3))
> >    (clobber (reg:CC CC_REGNUM))
> >    (clobber (reg:SI RETURN_POINTER_REGNUM))]
> >   ""
> >   "call\\t__udivmodsi4"
> >   [(set_attr "length" "4")])
>
> So, this pattern uses r2 and clobbers r2+r3. Two calls in a row can't
> be eliminated because the execution of one destroys one operand of the
> other as far as GCC knows, and the necessary copies to reload the
> correct value into r2 before the second call might confuse
> combine/CSE/DCE/whatever. At least that would be my theory to start
> from :)

The libcall insn above appears only after reload, as the result of a split. All the CSE passes occur before reload, when the insn pattern is this:

[(set (match_operand:SI 0 "gen_reg_operand" "=r")
      (udiv:SI (match_operand:SI 1 "gen_reg_operand" "r")
               (match_operand:SI 2 "gen_reg_operand" "r")))
 (set (match_operand:SI 3 "gen_reg_operand" "=r")
      (umod:SI (match_dup 1) (match_dup 2)))
 (clobber (reg:SI 1))
 (clobber (reg:SI 2))
 (clobber (reg:SI 3))
 (clobber (reg:SI 4))
 (clobber (reg:CC CC_REGNUM))
 (clobber (reg:SI RETURN_POINTER_REGNUM))]

G
where are caller-save addresses legitimized?
reload() > setup_save_areas() > assign_stack_local_1() creates a mem address whose offset is too large to fit into the machine insn's offset operand. Later, reload() > save_call_clobbered_regs() > insert_save() > adjust_address_1() > change_address_1() asserts because the address is not legitimate.

My port defines all the address-legitimizing target hooks, but none are called with the address in question. Where/how is the address supposed to be fixed up in this case? Or, where/how does GCC avoid producing an illegitimate address in the first place?

G
Re: where are caller-save addresses legitimized?
On 05/05/10 20:21, Jeff Law wrote:
> On 05/05/10 17:45, Greg McGary wrote:
> > reload() > setup_save_areas() > assign_stack_local_1() creates a mem
> > address whose offset is too large to fit into the machine insn's
> > offset operand. Later, reload() > save_call_clobbered_regs() >
> > insert_save() > adjust_address_1() > change_address_1() asserts
> > because the address is not legitimate. My port defines all the
> > address-legitimizing target hooks, but none are called with the
> > address in question. Where/how is the address supposed to be fixed up
> > in this case? Or, where/how does GCC avoid producing an illegitimate
> > address in the first place?
>
> I'm not sure they are ever legitimized -- IIRC caller-save tries to
> only generate addressing modes which are safe for precisely this
> reason.

Apparently not so: caller-save is quite capable of producing invalid offsets. Perhaps my port needs some hook to help GCC produce good addresses? I've been looking, but haven't found it yet...

G
Re: where are caller-save addresses legitimized?
On 05/05/10 21:27, Jeff Law wrote:
> On 05/05/10 21:34, Greg McGary wrote:
> > On 05/05/10 20:21, Jeff Law wrote:
> > > I'm not sure they are ever legitimized -- IIRC caller-save tries to
> > > only generate addressing modes which are safe for precisely this
> > > reason.
> >
> > Apparently not so: caller-save is quite capable of producing invalid
> > offsets. Perhaps my port needs some hook to help GCC produce good
> > addresses? I've been looking, but haven't found it yet...
>
> Try != successful :( You might want to look at this code in
> init_caller_save:

Unfortunately, that didn't yield any clues. I'll proceed by building some well-established RISCy target and seeing what it does in similar circumstances.

G
insns for register-move between general and floating
I'm working on a port that has instructions to move bits between 64-bit floating-point and 64-bit general-purpose regs. I say "bits" because there's no conversion between float and int: the bit pattern is unaltered. Therefore, it's possible to use scratch FPRs for spilling GPRs & vice-versa, and float<->int conversions need not go through memory. Among all the knobs to turn regarding register classes, reload classes, and modes+constraints on movM, floatMN2, fixMN2 patterns, I need some advice on how to do this properly. Thanks! Greg
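[Editor's sketch of the central piece described above: a move pattern whose alternatives allow direct FPR<->GPR bit copies, so that reload can use a scratch register in either file for spilling and conversions need not round-trip through memory. Mnemonics and the exact alternative set are invented:]

```lisp
;; Hypothetical movdi alternatives.  The "mvfg"/"mvgf" forms move raw
;; bits between the floating and general files; because they appear as
;; ordinary register-register alternatives, reload is free to spill a
;; GPR into a scratch FPR and vice versa without a memory slot.
(define_insn "*movdi_internal"
  [(set (match_operand:DI 0 "nonimmediate_operand" "=r,f,r,f,m,r,m,f")
        (match_operand:DI 1 "general_operand"       "r,f,f,r,r,m,f,m"))]
  ""
  "@
   mov\t%0, %1
   fmov\t%0, %1
   mvfg\t%0, %1
   mvgf\t%0, %1
   st\t%1, %0
   ld\t%0, %1
   fst\t%1, %0
   fld\t%0, %1")
```

The register-class side (whether FLOAT_REGS participates in spilling GENERAL_REGS) would be governed by the usual class-preference and register-move-cost hooks rather than by secondary reload, since no intermediate is needed.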
IRA and two-phase load/store
I'm working on a port that does loads & stores in two phases. Every load/store is funneled through the intermediate registers "ld" and "st" standing between memory and the rest of the register file. Example:

    ld=4(rB)
    ...
    ...
    rC=ld

    st=rD
    8(rB)=st

rB is a base-address register; rC and rD are data regs. The ... represents load delay cycles.

The CPU has only a single instance of "ld", but the machine description defines five in order to allow overlapping live ranges to pipeline loads.

My mov insn patterns have constraints so that a memory destination pairs with the "st" register source, and a memory source pairs with the "ld" destination reg. The trouble is that register allocation doesn't understand the constraint, so it loads/stores from/to random data registers.

Is there a way to confine register allocation to the "ld" and "st" classes, or is it better to let IRA do what it wants, then fix up after reload with splits that turn the single insn rC=MEM into the insn pair ld=MEM ... rC=ld?

Greg
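[Editor's sketch of the fixup-after-reload alternative mentioned at the end of the message. LD_REGNUM and data_reg_operand are placeholder names; delay cycles would be enforced separately by the DFA pipeline description:]

```lisp
;; Hypothetical post-reload split: rewrite a direct rC=MEM load as the
;; two-phase sequence ld=MEM ... rC=ld, using the single hard "ld" reg.
;; IRA/reload are allowed to pretend the load is one insn; the split
;; exposes the real two-phase form once allocation is done.
(define_split
  [(set (match_operand:SI 0 "data_reg_operand" "")
        (match_operand:SI 1 "memory_operand" ""))]
  "reload_completed"
  [(set (reg:SI LD_REGNUM) (match_dup 1))
   (set (match_dup 0) (reg:SI LD_REGNUM))])
```

The store direction would be the mirror image, going through the "st" register before the memory destination.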
Re: IRA and two-phase load/store
On 04/27/12 14:31, Greg McGary wrote: > I'm working on a port that does loads & stores in two phases. > Every load/store is funneled through the intermediate registers "ld" and "st" > standing between memory and the rest of the register file. > > Example: > ld=4(rB) > ... > ... > rC=ld > > st=rD > 8(rB)=st > > rB is a base address register, rC and rD are data regs. The ... represents > load delay cycles. > > The CPU has only a single instance of "ld", but the machine description > defines five in order to allow overlapping live ranges to pipeline loads. > > My mov insn patterns have constraints so that a memory destination pairs with > the "st" register source, and a memory source pairs with "ld" destination > reg. The trouble is that register allocation doesn't understand the > constraint, so it loads/stores from/to random data registers. Clarification: I understand that IRA will do this, but I also thought that reload was supposed to notice that the insn didn't match its constraints and emit reg copies in order to fixup. It doesn't do that for me--postreload just asserts, complaining that the insn doesn't match its constraints. > Is there a way to confine register allocation to the "ld" and "st" classes, > or is it better to let IRA do what it wants, then fixup after reload with > splits to turn single insn rC=MEM into the insn pair ld=MEM ... rC=ld ? > > Greg
Maybe expand MAX_RECOG_ALTERNATIVES ?
I'm working on a DSP port whose unit reservations are very sensitive to operand signature. E.g., for an assembler mnemonic, there can be 35-50 different combinations of operand register classes, each having different impacts on latencies and function units. For assembler code generation, very few constraint alternatives are needed, but for the DFA pipeline description, many constraint alternatives could be handy. The maximum is currently 30, and the implementation of genattrtab would need surgery to accommodate more. My question is this: does it make sense to double MAX_RECOG_ALTERNATIVES so that I can use insn attributes to identify operand signatures, or should I use another approach? The advantage is (presumably) lower overhead at scheduling time--once operands are constrained, then finding the reservation comes cheaply. The disadvantage is that constrain_operands() is a pig, and adding alternatives could slow it down more than it would have cost to have heavier weight predicates in define_insn_reservation. Also, having so many constraints is unwieldy for define_insn, though I have found the editing job to be reasonable when I work full-screen with 200+ columns :-). Even if I wanted to expand MAX_RECOG_ALTERNATIVES, if no other port wants or needs them, then a patch to genattrtab.c might not be welcome. Before I spend any time on genattrtab, I'd like to know now if it has any hope of being accepted. G
Re: Maybe expand MAX_RECOG_ALTERNATIVES ?
On 05/11/12 16:00, Greg McGary wrote: > My question is this: does it make sense to double MAX_RECOG_ALTERNATIVES so > that I can use insn attributes to identify operand signatures, or should I use > another approach? After some exploration, I don't see that another approach is even possible. The predicates in define_insn_reservation must be statically evaluated by genattrtab, so I can't use (match_test "...") or (symbol_ref "..."), where "..." is arbitrary C code. Is it true that define_insn_reservation predicates can only use boolean expressions on (eq_attr ...), or am I missing something? G
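[Editor's illustration of the attribute-only style of condition the message describes: since genattrtab must fold the condition statically, a reservation can only dispatch on attribute values. All names here are invented:]

```lisp
;; An attribute encoding the operand signature, set per-alternative or
;; per-pattern in the define_insns, then dispatched on statically:
(define_attr "opsig" "rr,ri,rm" (const_string "rr"))

(define_insn_reservation "mul_rr" 4
  (and (eq_attr "type" "mul")
       (eq_attr "opsig" "rr"))
  "mul_unit*2")

(define_insn_reservation "mul_rm" 6
  (and (eq_attr "type" "mul")
       (eq_attr "opsig" "rm"))
  "mul_unit*2, mem_unit")
```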
INSN_EXACT_TICK & scheduler backtrack
When the timing requirements are not met upon queueing an insn with INSN_EXACT_TICK, the scheduler backtracks. This seems wasteful. Why not prioritize INSN_EXACT_TICK insns so that we queue them first on the cycle they need?
Dependences for call-preserved regs on exposed pipeline target?
I'm working on a port to a VLIW DSP with an exposed pipeline (i.e., no interlocks). Some operations OP have as much as 2-cycle latency on values of the call-preserved regs CPR. E.g., if the callee's epilogue restores a CPR in the delay slot of the return instruction, then any OP with that CPR as input needs to schedule 2 clocks after the call in order to get the expected value. If OP schedules immediately after the call, then it will get the callee's value prior to the epilogue restore.

The easy, low-performance way to solve the problem is to schedule epilogues to restore CPRs before the return and its delay slot. The harder, usually better-performing way is to manage dependences in the caller so that uses of CPRs by OPs that require extra cycles schedule at sufficient distance from the call.

How shall I introduce these dependences for only the scheduler? As an experiment, I added CLOBBERs to the call insn, which created true dependences between the call and downstream instructions that read the CPRs, but had the undesired effect of perturbing dataflow across calls. I'm thinking sched-deps needs new code for targets with TARGET_SCHED_EXPOSED_PIPELINE to add dependences for call-insn producers and CPR-user consumers.

Comments?

Greg
Re: Dependences for call-preserved regs on exposed pipeline target?
On 11/25/12 23:33, Maxim Kuvyrkov wrote: > You essentially need a fix-up pass just before the end of compilation > (machine-dependent reorg, if memory serves me right) to space instructions > consuming values from CPRs from the CALL_INSNS that set those CPRs. I.e., > for the 99% of compilation you don't care about this restriction, it's only > the very last VLIW bundling and delay slot passes that need to know about it. > > You, probably, want to make the 2nd scheduler pass run as machine-dependent > reorg (as ia64 does) and enable an additional constraint (through scheduling > bypass) for the scheduler DFA to space CALL_INSNs from their consumers for at > least for 2 cycles. One challenge here is that scheduler operates on basic > blocks, and it is difficult to track dependencies across basic block > boundaries. To workaround basic-block scope of the scheduler you could emit > dummy instructions at the beginning of basic blocks that have predecessors > that end with CALL_INSNs. These dummy instructions would set the appropriate > registers (probably just assign the register to itself), and you will have a > bypass (see define_bypass) between these dummy instructions and consumers to > guarantee the 2-cycle delay. Thanks for the advice. We're already on the same page--I have most of what you recommend: I only schedule once from machine_dependent_reorg, after splitting loads/stores, calls/branches into "init" and "fini" phases bound at fixed clock offsets by record_delay_slot_pair(). I already have a fixup pass to handle inter-EBB hazards. (The selective scheduler would handle interblock automatically, but I had trouble with it initially with split load/stores. I want to revisit that.) Regarding CPRs, I strongly desire to avoid kludgy fixups for schedules created with an incomplete dependence graph when the generic scheduler can do the job perfectly with a complete dependence graph. G
Re: Dependences for call-preserved regs on exposed pipeline target?
On 11/26/12 12:46, Maxim Kuvyrkov wrote: > I wonder if "kludgy fixups" refers to the dummy-instruction solution I > mentioned above. The complete dependence graph is a myth. You cannot have a > complete dependence graph for a function -- scheduler works on DAG regions > (and I doubt it will ever support anything more complex), so you would have > to do something to account for inter-region dependencies anyway. > > It is simpler to have a unified solution that would handle both inter- and > intra-region dependencies, rather than implementing two different approaches. I retract any implication that your bypass proposal is a kludge. I found using bypasses to be very compact and effective. Thanks for the extra nudge. G
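[Editor's sketch of the dummy-insn-plus-bypass scheme discussed in this sub-thread. Every name is invented; the marker would be emitted after calls, or at the top of blocks whose predecessors end in calls:]

```lisp
;; Hypothetical marker insn: a self-assignment of a call-preserved reg
;; that "produces" the CPR value but emits no code.
(define_insn "cpr_restore_marker"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (unspec:SI [(match_dup 0)] UNSPEC_CPR_MARKER))]
  ""
  ""
  [(set_attr "type" "cpr_marker")])

(define_insn_reservation "cpr_marker" 1
  (eq_attr "type" "cpr_marker")
  "nothing")

;; Hold consumers of the marker's value at least 2 cycles away,
;; modeling the epilogue-restore-in-delay-slot latency.
(define_bypass 2 "cpr_marker" "alu,load,store")
```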
Trouble with powerpc64 mfpgpr patch
I extracted the MFPGPR hunks from Peter Bergner's "[PATCH] Add POWER6 machine description", posted on 2006-11-01, and dropped them into gcc-4.0.3, but the result fails with "error: insn does not satisfy its constraints":

.../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c: In function '__gcc_qadd':
.../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c:127: error: insn does not satisfy its constraints:
(insn 193 64 152 4 .../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c:110
    (set (reg:DF 10 10)
         (plus:DF (reg:DF 34 2 [orig:124 D.1365 ] [124])
                  (reg:DF 32 0 [146]))) 177 {*adddf3_fpr}
    (nil)
    (expr_list:REG_DEAD (reg:DF 34 2 [orig:124 D.1365 ] [124])
        (expr_list:REG_DEAD (reg:DF 32 0 [146])
            (nil
.../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c:127: internal compiler error: in copyprop_hardreg_forward_1, at regrename.c:1583
Please submit a full bug report, with preprocessed source if appropriate.

The complaint is about operand[0], which is an integer register with DFmode.

FYI, I did my own work to put mftgpr/mffgpr into GCC-4.0.x last year, and ran into this same problem. I solved it by changing the operand predicates everywhere I found an "f" constraint so that the predicate only allowed FP regs rather than the permissive gpc_reg_operand. Although this worked, I didn't like it because the change was very invasive, so when I saw that Peter's patch didn't muck with the FP operand predicates, I wanted to use it instead. Alas, I have the same problem with integer registers matching gpc_reg_operand but not satisfying the "f" constraint.

What am I missing? Is there something inhospitable about GCC-4.0 vs. the trunk for Peter's changes? The patch I used is attached.

Thanks, Greg

2006-11-01  Pete Steinmetz  <[EMAIL PROTECTED]>
            Peter Bergner  <[EMAIL PROTECTED]>

	* config/rs6000/rs6000.md (define_attr "type"): Add mffgpr and
	mftgpr attributes.
	(floatsidf2, fix_truncdfsi2): Use TARGET_MFPGPR.
	(fix_truncdfsi2_mfpgpr): New.
	(floatsidf_ppc64_mfpgpr): New.
	(floatsidf_ppc64): Added !TARGET_MFPGPR condition.
	(movdf_hardfloat64_mfpgpr, movdi_mfpgpr): New.
	(movdf_hardfloat64): Added !TARGET_MFPGPR condition.
	(movdi_internal64): Added !TARGET_MFPGPR and related conditions.
	* config/rs6000/rs6000.h (TARGET_MFPGPR): New.
	(SECONDARY_MEMORY_NEEDED): Use TARGET_MFPGPR.
	(SECONDARY_MEMORY_NEEDED): Added mode!=DFmode and mode!=DImode
	conditions.

Index: gcc-4.0.3/gcc/config/rs6000/rs6000.h
===
--- gcc-4.0.3.orig/gcc/config/rs6000/rs6000.h
+++ gcc-4.0.3/gcc/config/rs6000/rs6000.h
@@ -201,6 +201,9 @@ extern int target_flags;
 /* Use single field mfcr instruction.  */
 #define MASK_MFCRF	0x0008
 
+/* Use FP <-> GP register moves.  */
+#define MASK_MFPGPR	0x0020
+
 /* The only remaining free bits are 0x0060.  linux64.h uses
    0x0010, and sysv4.h uses 0x0080 -> 0x4000.  0x8000 is not
    available because target_flags is signed.  */
@@ -223,6 +226,7 @@ extern int target_flags;
 #define TARGET_SCHED_PROLOG	(target_flags & MASK_SCHED_PROLOG)
 #define TARGET_ALTIVEC		(target_flags & MASK_ALTIVEC)
 #define TARGET_AIX_STRUCT_RET	(target_flags & MASK_AIX_STRUCT_RET)
+#define TARGET_MFPGPR		(target_flags & MASK_MFPGPR)
 
 /* Define TARGET_MFCRF if the target assembler supports the optional
    field operand for mfcr and the target processor supports the
@@ -234,7 +238,6 @@ extern int target_flags;
 #define TARGET_MFCRF 0
 #endif
 
-
 #define TARGET_32BIT		(! TARGET_64BIT)
 #define TARGET_HARD_FLOAT	(! TARGET_SOFT_FLOAT)
 #define TARGET_UPDATE		(! TARGET_NO_UPDATE)
@@ -365,6 +368,10 @@ extern int target_flags;
       N_("Generate single field mfcr instruction")},		\
   {"no-mfcrf",		- MASK_MFCRF,				\
       N_("Do not generate single field mfcr instruction")},	\
+  {"mfpgpr",		MASK_MFPGPR,				\
+      N_("Generate moves between floating and general registers")}, \
+  {"no-mfpgpr",		- MASK_MFPGPR,				\
+      N_("Do not generate moves between floating and general registers")},\
   SUBTARGET_SWITCHES						\
   {"",			TARGET_DEFAULT | MASK_SCHED_PROLOG,	\
       ""}}
@@ -1413,12 +1420,18 @@ enum reg_class
   secondary_reload_class (CLASS, MODE, IN)
 
 /* If we are copying between FP or AltiVec registers and anything
-   else, we need a memory location.  */
-
-#define SECONDARY_MEMORY_NEEDED(CLASS1,CLASS2,MODE)	\
- ((CLASS1) != (CLASS2) && ((CLASS1) == FLOAT_REGS	\
-			   || (CLASS2) == FL
[RISC-V] vector segment load/store width as a riscv_tune_param
I am revisiting an effort to make the number of lanes for vector segment load/store a tunable parameter. A year ago, Robin added the minimal and not-yet-tunable common_vector_cost::segment_permute_[2-8].

Some issues & questions:

* Since this pertains only to segment load/store, why is the word "permute" in the name?
* Nit: why are these defined as individual members rather than an array referenced as segment_permute[NF-2]?
* I implemented tuning as a simple threshold for max NF where segment load/store is profitable. Test cases for vector segment store pass, but tests for load fail. I found that common_vector_cost::segment_permute is properly honored in the store case, but not even inspected in the load case. I will need to spelunk the autovec cost model. Clues are welcome.

G
Re: [RISC-V] vector segment load/store width as a riscv_tune_param
On Wed, Mar 26, 2025 at 1:44 AM Robin Dapp wrote:
> > You won't see failures in the testsuite. The failures only show up
> > when I attempt to impose huge costs on NF above threshold. A quick &
> > dirty way to expose the bug is to apply the appended patch, then
> > observe that you get output from this only for mask_struct_store-*.c
> > and not for mask_struct_load-*.c
>
> I suppose that's due to Richi's restructuring of the vector/SLP code.
> What might work is (untested):

It's a winner for my tests! Gracias.

G
Re: [RISC-V] vector segment load/store width as a riscv_tune_param
On Tue, Mar 25, 2025 at 2:47 AM Robin Dapp wrote:
> > A year ago, Robin added minimal and not-yet-tunable
> > common_vector_cost::segment_permute_[2-8]
>
> But it is tunable, just not a param? :)

I meant "param" generically, not necessarily a command-line --param=thingy, though point taken! :)

> We have our own cost structure in our downstream repo, adjusted to our
> uarch. I suggest you do the same or upstream a separate cost structure.
> I don't think anybody would object to having several of those, one for
> each uarch (as long as they are sufficiently distinct).

Yes, this is what I meant by not-yet-tunable: there is currently no datapath between -mcpu/-mtune and common_vector_cost::segment_permute_*. All CPUs get the same hard-coded value of 1 for all segment_permute_* costs.

> BTW, just tangentially related and I don't know how sensitive your
> uarch is to scheduling, but with the x264 SAD and other sched issues we
> have seen you might consider disabling sched1 as well for your uarch?
> I know that for our uarch we want to keep it on but we surely could
> have another generic-like mtune option that disables it (maybe even
> generic-ooo and change the current generic-ooo to generic-in-order?).
> I would expect this to get more common in the future anyway.

Thanks for the tip. We will look into it.

> > Some issues & questions:
> >
> > * Since this pertains only to segment load/store, why is the word
> >   "permute" in the name?
>
> The vectorizer already performs costing for the segment loads/stores
> (IIRC as simple loads, though). At some point the idea was to
> explicitly model the "segment permute/transpose" as a separate
> operation i.e.

This is a different concept, so I ought to introduce a new cost param which is the threshold value of NF for fast vs. slow.

> > * I implemented tuning as a simple threshold for max NF where segment
> >   load/store is profitable. Test cases for vector segment store pass,
> >   but tests for load fail. I found that
> >   common_vector_cost::segment_permute is properly honored in the store
> >   case, but not even inspected in the load case. I will need to
> >   spelunk the autovec cost model. Clues are welcome.
>
> Could you give an example for that? Might just be a bug.
> Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however
> I see that the cost is adjusted for loads, though.

You won't see failures in the testsuite. The failures only show up when I attempt to impose huge costs on NF above threshold. A quick & dirty way to expose the bug is to apply the appended patch, then observe that you get output from this only for mask_struct_store-*.c and not for mask_struct_load-*.c.

G

--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1140,6 +1140,7 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
       int group_size = segment_loadstore_group_size (kind, stmt_info);
       if (group_size > 1)
 	{
+	  fprintf (stderr, "segment_loadstore_group_size = %d\n", group_size);
 	  switch (group_size)
 	    {
 	    case 2: