emit_no_conflict_block breaks some conditional moves

2005-04-20 Thread Greg McGary
My port failed the DImode part of the rotate regression-tests
(gcc.c-torture/execute/20020508-[123].c).  I found that
emit_no_conflict_block() reordered insns gen'd by
expand_doubleword_shift() in a way that violated dependency between
compares and associated conditional-move insns that had the target
register as destination.  AFAICT, any other port (arc, m32r, v850,
xtensa) that emits a cmpsi followed by movsicc and has no native
DImode shift insns will be subject to this bug also.

Any hints on the proper approach?  My initial idea is to make
emit_no_conflict_block() maintain pairing between cmpsi and movsicc,
which will work as long as cmpsi's operands are never clobbered.
Ultimately, I'll side-step the bug by defining expands or splits for
DImode shifts & rotates, but I'd like to see emit_no_conflict_block()
fixed.

Comments?

Greg


Re: emit_no_conflict_block breaks some conditional moves

2005-04-23 Thread Greg McGary
James E Wilson <[EMAIL PROTECTED]> writes:

> Greg McGary wrote:
> > I found that
> > emit_no_conflict_block() reordered insns gen'd by
> > expand_doubleword_shift() in a way that violated dependency between
> > compares and associated conditional-move insns that had the target
> > register as destination.
> 
> You didn't say precisely what went wrong, but I'd guess you have
>  cmpsi ...
>  movsicc target, ...
>  cmpsi ...
>  movsicc target, ...
> which got reordered to
>  cmpsi ...
>  cmpsi ...
>  movsicc target, ...
>  movsicc target, ...
> which obviously does not work if your condition code register is a
> hard register.

Correct.  FYI, the two "cmpsi" insns are identical and redundant, so
they don't conflict with each other.  However, all bit-logic and shift
insns on this CPU clobber condition codes, and the CC-producing cmpsi
insns are separated from their consumers by CC-clobbering logic & shift
insns.

> Perhaps a check like
>  && GET_MODE_CLASS (GET_MODE (SET_DEST (set))) != MODE_CC
> or maybe check for any hard register
>  && (GET_CODE (SET_DEST (set)) != REG
>  || REGNO (SET_DEST (set)) >= FIRST_PSEUDO_REGISTER)
> Safer is probably to do both checks, so that we only reject CCmode
> hard regs here, e.g.
>  && (GET_MODE_CLASS (GET_MODE (SET_DEST (set))) != MODE_CC
>  || GET_CODE (SET_DEST (set)) != REG
>  || REGNO (SET_DEST (set)) >= FIRST_PSEUDO_REGISTER))
> which should handle the specific case you ran into.

That will do fine for ports that have conditional move, but without
movsicc, you'll have this case:

cmpsi ...
bcc 1f
movsi target, ...
  1:
cmpsi ...
bcc 2f
movsi target, ...
  2:


which without the above fix will be reordered:

cmpsi ...
bcc 1f
  1:
cmpsi ...
bcc 2f
  2:

movsi target, ...
movsi target, ...

while with the above fix, will be reordered:

bcc 1f
  1:  
bcc 2f
  2:

cmpsi ...
movsi target, ...
cmpsi ...
movsi target, ...

Here, the branches and labels need to also travel with the cmpsi and movsi.

Greg
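
The hazard discussed in this thread can be reproduced with a toy model.  The
sketch below is a minimal simulation, not GCC code: it assumes a single
condition-code flag, a compare that sets it, and a conditional move that
reads it.  The names toy_insn, run_toy and the operand values are made up
for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy insn kinds: a compare sets the single CC flag, a cmov reads it. */
enum toy_op { TOY_CMP_LT, TOY_CMOV };

struct toy_insn {
    enum toy_op op;
    int a;   /* CMP: left operand; CMOV: value to move into the target */
    int b;   /* CMP: right operand; unused for CMOV */
};

/* Execute SEQ in order and return the final value of the target register. */
static int run_toy(const struct toy_insn *seq, size_t n, int target)
{
    bool cc = false;
    for (size_t i = 0; i < n; i++) {
        if (seq[i].op == TOY_CMP_LT)
            cc = seq[i].a < seq[i].b;
        else if (cc)
            target = seq[i].a;
    }
    return target;
}

/* Order as emitted: each cmov directly follows its own compare. */
static const struct toy_insn toy_emitted[] = {
    { TOY_CMP_LT, 1, 2 }, { TOY_CMOV, 10, 0 },
    { TOY_CMP_LT, 5, 3 }, { TOY_CMOV, 20, 0 },
};

/* Order after the insns that set the target are moved to the end. */
static const struct toy_insn toy_reordered[] = {
    { TOY_CMP_LT, 1, 2 }, { TOY_CMP_LT, 5, 3 },
    { TOY_CMOV, 10, 0 }, { TOY_CMOV, 20, 0 },
};
```

With an initial target of 0, the emitted order leaves 10 in the target,
while the reordered sequence leaves 0, because both conditional moves see
only the result of the second compare.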


How to use a fast scratchpad-RAM for fill/spill ?

2005-05-11 Thread Greg McGary
I have a port for a multi-processor with high-latency memory accesses,
even for cache hits.  Each CPU core has a small private scratchpad RAM
with 1 cycle access.  I'd like to persuade GCC to use the scratchpad
(I'll probably allocate somewhere between 8 and 32 words) for reload,
rather than stack slots which have much higher latency.  I have some
ill-formed ideas about how to do this, which could involve describing
these as another class of register, only movable in/out of general
registers.  I'm still trying to understand secondary-reload well
enough to determine if that's the mechanism I want.

Comments & suggestions are welcome!  Pithy clues (e.g., "Look at
the port for machine XYZ") are fine.  I can dig-out the details if
given broad hints.

Greg


Re: How to use a fast scratchpad-RAM for fill/spill ?

2005-05-11 Thread Greg McGary
Daniel Jacobowitz <[EMAIL PROTECTED]> writes:

> ... Or you could try telling the entire compiler to treat them as
> registers, instead of just reload.  That's likely to work as well or
> better.

So, I define these as a separate register class, and only the movM
insn patterns get constraints that match them, right?  Anything else?
Should I tack them onto the end of REG_ALLOC_ORDER, or leave them
off?

Greg
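
A rough picture of what "a separate register class" could look like, as a
standalone C sketch rather than a real target header.  The register
numbering (32 general registers followed by 16 scratchpad words), the class
name SCRATCH_REGS and the single 64-bit mask per class are all assumptions
for illustration; GCC's actual REG_CLASS_CONTENTS is laid out differently.

```c
/* Hypothetical numbering: hard regs 0..31 are general registers,
   hard regs 32..47 stand for words of the scratchpad RAM. */
enum { FIRST_SCRATCH_REGNUM = 32, LAST_SCRATCH_REGNUM = 47 };

enum reg_class {
    NO_REGS,
    GENERAL_REGS,
    SCRATCH_REGS,      /* intended to be reachable only by moves */
    ALL_REGS,
    LIM_REG_CLASSES
};

/* One mask per class; bit N is set when hard reg N is a member. */
static const unsigned long long reg_class_contents[LIM_REG_CLASSES] = {
    0x0000000000000000ULL,   /* NO_REGS */
    0x00000000ffffffffULL,   /* GENERAL_REGS: regs 0..31 */
    0x0000ffff00000000ULL,   /* SCRATCH_REGS: regs 32..47 */
    0x0000ffffffffffffULL    /* ALL_REGS */
};

/* Smallest class containing REGNO, in the spirit of REGNO_REG_CLASS. */
static enum reg_class regno_reg_class(int regno)
{
    if (regno < 0 || regno > LAST_SCRATCH_REGNUM)
        return NO_REGS;
    return regno >= FIRST_SCRATCH_REGNUM ? SCRATCH_REGS : GENERAL_REGS;
}

/* Nonzero when hard reg REGNO (0..63) belongs to class C. */
static int class_contains(enum reg_class c, int regno)
{
    return (int)((reg_class_contents[c] >> regno) & 1ULL);
}
```

Only the movM alternatives would then carry a constraint letter mapped to
SCRATCH_REGS, so arithmetic patterns never see those registers directly.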


Insn for direct increment of memory?

2005-09-24 Thread Greg McGary
I'm working with a machine that has a memory-increment insn.  It's a
network-processor performance hack that allows no-latency accumulation
of statistical counters.  The insn sends the increment and address to
the memory controller which does the add, avoiding the usual
long-latency read-increment-write cycle.  I would like to persuade GCC
to emit this insn.  Maybe it could be done in the combiner?  Do any
GCC ports have this feature?

Greg


Re: Insn for direct increment of memory?

2005-09-24 Thread Greg McGary
Paul Brook <[EMAIL PROTECTED]> writes:

> It should just work if you have the appropriate movsi pattern/alternative. 
> m68k has a memory-increment instruction (aka add :-).

Touche.  I've had my head in RISC-land too long...  8^)

G
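
For reference, the source shape in question is an ordinary read-modify-write
on a counter.  This is a minimal sketch; the function name count_event is
made up.  Whether a given port folds it into one memory-destination insn
depends on its patterns and predicates, which is not verified here.

```c
/* A statistics-counter bump of the kind described above.  Without a
   memory-increment insn this is a load, an add and a store. */
void count_event(unsigned int *counter, unsigned int amount)
{
    *counter += amount;
}
```

Compiling this with -O2 -S and looking for a single add with a memory
operand is a quick way to see what the port does with it.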


How to deal with 48-bit pointers and 32-bit integers

2009-08-12 Thread Greg McGary
I'm doing a port for an unusual new machine which is 32-bit RISCy in 
every way, except that it has 48-bit pointers.  Pointers have a 
high-order 16-bit segID and low-order 32-bit seg offset.  Most ALU 
instructions only work on 32 bits, zeroing the upper 16-bit seg ID in 
the result.  A few ALU ops intended for pointers preserve the segID.  
Loads/stores to pointers with segID=0 cause an exception.  The idea is 
to catch bugs where scalars are erroneously used as pointers.  For sake 
of efficiency, GCC can assume that segIDs of pointers are identical for 
pointer arithmetic: there won't be any data objects that span segments, 
and pointer comparisons will always be intra-segment.


I chose to define Pmode as PDImode, and write PDI patterns for pointer 
moves & arithmetic.  POINTER_SIZE is 64 bits, UNITS_PER_WORD is 4.  
FUNCTION_ARG_ADVANCE arranges for both SImode and PDImode values to 
occupy a single register.  I have the port mostly working (passes 90% of 
execution tests), but find myself painted into a corner in some cases.  
What currently vexes me is when GCC wants to promote a PDImode register 
(say r1) to DImode, then needs to truncate down to SImode for some kind 
of ALU op, say pointer subtraction.  The desired quantity is the 
low-order 32 bits of r1, but GCC thinks the promotion to DImode implies 
a pair of 32-bit regs (r1, r2) and since this is a big-endian machine, 
it wants to deliver the low-order bits as the subreg r2.


I now wonder if I can salvage my overall approach, or if I need to do 
things an entirely different way.  I fear I might be in uncharted 
territory since after cursory review, I don't see any existing ports 
where pointers need special handling and are larger than the native int 
size.


Comments?  Advice?

Greg
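
To make the pointer semantics concrete, here is a small C model of the
behavior described above.  It is a sketch under assumptions: the 48-bit
pointer is held in a uint64_t with the segID in bits 32..47, and the helper
names (make_ptr, ptr_add, alu_add, ptr_diff, load_traps) are invented for
illustration, not taken from the port.

```c
#include <stdint.h>

typedef uint64_t ptr48;   /* bits 32..47: segID, bits 0..31: offset */

static ptr48 make_ptr(uint16_t seg, uint32_t off)
{
    return ((uint64_t)seg << 32) | off;
}

static uint16_t seg_of(ptr48 p) { return (uint16_t)(p >> 32); }
static uint32_t off_of(ptr48 p) { return (uint32_t)p; }

/* Pointer-aware add: offset wraps in 32 bits, segID is preserved. */
static ptr48 ptr_add(ptr48 p, int32_t delta)
{
    return make_ptr(seg_of(p), off_of(p) + (uint32_t)delta);
}

/* Ordinary 32-bit ALU add: the segID of the result is zeroed. */
static ptr48 alu_add(ptr48 a, uint32_t b)
{
    return (ptr48)(uint32_t)(off_of(a) + b);
}

/* Pointer subtraction only needs the low-order 32 bits of each operand. */
static int32_t ptr_diff(ptr48 a, ptr48 b)
{
    return (int32_t)(off_of(a) - off_of(b));
}

/* Loads/stores through a pointer with segID == 0 cause an exception. */
static int load_traps(ptr48 p) { return seg_of(p) == 0; }
```

In this model the quantity wanted for pointer subtraction is off_of(r1),
taken from the same register, not the second register of a DImode pair.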



redundant divmodsi4 not optimized away

2010-04-26 Thread Greg McGary
I have a port without div or mod machine instructions.  I wrote 
divmodsi4 patterns that do the libcall directly, hoping that GCC would 
recognize the opportunity to use a single divmodsi4 to compute both 
quotient and remainder.  Alas, GCC calls divmodsi4 twice with the same 
divisor and dividend operands.  Is this supposed to work?  Is there a 
special trick to help the optimizer recognize the redundant insn?  I saw 
the 4yr-old thread regarding picochip's desire for the same effect and 
followed the same approach implemented in the current picochip.md (as 
well as my own approach) but no luck.


G



Re: redundant divmodsi4 not optimized away

2010-04-27 Thread Greg McGary

On 04/26/10 22:09, Ian Lance Taylor wrote:
> Greg McGary writes:
>
> > I have a port without div or mod machine instructions.  I wrote
> > divmodsi4 patterns that do the libcall directly, hoping that GCC would
> > recognize the opportunity to use a single divmodsi4 to compute both
> > quotient and remainder.  Alas, GCC calls divmodsi4 twice with the same
> > divisor and dividend operands.  Is this supposed to work?  Is there a
> > special trick to help the optimizer recognize the redundant insn?  I
> > saw the 4yr-old thread regarding picochip's desire for the same effect
> > and followed the same approach implemented in the current picochip.md
> > (as well as my own approach) but no luck.
>
> Using a divmodsi4 insn instead of divsi3/modsi3 insns ought to work.
> You may need to give more information, such as the test case you are
> using, and what your divmodsi4 insn looks like.
>
> Ian


The test case is __udivmoddi4 from libgcc2.c, specifically
the macro __udiv_qrnnd_c from longlong.h, which does this:

__r1 = (n1) % __d1;
__q1 = (n1) / __d1;

... and this ...

__r0 = __r1 % __d1;
__q0 = __r1 / __d1;

Below is my original insn set.  The __udivmodsi4 libcall accepts
operands in r1/r2, then returns quotient in r4 and remainder in r1.

(define_insn_and_split "udivmodsi4"
  [(set (match_operand:SI 0 "gen_reg_operand" "=r")
(udiv:SI (match_operand:SI 1 "gen_reg_operand" "r")
 (match_operand:SI 2 "gen_reg_operand" "r")))
   (set (match_operand:SI 3 "gen_reg_operand" "=r")
(umod:SI (match_dup 1)
 (match_dup 2)))
   (clobber (reg:SI 1))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:SI 4))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]
  ""
  "#"
  "reload_completed"
  [(set (reg:SI 1)
(match_dup 1))
   (set (reg:SI 2)
(match_dup 2))
   (parallel [(set (reg:SI 4)
   (udiv:SI (reg:SI 1)
(reg:SI 2)))
  (set (reg:SI 1)
   (umod:SI (reg:SI 1)
(reg:SI 2)))
  (clobber (reg:SI 2))
  (clobber (reg:SI 3))
  (clobber (reg:CC CC_REGNUM))
  (clobber (reg:SI RETURN_POINTER_REGNUM))])
   (set (match_dup 0)
(reg:SI 4))
   (set (match_dup 3)
(reg:SI 1))])

(define_insn "*udivmodsi4_libcall"
  [(set (reg:SI 4)
(udiv:SI (reg:SI 1)
 (reg:SI 2)))
   (set (reg:SI 1)
(umod:SI (reg:SI 1)
 (reg:SI 2)))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]
  ""
  "call\\t__udivmodsi4"
  [(set_attr "length" "4")])

Here is an alternative patterned after the approach in picochip.md.  I
had hoped since the picochip guys reported the same trouble four years
ago, the current picochip.md might have the magic bits.

(define_expand "udivmodsi4"
  [(parallel [(set (reg:SI 1)
   (match_operand:SI 1 "gen_reg_operand"  "r"))
  (clobber (reg:CC CC_REGNUM))])
   (parallel [(set (reg:SI 2)
   (match_operand:SI 2 "gen_reg_operand"  "r"))
  (clobber (reg:CC CC_REGNUM))])
   (parallel [(unspec_volatile [(const_int 0)] UNSPEC_UDIVMOD)
  (set (reg:SI 4)
   (udiv:SI (reg:SI 1)
   (reg:SI 2)))
  (set (reg:SI 1)
   (umod:SI (reg:SI 1)
   (reg:SI 2)))
  (clobber (reg:SI 2))
  (clobber (reg:SI 3))
  (clobber (reg:CC CC_REGNUM))
  (clobber (reg:SI RETURN_POINTER_REGNUM))])
   (set (match_operand:SI 0 "gen_reg_operand" "=r")
(reg:SI 4))
   (set (match_operand:SI 3 "gen_reg_operand" "=r")
(reg:SI 1))])

(define_insn "*udivmodsi4_libcall"
  [(unspec_volatile [(const_int 0)] UNSPEC_UDIVMOD)
   (set (reg:SI 4)
(udiv:SI (reg:SI 1)
 (reg:SI 2)))
   (set (reg:SI 1)
(umod:SI (reg:SI 1)
 (reg:SI 2)))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]
  ""
  "call\\t__udivmodsi4"
  [(set_attr "length" "4")])


Alas, neither of them eliminates the redundant libcall.  If no clues
are forthcoming, I'll begin debugging CSE.

G
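
A reduced test case along the lines of the __udiv_qrnnd_c fragment quoted
above might look like the following.  This is a sketch; the function name
udivmod_pair is made up.  Compiling it with -O2 -S and counting the calls
to __udivmodsi4 in the assembly shows whether the two operations were
merged.

```c
/* Quotient and remainder of the same operands, computed back to back,
   in the same order as the longlong.h fragment (remainder first). */
void udivmod_pair(unsigned int n, unsigned int d,
                  unsigned int *q, unsigned int *r)
{
    *r = n % d;
    *q = n / d;
}
```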



Re: redundant divmodsi4 not optimized away

2010-04-28 Thread Greg McGary

On 04/28/10 05:58, Michael Matz wrote:
> On Tue, 27 Apr 2010, Greg McGary wrote:
>
> > (define_insn "*udivmodsi4_libcall"
> >   [(set (reg:SI 4)
> >         (udiv:SI (reg:SI 1)
> >                  (reg:SI 2)))
> >    (set (reg:SI 1)
> >         (umod:SI (reg:SI 1)
> >                  (reg:SI 2)))
> >    (clobber (reg:SI 2))
> >    (clobber (reg:SI 3))
> >    (clobber (reg:CC CC_REGNUM))
> >    (clobber (reg:SI RETURN_POINTER_REGNUM))]
> >   ""
> >   "call\\t__udivmodsi4"
> >   [(set_attr "length" "4")])
>
> So, this pattern uses r2 and clobbers r2+r3.  Two calls in a row can't be
> eliminated because the execution of one destroys one operand of the other
> as far as GCC knows, and the necessary copies to reload the correct value
> into r2 before the second call might confuse combine/CSE/DCE/whatever.  At
> least that would be my theory to start from :)


The libcall insn above appears only after reload, as the result of a 
split.  All the CSE passes occur before reload when the insn pattern is 
this:


  [(set (match_operand:SI 0 "gen_reg_operand" "=r")
(udiv:SI (match_operand:SI 1 "gen_reg_operand" "r")
 (match_operand:SI 2 "gen_reg_operand" "r")))
   (set (match_operand:SI 3 "gen_reg_operand" "=r")
(umod:SI (match_dup 1)
 (match_dup 2)))
   (clobber (reg:SI 1))
   (clobber (reg:SI 2))
   (clobber (reg:SI 3))
   (clobber (reg:SI 4))
   (clobber (reg:CC CC_REGNUM))
   (clobber (reg:SI RETURN_POINTER_REGNUM))]

G



where are caller-save addresses legitimized?

2010-05-05 Thread Greg McGary
reload() > setup_save_areas() > assign_stack_local_1() creates a mem
address whose offset is too large to fit into the machine insn's offset
operand.  Later, reload() > save_call_clobbered_regs() > insert_save() >
adjust_address_1() > change_address_1() asserts because the address is
not legitimate.


My port defines all the address legitimizing target hooks, but none are 
called with the address in question.  Where/how is the address supposed 
to be fixed-up in this case?  Or, where/how does gcc avoid producing an 
illegitimate address in the first place?


G
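
The arithmetic behind "offset too large" can be shown in isolation.  The
sketch below assumes a 12-bit signed offset field, which is an invented
figure, and the helper names are hypothetical.  It does not show where GCC
would perform such a fix-up, only what a legitimate base-plus-offset split
looks like.

```c
#include <stdbool.h>

/* Assumed for illustration: the load/store offset field is 12 bits, signed. */
enum { OFFSET_BITS = 12 };

/* True when OFF can be encoded directly in the insn's offset field. */
static bool offset_fits(long off)
{
    const long limit = 1L << (OFFSET_BITS - 1);
    return off >= -limit && off < limit;
}

struct split_offset {
    long high;   /* added to the base register via a scratch register */
    long low;    /* remains in the insn's offset field */
};

/* Split OFF so that high + low == off and low always fits the field. */
static struct split_offset split_large_offset(long off)
{
    const unsigned long mask = (1UL << OFFSET_BITS) - 1;
    const unsigned long sign = 1UL << (OFFSET_BITS - 1);
    struct split_offset s;

    /* Sign-extend the low OFFSET_BITS bits of OFF. */
    s.low = (long)(((unsigned long)off & mask) ^ sign) - (long)sign;
    s.high = off - s.low;
    return s;
}
```

For a save slot at offset 5000 this gives high = 4096 and low = 904, so
the address becomes (base + 4096) + 904 with an encodable offset.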



Re: where are caller-save addresses legitimized?

2010-05-05 Thread Greg McGary

On 05/05/10 20:21, Jeff Law wrote:
> On 05/05/10 17:45, Greg McGary wrote:
>
> > reload() > setup_save_areas() > assign_stack_local_1() creates a mem
> > address whose offset is too large to fit into the machine insn's offset
> > operand.  Later, reload() > save_call_clobbered_regs() > insert_save() >
> > adjust_address_1() > change_address_1() asserts because the address
> > is not legitimate.
> >
> > My port defines all the address legitimizing target hooks, but none
> > are called with the address in question.  Where/how is the address
> > supposed to be fixed-up in this case?  Or, where/how does gcc avoid
> > producing an illegitimate address in the first place?
>
> I'm not sure they are ever legitimized -- IIRC caller-save tries to only
> generate addressing modes which are safe for precisely this reason.


Apparently not so: caller save is quite capable of producing invalid 
offsets.

Perhaps my port needs some hook to help GCC produce good addresses?
I've been looking, but haven't found it yet...

G



Re: where are caller-save addresses legitimized?

2010-05-07 Thread Greg McGary

On 05/05/10 21:27, Jeff Law wrote:
> On 05/05/10 21:34, Greg McGary wrote:
>
> > On 05/05/10 20:21, Jeff Law wrote:
> >
> > > I'm not sure they are ever legitimized -- IIRC caller-save tries to only
> > > generate addressing modes which are safe for precisely this reason.
> >
> > Apparently not so: caller save is quite capable of producing invalid
> > offsets.
> > Perhaps my port needs some hook to help GCC produce good addresses?
> > I've been looking, but haven't found it yet...
>
> Try != successful :(
>
> You might want to look at this code in init_caller_save:


Unfortunately, that didn't yield any clues.  I'll proceed by building 
some well-established RISCy target and see what it does in similar 
circumstances.


G



insns for register-move between general and floating

2006-03-21 Thread Greg McGary
I'm working on a port that has instructions to move bits between
64-bit floating-point and 64-bit general-purpose regs.  I say "bits"
because there's no conversion between float and int: the bit pattern
is unaltered.  Therefore, it's possible to use scratch FPRs for
spilling GPRs & vice-versa, and float<->int conversions need not go
through memory.

Among all the knobs to turn regarding register classes, reload
classes, and modes+constraints on movM, floatMN2, fixMN2 patterns,
I need some advice on how to do this properly.

Thanks!
Greg
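
The distinction between moving bits and converting values can be shown in
plain C.  This is a minimal sketch, assuming IEEE 754 binary64 for double;
the function names are made up and stand in for what the move and fix
patterns would do.

```c
#include <stdint.h>
#include <string.h>

/* Bit-pattern move FPR -> GPR: the 64 bits are copied unaltered. */
static uint64_t fpr_to_gpr_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Bit-pattern move GPR -> FPR: the inverse of the above. */
static double gpr_to_fpr_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Value conversion, as a fixMN2 pattern would perform: 1.0 becomes 1. */
static int64_t fix_double(double d)
{
    return (int64_t)d;
}
```

A GPR spilled to an FPR and restored goes through the first two functions
and comes back bit-for-bit identical; only the float<->int conversion
patterns change the representation.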


IRA and two-phase load/store

2012-04-27 Thread Greg McGary
I'm working on a port that does loads & stores in two phases.
Every load/store is funneled through the intermediate registers "ld" and "st"
standing between memory and the rest of the register file.

Example:
ld=4(rB)
...
...
rC=ld

st=rD
8(rB)=st

rB is a base address register, rC and rD are data regs.  The ... represents
load delay cycles.

The CPU has only a single instance of "ld", but the machine description
defines five in order to allow overlapping live ranges to pipeline loads.

My mov insn patterns have constraints so that a memory destination pairs with
the "st" register source, and a memory source pairs with "ld" destination
reg.  The trouble is that register allocation doesn't understand the
constraint, so it loads/stores from/to random data registers.

Is there a way to confine register allocation to the "ld" and "st" classes,
or is it better to let IRA do what it wants, then fixup after reload with
splits to turn single insn rC=MEM into the insn pair ld=MEM ... rC=ld ?

Greg
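
The post-reload fix-up option can be pictured as a rewrite over an insn
list.  The sketch below is a toy model, not a GCC split: the insn
representation and the names mem_insn and split_two_phase are invented, and
it ignores the load delay cycles and the choice among the five "ld"
instances.

```c
#include <stddef.h>

enum mem_kind {
    MEM_LOAD,        /* rC = MEM, the single insn the allocator produced */
    MEM_STORE,       /* MEM = rD */
    LD_FROM_MEM,     /* ld = MEM */
    REG_FROM_LD,     /* rC = ld */
    ST_FROM_REG,     /* st = rD */
    MEM_FROM_ST      /* MEM = st */
};

struct mem_insn {
    enum mem_kind kind;
    int data_reg;    /* rC or rD, -1 when unused */
    int base_reg;    /* rB, -1 when unused */
    int offset;
};

/* Expand each single-insn load/store in IN into its two-phase pair.
   OUT must have room for 2 * N entries.  Returns the number written. */
static size_t split_two_phase(const struct mem_insn *in, size_t n,
                              struct mem_insn *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        const struct mem_insn *p = &in[i];
        if (p->kind == MEM_LOAD) {
            out[k++] = (struct mem_insn){ LD_FROM_MEM, -1, p->base_reg, p->offset };
            out[k++] = (struct mem_insn){ REG_FROM_LD, p->data_reg, -1, 0 };
        } else if (p->kind == MEM_STORE) {
            out[k++] = (struct mem_insn){ ST_FROM_REG, p->data_reg, -1, 0 };
            out[k++] = (struct mem_insn){ MEM_FROM_ST, -1, p->base_reg, p->offset };
        } else {
            out[k++] = *p;   /* already two-phase, copy through */
        }
    }
    return k;
}
```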


Re: IRA and two-phase load/store

2012-04-27 Thread Greg McGary
On 04/27/12 14:31, Greg McGary wrote:
> I'm working on a port that does loads & stores in two phases.
> Every load/store is funneled through the intermediate registers "ld" and "st"
> standing between memory and the rest of the register file.
>
> Example:
> ld=4(rB)
> ...
> ...
> rC=ld
>
> st=rD
> 8(rB)=st
>
> rB is a base address register, rC and rD are data regs.  The ... represents
> load delay cycles.
>
> The CPU has only a single instance of "ld", but the machine description
> defines five in order to allow overlapping live ranges to pipeline loads.
>
> My mov insn patterns have constraints so that a memory destination pairs with
> the "st" register source, and a memory source pairs with "ld" destination
> reg.  The trouble is that register allocation doesn't understand the
> constraint, so it loads/stores from/to random data registers.

Clarification: I understand that IRA will do this, but I also thought that
reload was supposed to notice that the insn didn't match its constraints and
emit reg copies in order to fixup.  It doesn't do that for me--postreload just
asserts, complaining that the insn doesn't match its constraints.

> Is there a way to confine register allocation to the "ld" and "st" classes,
> or is it better to let IRA do what it wants, then fixup after reload with
> splits to turn single insn rC=MEM into the insn pair ld=MEM ... rC=ld ?
>
> Greg



Maybe expand MAX_RECOG_ALTERNATIVES ?

2012-05-11 Thread Greg McGary
I'm working on a DSP port whose unit reservations are very sensitive to
operand signature.  E.g., for an assembler mnemonic, there can be 35-50
different combinations of operand register classes, each having different
impacts on latencies and function units.  For assembler code generation, very
few constraint alternatives are needed, but for the DFA pipeline description,
many constraint alternatives could be handy.  The maximum is currently 30, and
the implementation of genattrtab would need surgery to accommodate more.

My question is this: does it make sense to double MAX_RECOG_ALTERNATIVES so
that I can use insn attributes to identify operand signatures, or should I use
another approach?  The advantage is (presumably) lower overhead at scheduling
time--once operands are constrained, then finding the reservation comes
cheaply.  The disadvantage is that constrain_operands() is a pig, and adding
alternatives could slow it down more than it would have cost to have heavier
weight predicates in define_insn_reservation.  Also, having so many
constraints is unwieldy for define_insn, though I have found the editing job
to be reasonable when I work full-screen with 200+ columns :-).

Even if I wanted to expand MAX_RECOG_ALTERNATIVES, if no other port wants or
needs them, then a patch to genattrtab.c might not be welcome.  Before I spend
any time on genattrtab, I'd like to know now if it has any hope of being accepted.

G



Re: Maybe expand MAX_RECOG_ALTERNATIVES ?

2012-05-11 Thread Greg McGary
On 05/11/12 16:00, Greg McGary wrote:

> My question is this: does it make sense to double MAX_RECOG_ALTERNATIVES so
> that I can use insn attributes to identify operand signatures, or should I use
> another approach?

After some exploration, I don't see that another approach is even possible.  The
predicates in define_insn_reservation must be statically evaluated by genattrtab,
so I can't use (match_test "...") or (symbol_ref "..."), where "..." is arbitrary
C code.  Is it true that define_insn_reservation predicates can only use boolean
expressions on (eq_attr ...), or am I missing something?

G



INSN_EXACT_TICK & scheduler backtrack

2012-09-13 Thread Greg McGary
When the timing requirements are not met upon queueing an insn with
INSN_EXACT_TICK, the scheduler backtracks.  This seems wasteful.
Why not prioritize INSN_EXACT_TICK insns so that we queue them
first on the cycle they need?



Dependences for call-preserved regs on exposed pipeline target?

2012-11-25 Thread Greg McGary
I'm working on a port to a VLIW DSP with an exposed pipeline (i.e., no
interlocks).  Some operations OP have as much as 2-cycle latency on values
of the call-preserved regs CPR.  E.g., if the callee's epilogue restores a
CPR in the delay slot of the return instruction, then any OP with that CPR
as input needs to schedule 2 clocks after the call in order to get the
expected value.  If OP schedules immediately after the call, then it will
get the callee's value prior to the epilogue restore.

The easy, low-performance way to solve the problem is to schedule
epilogues to restore CPRs before the return and its delay slot.  The
harder, usually better performing way is to manage dependences in the
caller so that uses of CPRs for OPs that require extra cycles schedule
at sufficient distance from the call.

How shall I introduce these dependences for only the scheduler?  As an
experiment, I added CLOBBERs to the call insn, which created true
dependences between the call and downstream instructions that read the
CPRs, but had the undesired effect of perturbing dataflow across calls.
I'm thinking sched-deps needs new code for targets with
TARGET_SCHED_EXPOSED_PIPELINE to add dependences for call-insn producers
and CPR-user consumers.

Comments?

Greg
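
The constraint the scheduler has to honor can be stated as a small checker
over a finished schedule.  This is a sketch with an invented
representation (sched_insn, first_cpr_hazard); the 2-cycle figure comes
from the description above, everything else is an assumption.

```c
#include <stddef.h>

enum sched_kind { SCHED_CALL, SCHED_CPR_USER, SCHED_OTHER };

struct sched_insn {
    enum sched_kind kind;
    int cycle;            /* issue cycle assigned by the scheduler */
};

/* Cycles that must separate a call from an OP reading a CPR. */
enum { CALL_TO_CPR_USER_LATENCY = 2 };

/* INSNS is sorted by issue cycle.  Return the index of the first CPR
   user issued too soon after the most recent call, or -1 if none. */
static int first_cpr_hazard(const struct sched_insn *insns, size_t n)
{
    int have_call = 0;
    int call_cycle = 0;

    for (size_t i = 0; i < n; i++) {
        if (insns[i].kind == SCHED_CALL) {
            have_call = 1;
            call_cycle = insns[i].cycle;
        } else if (insns[i].kind == SCHED_CPR_USER && have_call
                   && insns[i].cycle - call_cycle < CALL_TO_CPR_USER_LATENCY) {
            return (int)i;
        }
    }
    return -1;
}
```

A dependence with latency 2 from the call to each CPR reader is what would
let the scheduler avoid producing schedules this checker rejects.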


Re: Dependences for call-preserved regs on exposed pipeline target?

2012-11-26 Thread Greg McGary
On 11/25/12 23:33, Maxim Kuvyrkov wrote:
> You essentially need a fix-up pass just before the end of compilation 
> (machine-dependent reorg, if memory serves me right) to space instructions 
> consuming values from CPRs from the CALL_INSNS that set those CPRs.  I.e., 
> for the 99% of compilation you don't care about this restriction, it's only 
> the very last VLIW bundling and delay slot passes that need to know about it.
>
> You, probably, want to make the 2nd scheduler pass run as machine-dependent
> reorg (as ia64 does) and enable an additional constraint (through scheduling
> bypass) for the scheduler DFA to space CALL_INSNs from their consumers by at
> least 2 cycles.  One challenge here is that the scheduler operates on basic
> blocks, and it is difficult to track dependencies across basic block
> boundaries.  To work around the basic-block scope of the scheduler you could
> emit dummy instructions at the beginning of basic blocks that have
> predecessors that end with CALL_INSNs.  These dummy instructions would set
> the appropriate registers (probably just assign the register to itself), and
> you will have a bypass (see define_bypass) between these dummy instructions
> and consumers to guarantee the 2-cycle delay.

Thanks for the advice.  We're already on the same page--I have most of what you
recommend: I only schedule once from machine_dependent_reorg, after splitting
loads/stores, calls/branches into "init" and "fini" phases bound at fixed clock
offsets by record_delay_slot_pair().  I already have a fixup pass to handle
inter-EBB hazards.  (The selective scheduler would handle interblock
automatically, but I had trouble with it initially with split load/stores.  I
want to revisit that.)  Regarding CPRs, I strongly desire to avoid kludgy fixups
for schedules created with an incomplete dependence graph when the generic
scheduler can do the job perfectly with a complete dependence graph.

G


Re: Dependences for call-preserved regs on exposed pipeline target?

2012-11-26 Thread Greg McGary
On 11/26/12 12:46, Maxim Kuvyrkov wrote:

> I wonder if "kludgy fixups" refers to the dummy-instruction solution I 
> mentioned above.  The complete dependence graph is a myth.  You cannot have a 
> complete dependence graph for a function -- scheduler works on DAG regions 
> (and I doubt it will ever support anything more complex), so you would have 
> to do something to account for inter-region dependencies anyway.
>
> It is simpler to have a unified solution that would handle both inter- and 
> intra-region dependencies, rather than implementing two different approaches.

I retract any implication that your bypass proposal is a kludge.  I found using
bypasses to be very compact and effective.  Thanks for the extra nudge.

G


Trouble with powerpc64 mfpgpr patch

2007-07-12 Thread Greg McGary
I extracted the MFPGPR hunks from Peter Bergner's "[PATCH] Add POWER6 
machine description", posted on 2006-11-01 and dropped them into 
gcc-4.0.3, but the result fails with "error: insn does not satisfy its 
constraints":


.../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c: In function '__gcc_qadd':
.../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c:127: error: insn does not satisfy its constraints:
(insn 193 64 152 4 .../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c:110 (set (reg:DF 10 10)
   (plus:DF (reg:DF 34 2 [orig:124 D.1365 ] [124])
   (reg:DF 32 0 [146]))) 177 {*adddf3_fpr} (nil)
   (expr_list:REG_DEAD (reg:DF 34 2 [orig:124 D.1365 ] [124])
   (expr_list:REG_DEAD (reg:DF 32 0 [146])
   (nil
.../src/gcc-4.0.3/gcc/config/rs6000/darwin-ldouble.c:127: internal compiler error: in copyprop_hardreg_forward_1, at regrename.c:1583

Please submit a full bug report,
with preprocessed source if appropriate.

The complaint is about operand[0], which is an integer register with 
DFmode.  FYI, I did my own work to put mftgpr/mffgpr into GCC-4.0.x last 
year, and ran into this same problem.  I solved it by changing the 
operand predicates everywhere I found an "f" constraint so that the 
predicate only allowed FP regs rather than the permissive 
gpc_reg_operand.  Although this worked, I didn't like it because the 
change was very invasive, so when I saw that Peter's patch didn't muck 
with the FP operand predicates, I wanted to use it instead.  Alas, I 
have the same problem with integer registers matching gpc_reg_operand, 
but not satisfying the "f" constraint.  What am I missing?  Is there 
something inhospitable about GCC-4.0 vs. the trunk for Peter's 
changes?   The patch I used is attached.


Thanks, Greg

2006-11-01  Pete Steinmetz  <[EMAIL PROTECTED]>
Peter Bergner  <[EMAIL PROTECTED]>

* config/rs6000/rs6000.md (define_attr "type"): Add mffgpr
and mftgpr attributes.
(floatsidf2,fix_truncdfsi2): use TARGET_MFPGPR.
(fix_truncdfsi2_mfpgpr): New.
(floatsidf_ppc64_mfpgpr): New.
(floatsidf_ppc64): Added !TARGET_MFPGPR condition.
(movdf_hardfloat64_mfpgpr,movdi_mfpgpr): New.
(movdf_hardfloat64): Added !TARGET_MFPGPR condition.
(movdi_internal64): Added !TARGET_MFPGPR and related conditions.
* config/rs6000/rs6000.h (TARGET_MFPGPR): New.
(SECONDARY_MEMORY_NEEDED): Use TARGET_MFPGPR.
(SECONDARY_MEMORY_NEEDED): Added mode!=DFmode and mode!=DImode
conditions.

Index: gcc-4.0.3/gcc/config/rs6000/rs6000.h
===
--- gcc-4.0.3.orig/gcc/config/rs6000/rs6000.h
+++ gcc-4.0.3/gcc/config/rs6000/rs6000.h
@@ -201,6 +201,9 @@ extern int target_flags;
 /* Use single field mfcr instruction.  */
 #define MASK_MFCRF 0x0008
 
+/* Use FP <-> GP register moves.  */
+#define MASK_MFPGPR0x0020
+
 /* The only remaining free bits are 0x0060.  linux64.h uses
0x0010, and sysv4.h uses 0x0080 -> 0x4000.
0x8000 is not available because target_flags is signed.  */
@@ -223,6 +226,7 @@ extern int target_flags;
 #define TARGET_SCHED_PROLOG(target_flags & MASK_SCHED_PROLOG)
 #define TARGET_ALTIVEC (target_flags & MASK_ALTIVEC)
 #define TARGET_AIX_STRUCT_RET  (target_flags & MASK_AIX_STRUCT_RET)
+#define TARGET_MFPGPR  (target_flags & MASK_MFPGPR)
 
 /* Define TARGET_MFCRF if the target assembler supports the optional
field operand for mfcr and the target processor supports the
@@ -234,7 +238,6 @@ extern int target_flags;
 #define TARGET_MFCRF 0
 #endif
 
-
 #define TARGET_32BIT   (! TARGET_64BIT)
 #define TARGET_HARD_FLOAT  (! TARGET_SOFT_FLOAT)
 #define TARGET_UPDATE  (! TARGET_NO_UPDATE)
@@ -365,6 +368,10 @@ extern int target_flags;
N_("Generate single field mfcr instruction")},  \
   {"no-mfcrf", - MASK_MFCRF,   \
N_("Do not generate single field mfcr instruction")},\
+  {"mfpgpr",   MASK_MFPGPR,\
+   N_("Generate moves between floating and general 
registers")},   \
+  {"no-mfpgpr",- MASK_MFPGPR,  
\
+   N_("Do not generate moves between floating and general 
registers")},\
   SUBTARGET_SWITCHES   \
   {"", TARGET_DEFAULT | MASK_SCHED_PROLOG, \
""}}
@@ -1413,12 +1420,18 @@ enum reg_class
   secondary_reload_class (CLASS, MODE, IN)
 
 /* If we are copying between FP or AltiVec registers and anything
-   else, we need a memory location.  */
-
-#define SECONDARY_MEMORY_NEEDED(CLASS1,CLASS2,MODE)\
- ((CLASS1) != (CLASS2) && ((CLASS1) == FLOAT_REGS  \
-  || (CLASS2) == FL

[RISC-V] vector segment load/store width as a riscv_tune_param

2025-03-24 Thread Greg McGary
I am revisiting an effort to make the number of lanes for vector segment
load/store a tunable parameter.

A year ago, Robin added minimal and not-yet-tunable
common_vector_cost::segment_permute_[2-8]

Some issues & questions:

* Since this pertains only to segment load/store, why is the word "permute"
  in the name?

* Nit: why are these defined as individual members rather than an array
  referenced as segment_permute[NF-2]?

* I implemented tuning as a simple threshold for max NF where segment
  load/store is profitable. Test cases for vector segment store pass, but
  tests for load fail. I found that common_vector_cost::segment_permute is
  properly honored in the store case, but not even inspected in the load
  case. I will need to spelunk the autovec cost model. Clues are welcome.

G
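
The array layout and the threshold idea from the bullets above can be
sketched together.  This is a standalone illustration, not the riscv
backend's structure: the type segment_tune, its fields other than the
NF-indexed array, and the penalty value are all assumptions.

```c
/* Segment load/store handles NF = 2..8 fields per element. */
enum { MIN_NF = 2, MAX_NF = 8 };

struct segment_tune {
    int segment_permute[MAX_NF - MIN_NF + 1];  /* indexed by NF - 2 */
    int max_profitable_nf;   /* hypothetical threshold: NF above this is slow */
    int slow_penalty;        /* hypothetical extra cost above the threshold */
};

/* Cost for a segment access with NF fields, or -1 if NF is out of range. */
static int segment_cost(const struct segment_tune *t, int nf)
{
    if (nf < MIN_NF || nf > MAX_NF)
        return -1;

    int cost = t->segment_permute[nf - MIN_NF];
    if (nf > t->max_profitable_nf)
        cost += t->slow_penalty;
    return cost;
}
```

With all base costs at 1, a threshold of 4 and a penalty of 100, NF = 4
costs 1 and NF = 5 costs 101, for loads and stores alike.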


Re: [RISC-V] vector segment load/store width as a riscv_tune_param

2025-03-26 Thread Greg McGary
On Wed, Mar 26, 2025 at 1:44 AM Robin Dapp  wrote:

> > You won't see failures in the testsuite. The failures only show-up when I
> > attempt to impose huge costs on NF above threshold. A quick & dirty way to
> > expose the bug is apply the appended patch, then observe that you get output
> > from this only for mask_struct_store-*.c and not for mask_struct_load-*.c
>
> I suppose that's due to Richi's restructuring of the vector/SLP code.  What
> might work is (untested):
>

It's a winner for my tests! Gracias.

G


Re: [RISC-V] vector segment load/store width as a riscv_tune_param

2025-03-25 Thread Greg McGary
On Tue, Mar 25, 2025 at 2:47 AM Robin Dapp  wrote:


> > A year ago, Robin added minimal and not-yet-tunable
> > common_vector_cost::segment_permute_[2-8]
>
> But it is tunable, just not a param? :)


I meant "param" generically, not necessarily a command-line --param=thingy,
though point taken! :)


> We have our own cost structure in our downstream repo, adjusted to our
> uarch.  I suggest you do the same or upstream a separate cost structure.
> I don't think anybody would object to having several of those, one for
> each uarch (as long as they are sufficiently distinct).
>

Yes, this is what I meant by not-yet-tunable, there is currently no datapath
between -mcpu/-mtune and common_vector_cost::segment_permute_*. All CPUs get
the same hard-coded value of 1 for all segment_permute_* costs.


> BTW, just tangentially related and I don't know how sensitive your uarch
> is to scheduling, but with the x264 SAD and other sched issues we have seen
> you might consider disabling sched1 as well for your uarch?  I know that for
> our uarch we want to keep it on but we surely could have another generic-like
> mtune option that disables it (maybe even generic-ooo and change the current
> generic-ooo to generic-in-order?).  I would expect this to get more common in
> the future anyway.


Thanks for the tip. We will look into it.


> > Some issues & questions:
> >
> > * Since this pertains only to segment load/store, why is the word "permute"
> >   in the name?
>
> The vectorizer already performs costing for the segment loads/stores (IIRC as
> simple loads, though).  At some point the idea was to explicitly model the
> "segment permute/transpose" as a separate operation i.e.
>

This is a different concept, so I ought to introduce a new cost param which
is the threshold value of NF for fast vs. slow.

> > * I implemented tuning as a simple threshold for max NF where segment
> >   load/store is profitable. Test cases for vector segment store pass, but
> >   tests for load fail. I found that common_vector_cost::segment_permute is
> >   properly honored in the store case, but not even inspected in the load
> >   case. I will need to spelunk the autovec cost model. Clues are welcome.
>
> Could you give an example for that?  Might just be a bug.
> Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however I see
> that the cost is adjusted for loads, though.


You won't see failures in the testsuite. The failures only show up when I
attempt to impose huge costs on NF above threshold. A quick & dirty way to
expose the bug is to apply the appended patch, then observe that you get output
from this only for mask_struct_store-*.c and not for mask_struct_load-*.c.

G

--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1140,6 +1140,7 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
  int group_size = segment_loadstore_group_size (kind, stmt_info);
  if (group_size > 1)
{
+   fprintf (stderr, "segment_loadstore_group_size = %d\n", group_size);
  switch (group_size)
{
case 2: