Richard Sandiford wrote:

> As I mentioned in:
>
>     http://gcc.gnu.org/ml/gcc/2008-08/msg00476.html
>
> I'd been working on a MIPS IRA port, but got side-tracked by a wrong-code
> regression.
>
> The regression was caused by incorrect EH liveness information. I tried
> to "fix" it by replacing the current note_stores-based forward scan with
> a DF-based backwards scan, which gave us the corrected information for free.
> However, even after Vlad fixed a pessimisation wrt calls and
> DF_REF_MAY_CLOBBER, he saw a significant decrease in SPECFP performance.
> It seemed to me (and I think to Vlad) that the differences were caused
> by the "inverted" ALLOCNO_NUM ordering.  Before the patch, allocno
> indexes typically increased as you went from the beginning of a region
> to the end, whereas it was the other way around after the patch.
>

I checked SPEC2000 again with your patches and there is no performance
regression anymore, although I checked this together with two patches of
mine which I have submitted but which have not been approved yet.

> In other words, a lot of the difference seemed to be down to the
> last line of bucket_allocno_compare_func:
>
> static int
> bucket_allocno_compare_func (const void *v1p, const void *v2p)
> {
>   ira_allocno_t a1 = *(const ira_allocno_t *) v1p;
>   ira_allocno_t a2 = *(const ira_allocno_t *) v2p;
>   int diff, a1_freq, a2_freq, a1_num, a2_num;
>
>   if ((diff = (int) ALLOCNO_COVER_CLASS (a2) - ALLOCNO_COVER_CLASS (a1)) != 0)
>     return diff;
>   get_coalesced_allocnos_attributes (a1, &a1_freq, &a1_num);
>   get_coalesced_allocnos_attributes (a2, &a2_freq, &a2_num);
>   if ((diff = a2_num - a1_num) != 0)
>     return diff;
>   else if ((diff = a1_freq - a2_freq) != 0)
>     return diff;
>   return ALLOCNO_NUM (a2) - ALLOCNO_NUM (a1);
> }
>
> This line is really just a tie-breaker, but changing:
>
>   return ALLOCNO_NUM (a2) - ALLOCNO_NUM (a1);
>
> to:
>
>   return ALLOCNO_NUM (a1) - ALLOCNO_NUM (a2);
>
> has a significant effect on the results.  The question then
> (and now) is: is that Just One Of Those Things?  I realise we're
> dealing with an NP problem, and that we can't expect perfect results.
> So is the influence of the allocno index ordering acceptable?
>

You are right, it is an NP problem and we use heuristics which are not
perfect, but we can improve them.

> I wasn't sure either way, and found a simple case where, with the
> reversed order, the copy costs were actually causing us to pick
> the one register that would induce a copy reload.  Unfortunately,
> that particular testcase no longer exhibits the problem, so I redid
> the experiment on trunk and picked an arbitrary example of where
> the reversed order was causing a difference.  (In this case,
> the reversed order is better, because it avoids a copy.)
>
> It turned out not to be quite the same problem that I'd seen before,
> but the underlying questions about the copy costs are the same.
> It also shows a potential problem with operand ties involving
> hard registers.
>
> I've attached a patch that replaces ALLOCNO_NUM comparisons with
> a new macro, ALLOCNO_COMPARE.  Apply the patch as-is to swap the order.
> Replace:
>
>     #define ALLOCNO_COMPARE(A1, A2) (ALLOCNO_NUM (A2) - ALLOCNO_NUM (A1))
>
> with:
>
>     #define ALLOCNO_COMPARE(A1, A2) (ALLOCNO_NUM (A1) - ALLOCNO_NUM (A2))
>
> to keep the current order.
>
> I've also attached the DF patch, although you don't need it for what follows.
>
> The testcase I picked was gcc/testsuite/gcc.dg/tree-ssa/reassoc-13.c,
> compiled at -Os on x86_64-linux-gnu.  The rtl for the function is:
>

I've reproduced the problem with an additional -fno-tree-reassoc.  By the
way, I found that the old register allocator has the same problem.


> (insn:HI 6 3 7 2 ../../../../new/gcc/testsuite/gcc.dg/tree-ssa/reassoc-13.c:7 (set (reg:DF 61)
>         (mem/u/c/i:DF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [2 S8 A64])) 103 {*movdf_integer_rex64} (expr_list:REG_EQUAL (const_double:DF 5.0e+0 [0x0.ap+3])
>         (nil)))
>
> (insn:HI 7 6 8 2 ../../../../new/gcc/testsuite/gcc.dg/tree-ssa/reassoc-13.c:7 (set (reg/v:DF 58 [ tmp2 ])
>         (plus:DF (reg/v:DF 60 [ a ])
>             (reg:DF 61))) 734 {*fop_df_comm_sse} (nil))
>
> (insn:HI 8 7 9 2 ../../../../new/gcc/testsuite/gcc.dg/tree-ssa/reassoc-13.c:7 (set (reg:DF 63)
>         (minus:DF (reg/v:DF 58 [ tmp2 ])
>             (reg/v:DF 60 [ a ]))) 745 {*fop_df_1_sse} (expr_list:REG_DEAD (reg/v:DF 58 [ tmp2 ])
>         (nil)))
>
> (insn:HI 9 8 11 2 ../../../../new/gcc/testsuite/gcc.dg/tree-ssa/reassoc-13.c:7 (set (reg:DF 64)
>         (plus:DF (reg/v:DF 60 [ a ])
>             (reg:DF 63))) 734 {*fop_df_comm_sse} (expr_list:REG_DEAD (reg:DF 63)
>         (expr_list:REG_DEAD (reg/v:DF 60 [ a ])
>             (nil))))
>
> (insn:HI 16 11 22 2 ../../../../new/gcc/testsuite/gcc.dg/tree-ssa/reassoc-13.c:10 (set (reg/i:DF 21 xmm0)
>         (minus:DF (reg:DF 64)
>             (reg:DF 61))) 745 {*fop_df_1_sse} (expr_list:REG_DEAD (reg:DF 64)
>         (expr_list:REG_DEAD (reg:DF 61)
>             (nil))))
>
> So let's use "Rx" to refer to pseudo register X, "Ax" to refer to
> its allocno, and "Hx" to refer to its allocated hard register.
> Allocnos are popped and allocated in the following orders:
>
>     Original  Reversed
>     --------  --------
>     A60       A60
>     A61       A61
>     A58       A64
>     A63       A63
>     A64       A58
>
> Both orders agree on the following assignments:
>
>     H58 = xmm2
>     H60 = xmm0
>     H61 = xmm1
>     H63 = xmm2
>
> The problem is with the allocation of A64.  R64 is tied twice:
>
>     - to R60 or R63 in insn 9
>     - to xmm0 in insn 16
>
> Both orderings set H60 to xmm0 before allocating A64, so setting
> H64 to xmm0 would satisfy both ties without reloads.  The post-patch
> compiler does this, but the pre-patch one doesn't.
>
> The original order allocates A64 last of all, and since xmm0 is still
> free at that point, the original order ought to have the best chance
> of picking the "right" answer.  But the costs are weighted in favour
> of xmm2 instead:
>
>     Costs when allocating A64 in original order:
>        xmm0 : -187
>        xmm1 : 3000
>        xmm2 : -750
>        xmm3 : 3000
>        ...  : 3000
>
> So the original order picks xmm2 and generates a reload.
>
> The main problem in this particular case is that insn 16 doesn't seem to
> affect the relative cost of assigning xmm0 to A64.  The A64 costs are
> only influenced by insn 9.
>
> Even so, I was surprised that xmm2 was so heavily preferred over xmm0.
> The copies for this function are:
>
>     C1: insn 8: 58 -> 63, freq 1000
>     C2: insn 9: 60 -> 64, freq 1000
>     C3: insn 9: 63 -> 64, freq 1000
>
> (C3 exists -- with the same frequency as C2 -- because PLUS is commutative.)
> And the effect of these copies on the costs is as follows:
>
>     - After assigning xmm0 to A60:
>          C2: cost of xmm0 for A64 -= 3000
>          -->C3: cost of xmm0 for A63 -= 750
>             -->C3: cost of xmm0 for A64 -= 187 [*]
>             -->C1: cost of xmm0 for A58 -= 187
>                -->C1: cost of xmm0 for A63 -= 46 [*]
>
>     - After assigning xmm2 to A58:
>          C1: cost of xmm2 for A63 -= 3000
>          -->C3: cost of xmm2 for A64 -= 750
>             -->C3: cost of xmm2 for A63 -= 187 [*]
>
>     - After assigning xmm2 to A63:
>          C3: cost of xmm2 for A64 -= 3000
>
> How standard are these copy-cost heuristics?  Are they ours, or are they
> commonplace?
>

No, they are not commonplace.  As a matter of fact, I don't know of
another RA which uses dynamically changing costs for each hard register.
So this is a new thing.
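
To make the heuristic more concrete, here is a rough standalone sketch of
the recursive copy-cost update described above.  It is not the actual
ira-color.c code: the type and function names are illustrative, and the
per-hop divisor of 4 is only inferred from the 3000 -> 750 -> 187 -> 46
progression in your trace.

#include <stddef.h>

#define COST_HOP_DIVISOR 4

struct copy;

struct allocno
{
  int assigned_p;               /* nonzero once a hard register is chosen */
  int *hard_reg_costs;          /* per-hard-register cost vector */
  struct copy *copies;          /* list of copies involving this allocno */
};

struct copy
{
  struct allocno *first, *second;
  struct copy *next_first;      /* next copy involving FIRST */
  struct copy *next_second;     /* next copy involving SECOND */
};

/* HARD_REGNO has just been assigned to ALLOCNO; make HARD_REGNO cheaper
   by COST for every unassigned allocno connected to ALLOCNO by a copy,
   then recurse with COST / COST_HOP_DIVISOR.  The recursion can walk
   back through the copy it arrived by, which is what produces the
   double counting marked [*] in the analysis above.  */
static void
update_copy_costs_sketch (struct allocno *allocno, int hard_regno, int cost)
{
  struct copy *cp, *next;
  struct allocno *other;

  for (cp = allocno->copies; cp != NULL; cp = next)
    {
      if (cp->first == allocno)
        {
          other = cp->second;
          next = cp->next_first;
        }
      else
        {
          other = cp->first;
          next = cp->next_second;
        }
      if (!other->assigned_p)
        {
          other->hard_reg_costs[hard_regno] -= cost;
          if (cost / COST_HOP_DIVISOR > 0)
            update_copy_costs_sketch (other, hard_regno,
                                      cost / COST_HOP_DIVISOR);
        }
    }
}

With this shape, the propagation after assigning xmm0 to A60 would be
started as update_copy_costs_sketch (a60, XMM0, 3000), giving the kind of
adjustments listed in your trace.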

> The adjustments marked [*] seem to be double-counting.  E.g. in the
> first case, C2 first causes us to adjust the xmm0 cost of A64 by -3000.
> We then recursively process all other copies involving A64 and adjust
> the costs for the "other side" of those copies by a smaller amount.
> So in this case, C3 causes us to adjust A63's xmm0 cost by -750.
> But we then adjust the copies involving A63, and end up handling
> C2 twice.
>
> This might not matter much if it's done consistently, but I wasn't
> sure whether it was deliberate.  (A comment might be nice if so.)
>

Thanks for the proposal.  I'll add a comment.

> More fundamentally, C3 ends up having more influence than C2,
> even though the copies ostensibly have the same weight.  In other words,
> we have the following copy graph:
>
>     A58 <-> A63 <-> A64 <-> A60
>
> And the allocation of A63 has more influence over A64 than A60 does,
> because of the transitive influence of A58.  But shouldn't this
> transitive influence only kick in when A64 is allocated _between_
> A58 and A63?  Why should the allocation of A58 matter to A64 if
> A63 has already been allocated?
>

I'll think about it and try to improve the heuristics.  The indirect
influence of an allocno's assignment on allocnos not directly connected
to the given allocno was added to improve the generated code.

> This is just a small example.  Couldn't we have a graph like:
>
>     ... <--> A4 <-> A3 <-> A2 <-> A1
>
> where, if things are allocated left-to-right, and a long chain of
> registers including A4 is able to use the same hard register,
> the combined effect of that chain actually has more influence
> over A2 than the choice of either A3 or A1?  I.e. couldn't we
> have cases where the costs make A2 prefer H4 over H3, in cases
> where H4 and H3 are different?
>

From my point of view, the major problem in this case is that the tie
between xmm0 and r64 in insn #16 is ignored.  By the way, such insns are
not typical for x86 (I did not find any in SPEC2000).  First of all, the
cost calculation in regclass (used by the old register allocator) and in
ira-costs.c (used by IRA) does not take such insns into account; it only
takes into account move insns involving a pseudo-register and a hard
register of a small class.  Ideally, the cost for class SSE_FIRST_REG
should be minimal and the allocno for r64 should have SSE_FIRST_REG as
its best register class and SSE_REGS as its alternative register class.
That is not done.

But the patch

http://gcc.gnu.org/ml/gcc-patches/2008-08/msg02279.html

should solve the problem.  I missed only one thing for this test:
instead of `only_regs_p' in ira-conflicts.c::process_regs_for_copy, there
should be `only_regs_p && insn != NULL_RTX' to distinguish move insns
which were already taken into account in ira-costs.c.  With the patch and
that small change, r64 gets xmm0 and the problem disappears, and IRA
generates better code than the old RA.
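
To spell that out, the intended check would look roughly like this.  This
is an illustrative fragment only: the helper name is made up, the real
test sits inside process_regs_for_copy, and rtx/NULL_RTX are assumed to
come from GCC's rtl.h.

/* Hypothetical helper, only to illustrate the condition change.  A copy
   recorded with insn == NULL_RTX does not come from an actual move insn,
   so it has not already been costed by ira-costs.c.  */
static int
copy_from_costed_move_p (int only_regs_p, rtx insn)
{
  /* Previously the condition was just only_regs_p.  */
  return only_regs_p && insn != NULL_RTX;
}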

> It took me quite a while to go through the decisions being made in this
> case, and it was only towards the end that I realised it wasn't quite
> the same situation I'd seen before.  (The situation I'd seen before
> didn't involve a hard register tie like the one in insn 16.)  I can try
> trawling through the other differences to find another example of the
> original problem, but it was all down to the effects of transitive
> copy costs, so I'd be interested to hear comments about whether the
> above is reasonable behaviour or not.
>
> If using DF seems like the Right Thing, we could simply apply both
> patches, which would give a similar allocno order to the one
> we have now.  But it seemed better to look a bit deeper first...
>

Richard, please apply both patches.  As I wrote above, there is no
SPECFP regression anymore with the patches.  They also solve some
testsuite regressions concerning EH.

Thanks for the patch; it is always a pleasure to read such a deep problem
analysis.

Vlad
