Hi, Richard!

This is Alexey from Samsung Research, and I am interested in FSRA patch, as it 
addresses one of the causes of the #108016 issue.
You might notice it as we've already made some preliminary discussion before.

> void baz (_Complex float *);
> _Complex float bar (_Complex float x)
> {
>    _Complex float k = x;
>    baz (&k);
>    return k;
> }
> 
> where on foo I see expected behavior while for bar we reject to work
> on 'x' because it's not aggregate - that should possibly be improved.

I've tried to cobble something together to add the work with built-in 
complexes. For input ARG_PARTS it works well, while for output value it does 
not always work. This is related to cases where the SSA generates additional 
form for returning value and returns it (e.g. return _5;). SSA_NAME might be 
forced to treated as candidate for FSRA as function return value. But it won't 
be handled by SRA logic because of it is not VAR_P decl.

Additionally, enabling the handling of complex variables or its parts will be 
tightly coupled with whole SRA, as it has a lot of common places in the code 
related to REAL/IMAG logic. This makes me think that adding complex-es to whole 
SRA pass variations is a really good idea, which deserves to be formed in the 
separate feature patch (maybe someone already raised a discussion on it before).

> x86 passes s in a single 64bit integer register and FSRA produces
> two .ARG_PARTS (so it also works when there's not a 1:1 correspondence
> between registers and uses in size). 

For the testcase, the tree-representation before FSRA:

struct S foo (struct S s) {
  struct S D.2965;
  int _1;
  float _2;
  float _3;
  int _4;

  <bb 2> [local count: 1073741824]:
  _1 = s.j;
  _2 = (float) _1;
  _3 = s.x;
  _4 = (int) _3;
  D.2965.x = _2;
  D.2965.j = _4;
  return D.2965;
}

So that different accesses for D.2965.x and D.2965.j, that are being handled 
separately by analyze_access_subtree() and then by generate_subtree_copies(). 
To handle it, SRA could combine/fuse two ARG_PART() calls into one, which we've 
discussed with Jinfu Guo on the previous week. He believes that this is pretty 
good idea but it was not implemented in this patch because:
1. It need more large code re-work, as ARG_PARTs on the same destination 
register may not belong to the same basic-block.
2. To combine the 2+ ARG_PARTs, we may still need one/more pseudo to keep the 
temp value. So the instruction seq might be suboptimal.

I am also not sure, whether it is proper on SRA-level to combine/fuse two 
different scalar accesses to put later them into one .ARG_PART call, as we have 
no idea on tree-level about final H/W registers layout.
So, this could be made in some later improvement patches if possible. If you 
have any clue how it could be in this implementation, may I ask you to share 
some thoughts on it?

> I don't think x86 would ever
> split a structure member between two registers but I don't see the
> FSRA pass taking into account how the incoming arguments split up
> an aggregate?  I am not sure the correct thing is to add those
> as accesses to SRA. 

Yep, I did't observe any case where the GCC or LLVM could handle such 
situations. For example, in the following test-case s.j will be forced to be 
splitted-up bewtween two registers:

struct S {int i; long j; int k; };

long foo (struct S s) {
  return s.j + s.i * s.k;
}

In GCC, VREGS and DSE passes it makes the usage of memory addressing instead. 
If place "long j" to the first place in struct S, VREGS will optimize it on the 
frame, while DSE will replace it with registers. Currently, FSRA also does not 
care about cases when structure member is split between two registers as well, 
which I believe would be logical. And in case when "long j" at the first place, 
FSRA will also handle it as a valid case.

> to get two 64bit reg passing shows we then construct a TImode
> reg as before the patch but we'll still have the .ARG_PARTS?
> I wonder if the code would simplify if it would not need to
> fallback to old expand by making sure we can expand .ARG_PARTS
> appropriately in an efficient way?  Basically try to make SRA
> work on the parts as determined by the calling convention
> rather than the full aggregate as currently RTL expansion
> would create.

In other words, do you mean the maximim size in bytes that we can place all 
ARG_PARTS() / SET_RET_PARTS() on to fit H/W calling conventions?
 
> I wonder about the precompute_register_parameters change,
> can you comment on it?

What I've found during internal testing that after FSRA, function input 
argument registers could be in parallel:BLK mode, like:
(parallel:BLK [
        (expr_list:REG_DEP_TRUE (reg:DI 139)
            (const_int 0 [0]))
        (expr_list:REG_DEP_TRUE (reg:DI 140)
            (const_int 8 [0x8]))
    ])

Which will cause further sanity check assertion in simplify_subreg() during the 
load_register_parameters() of call expansion, where the expanded registers to 
be in such mode. This change is intended fix this situation. The 
expand_call_stmt() routine from cfgexpand.cc works with trees, while 
expand_call() from calls.cc actually operates RTX. I think that's why the 
change was added in precompute_register_parameters(). I would also like to add 
extra (GET_MODE (args[i].value) == BLKmode) check here to indicate the 
intention.
 
> I think the copy_blkmode_to_reg part shows one issue
> with not aligning .ARG_PARTS with incoming reg boundaries?

If I understood correctly, this is done to cover similar ICEs in the code 
generated by BLK register modes from .ARG_PARTS. Alignment may be necessary at 
this level: FSRA don't know know anything about H/W register layout while 
generating ARG_PART, while copy_blkmode_to_reg() just emits group store here 
for BLK-modes. Then later in the expand, we are calling assign_to/from_regs() 
to map/align them on registers.

> I think you are reverse-engineering this in query_position_in_series?
> I'd appreciate somebody more familiar with RTL and RTL expansion
> in particular to review the expansion bits.

In a way yes, as I see from the patch. All register boundary alignments are 
made here. If the optimization is possible, query_position_in_series() reports 
it and assign_to/from_regs() does it. This routine seem to handle more widely 
all cases, not only the logic later covered by copy_blkmode_to_reg() part.

> I do like the overall change, as said - I hope it could be
> simplified by avoiding handling of .ARG_PARTs or .SET_RET_PARTS
> that would touch multiple registers but I'm also not sure
> whether that's needed, say on powerpc.
> 
> Maybe Jeff can run the patch on his tester to cover more
> targets.

It would be great!
Once we will agree on the main items, I also has an ability to do a regression 
and csmith ~30K -cases testing for x86_64, AArch64, RISC-V 64 and MIPS targets; 
and report it there.

So, if/how do you see this patch should be refined?
I am relatively green in GCC internals, so please let me know if I'm a bit 
mixed up.

Best regards,
Merzlyakov Alexey


Reply via email to