James Greenhalgh <james.greenha...@arm.com> writes: > Hi, > > For this code: > > struct y { > float x[4]; > }; > > float > bar3 (struct y x) > { > return x.x[3]; > } > > GCC generates: > > bar3: > fmov x1, d2 > mov x0, 0 > bfi x0, x1, 0, 32 > fmov x1, d3 > bfi x0, x1, 32, 32 > sbfx x0, x0, 32, 32 > fmov s0, w0 > ret > > If you can wrap your head around that, you'll spot that it could be > simplified to: > > bar3: > fmov s0, s3 > ret
I get the second version with current trunk, at -O and above. Does this only happen with some other modifications, or for certain subtargets? Or maybe I'm testing the wrong thing. > Looking at it, I think the issue is the mode that we assign to the > PARALLEL we build for an HFA in registers. When we get in to > aarch64_layout_arg with a composite, MODE is set to the smallest integer > mode that would contain the size of the composite type. That is to say, in > the example above, MODE will be TImode. > > Looking at the expansion path through assign_parms, we're going to go: > > assign_parms > assign_parm_setup_reg > assign_parm_remove_parallels > emit_group_store > > assign_parm_remove_parallels is going to try to create a REG in MODE, > then construct that REG using the values in the HFA PARALLEL we created. So, > for the example above, we're going to try to create a packed TImode value > built up from each of the four "S" registers we've assigned for the > arguments. Using one of the struct elements is then a 32-bit extract from > the TImode value (then a move back to FP/SIMD registers). This explains > the code-gen in the example. Note that an extract from the TImode value > makes the whole TImode value live, so we can't optimize away the > construction in registers. > > If instead we make the PARALLEL that we create in aarch64_layout_arg BLKmode > then our expansion path is through: > > assign_parms > assign_parm_setup_block > > Which handles creating a stack slot of the right size for our HFA, and > copying it to there. We could then trust the usual optimisers to deal with > the object construction and eliminate it where possible. > > However, we can't just return a BLKmode Parallel, as the mid-end was explictly > asking us to return in MODE, and will eventually ICE given the inconsistency. > One other way we can force these structures to be given BLKmode is through > TARGET_MEMBER_TYPE_FORCES_BLK. Which is what we do in this patch. > > We're going to tell the mid-end that any structure of more than one element > which contains either floating-point or vector data should be set out in > BLKmode rather than a large-enough integer mode. In doing so, we implicitly > fix the issue with HFA layout above. But at what cost! The patch only seems to handle scalar floats, not vectors. Wasn't replying to this to nitpick though :-). I just wondered whether there would be any benefit to having an array_mode hook that returns V4SF for float[4]. (We needed that hook to handle load/store lanes for SVE, so see git branch linaro-dev/sve if you want to try it.) Maybe alignment would be a problem though? Thanks, Richard