On Mon, 2 May 2016, Eric Botcazou wrote:

> > The following experiment resulted from looking at making
> > array_ref_low_bound and array_ref_element_size non-mutating. Again
> > I wondered why we do this strange scaling by offset/element alignment.
> 
> The idea is to expose the alignment factor to the RTL expander:
> 
>       tree tem
>         = get_inner_reference (exp, &bitsize, &bitpos, &offset, &mode1,
>                                &unsignedp, &reversep, &volatilep, true);
> 
>       [...]
> 
>           rtx offset_rtx = expand_expr (offset, NULL_RTX, VOIDmode,
>                                         EXPAND_SUM);
> 
>       [...]
> 
>           op0 = offset_address (op0, offset_rtx,
>                                 highest_pow2_factor (offset));
> 
> With the scaling, offset is something like _69 * 4 so highest_pow2_factor
> can see the factor and passes it down to offset_address:
> 
> (gdb) p debug_rtx(op0)
> (mem/c:SI (plus:SI (reg/f:SI 193)
>         (reg:SI 194)) [3 *s.16_63 S4 A32])
> 
> With your patch in the same situation:
> 
> (gdb) p debug_rtx(op0)
> (mem/c:SI (plus:SI (reg/f:SI 139)
>         (reg:SI 116 [ _33 ])) [3 *s.16_63 S4 A8])
> 
> On strict-alignment targets, this makes a big difference, e.g. SPARC:
> 
>         ld      [%i4+%i5], %i0
> 
> vs
> 
>         ldub    [%i5+%i4], %g1
>         sll     %g1, 24, %g1
>         add     %i5, %i4, %i5
>         ldub    [%i5+1], %i0
>         sll     %i0, 16, %i0
>         or      %i0, %g1, %i0
>         ldub    [%i5+2], %g1
>         sll     %g1, 8, %g1
>         or      %g1, %i0, %g1
>         ldub    [%i5+3], %i0
>         or      %i0, %g1, %i0
> 
> Now this is mitigated by a couple of things:
> 
> 1. the above pessimization only happens on the RHS; on the LHS, the
> expander calls highest_pow2_factor_for_target instead of
> highest_pow2_factor and the former takes into account the type's
> alignment thanks to the MAX:
> 
> /* Similar, except that the alignment requirements of TARGET are
>    taken into account.  Assume it is at least as aligned as its
>    type, unless it is a COMPONENT_REF in which case the layout of
>    the structure gives the alignment.  */
> 
> static unsigned HOST_WIDE_INT
> highest_pow2_factor_for_target (const_tree target, const_tree exp)
> {
>   unsigned HOST_WIDE_INT talign = target_align (target) / BITS_PER_UNIT;
>   unsigned HOST_WIDE_INT factor = highest_pow2_factor (exp);
> 
>   return MAX (factor, talign);
> }
> 
> 2. highest_pow2_factor can be rescued by the set_nonzero_bits machinery
> of the SSA CCP pass because it calls tree_ctz.  The above example was
> compiled with -O -fno-tree-ccp on SPARC; at -O, the code isn't
> pessimized.
> 
> > So - the following patch gets rid of that scaling.  For a "simple"
> > C testcase
> > 
> > void bar (void *);
> > void foo (int n)
> > {
> >   struct S { struct R { int b[n]; } a[2]; int k; } s;
> >   s.k = 1;
> >   s.a[1].b[7] = 3;
> >   bar (&s);
> > }
> 
> This only exposes the LHS case, here's a more complete testcase:
> 
> void bar (void *);
> 
> int foo (int n)
> {
>   struct S { struct R { char b[n]; } a[2]; int k; } s;
>   s.k = 1;
>   s.a[1].b[7] = 3;
>   bar (&s);
>   return s.k;
> }
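[Illustration, not from the GCC sources: a simplified, standalone model of
what highest_pow2_factor can conclude for the two offset shapes quoted
above.  With the scaling kept in the IL the offset is a multiplication
like "_69 * 4" and the factor 4 is visible; a bare SSA name with no extra
information yields only 1, which is what degrades the MEM alignment from
A32 to A8.  All names below are made up for the sketch.]

#include <limits.h>
#include <stdio.h>

enum kind { K_CONST, K_MULT, K_UNKNOWN };

struct offset_expr
{
  enum kind kind;
  unsigned long cst;                    /* K_CONST: the constant value   */
  const struct offset_expr *op0, *op1;  /* K_MULT: the two factors       */
};

/* Largest power of two known to divide the value of E.  */
static unsigned long
pow2_factor (const struct offset_expr *e)
{
  switch (e->kind)
    {
    case K_CONST:
      /* Lowest set bit of a nonzero constant; treat 0 as maximally
         aligned.  */
      return e->cst ? (e->cst & -e->cst) : ~(ULONG_MAX >> 1);
    case K_MULT:
      /* Known power-of-two factors of the operands multiply.  */
      return pow2_factor (e->op0) * pow2_factor (e->op1);
    case K_UNKNOWN:
    default:
      /* Nothing is known about a bare SSA name.  */
      return 1;
    }
}

int
main (void)
{
  struct offset_expr four = { K_CONST, 4, 0, 0 };
  struct offset_expr ssa = { K_UNKNOWN, 0, 0, 0 };
  struct offset_expr scaled = { K_MULT, 0, &ssa, &four };  /* _69 * 4  */

  printf ("factor with the scaling kept: %lu\n", pow2_factor (&scaled));
  printf ("factor for a bare SSA name:   %lu\n", pow2_factor (&ssa));
  return 0;
}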
Ok.  Note that on x86_64 at least SLSR wrecks the component-ref case:

@@ -30,10 +34,14 @@
   _9 = _8 + 4;
   s.1_11 = __builtin_alloca_with_align (_9, 32);
   _12 = _7 >> 2;
-  s.1_11->k{off: _12 * 4} = 1;
+  _18 = _12 * 4;
+  _19 = s.1_11 + _18;
+  MEM[(struct S *)_19] = 1;
   s.1_11->a[1]{lb: 0 sz: _5}.b[7] = 3;
   bar (s.1_11);
-  _16 = s.1_11->k{off: _12 * 4};
+  _20 = _12 * 4;
+  _21 = s.1_11 + _20;
+  _16 = MEM[(struct S *)_21];
   __builtin_stack_restore (saved_stack.3_3);
   return _16;

It seems to me that the issue, in the end, is that we compute alignment
from the pieces gathered by get_inner_reference instead of computing it
alongside that information in get_inner_reference itself, taking
advantage of DECL_OFFSET_ALIGN and the array element type's alignment
there.  This would be another opportunity to merge get_inner_reference
and get_object_alignment_2 (or have a common worker).

That we expose the exact division in the IL has unfortunate side-effects
like the above.

Do I understand you correctly that, without using -fno-tree-ccp, there
are currently no regressions?  What about -O0 then?  The code generated
at -O0 on x86_64 is currently quite horrible of course, so maybe we
don't care too much...  I think -Og disables CCP's bit-tracking, though.

I see that the constant offset parts are only sometimes folded into op0
before calling offset_address with the variable offset parts.  Now, with
get_object_alignment_2 computing alignment and misalignment, one could
split the misaligning offset out of bitpos in some way so that op0 is
always aligned.  Something like

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 235706)
+++ gcc/expr.c  (working copy)
@@ -10334,6 +10334,9 @@ expand_expr_real_1 (tree exp, rtx target
              offset_rtx = convert_to_mode (address_mode, offset_rtx, 0);
            }
 
+         get_object_alignment_2 (exp, &align, &misalign, false);
+         bitpos -= misalign;
+
          /* See the comment in expand_assignment for the rationale.  */
          if (mode1 != VOIDmode
              && bitpos != 0
@@ -10348,6 +10351,7 @@ expand_expr_real_1 (tree exp, rtx target
 
          op0 = offset_address (op0, offset_rtx,
                                highest_pow2_factor (offset));
+         set_mem_align (op0, align);
        }
 
       /* If OFFSET is making OP0 more aligned than BIGGEST_ALIGNMENT,

which of course would still need adjustments to get_object_alignment_2
and get_inner_reference to recover alignment info by looking at
DECL_OFFSET_ALIGN and friends.

For the moment (as my original motivation was to get rid of the
requirement to call component_ref_field_offset and friends from the
GIMPLE parts of the compiler, so as to make them non-tree-mutating, but
that has turned out to be impossible, or rather difficult) I'm leaving
things as-is.

Thanks,
Richard.
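[Illustration, not from the GCC sources: a standalone sketch of the CCP
rescue mentioned above.  CCP records, via set_nonzero_bits, a mask of
bits that may be nonzero in an SSA name, and tree_ctz derives a
guaranteed trailing-zero count from that mask; that count is what lets
an alignment factor be recovered even when the offset reaching the
expander is a plain SSA name rather than an explicit multiplication
(the -O vs. -O -fno-tree-ccp difference above).  The helper name below
is made up for the sketch.]

#include <stdint.h>
#include <stdio.h>

/* Trailing zero bits guaranteed for a value whose possibly-nonzero
   bits are given by MASK: every bit clear in the mask is known to be
   zero in the value, so ctz (mask) is a safe lower bound.  */
static unsigned
known_trailing_zeros (uint64_t mask)
{
  if (mask == 0)
    return 64;                          /* the value is known to be 0  */
  return (unsigned) __builtin_ctzll (mask);
}

int
main (void)
{
  /* For _18 = _12 * 4 the pass can record that the low two bits of _18
     are always zero, i.e. a nonzero-bits mask of ...11111100.  */
  uint64_t nonzero_bits = ~(uint64_t) 3;

  printf ("recovered alignment factor: %llu\n",
          (unsigned long long) 1 << known_trailing_zeros (nonzero_bits));
  return 0;
}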