On 27/04/12 21:24, David Sehr wrote:
> Hello All,
> 
> We are using gcc trunk as of 4/27/12, and are attempting to add
> support to the ARM gcc compiler for Native Client.
> We are trying to get gcc -march=armv7-a to use movw/movt consistently
> instead of minipools. The motivation is for
> a new target variant where armv7-a is the minimum supported and
> non-code in .text is never allowed (per Native Client rules).
> But the current behavior looks like a generically poor optimization
> for -march=armv7-a.  (Surely memory loads are slower
> than movw/movt, and no space is saved in many cases.)  A further
> detail: this seems to happen only at -O2 or higher.
> -O1 generates movw/movt, seemingly because cprop is folding away a
> LO_SUM/HIGH pair.  Another data point to note
> is that "Ubuntu/Linaro 4.5.2-8ubuntu3" does produce movw/movt for this
> test case, but we haven't tried stock 4.5.
> 

It very much depends on your processor.

MOVW/MOVT are inherently serial and are unlikely to issue in parallel on
a multi-issue processor.
LDR, on the other hand, is likely to hit the cache and thus, with
appropriate scheduling, take up only one cycle.  Furthermore, there is a
chance of the constant being shared between uses, which can never happen
with MOVW/MOVT.
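
As a rough sketch (hand-written, not actual compiler output), the two
ways of materializing the address of 'b' from the example quoted below
look something like this:

        @ MOVW/MOVT form a dependent pair: the MOVT must wait for the MOVW.
        movw    r1, #:lower16:b
        movt    r1, #:upper16:b

        @ The literal-pool form is a single load, normally an L1 hit, and
        @ the .word entry can be shared by every user of the same constant.
        ldr     r1, .LC0
        ...
.LC0:
        .word   b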

You also need to consider how you'd handle literals larger than 32 bits.
It starts to get quite expensive to synthesize 64-bit and larger
values, especially if you have an FPU or NEON and want to set the value
in a VFP register; but pushing the values out to non-literal-pool memory
is also quite expensive.
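
To make that concrete (again just a sketch, with an arbitrary 64-bit
bit pattern), compare a single pc-relative VFP load against building
the value in core registers and then transferring it:

        @ With a literal pool: one load.
        vldr    d0, .LC1
        ...
.LC1:
        .word   0x54442d18      @ low word
        .word   0x400921fb      @ high word

        @ Without a pool: four moves plus a core-to-VFP transfer.
        movw    r2, #0x2d18
        movt    r2, #0x5444
        movw    r3, #0x21fb
        movt    r3, #0x4009
        vmov    d0, r2, r3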

None of the above means that avoiding literal pools is impossible, just
that it's not as simple as using MOVW/MOVT everywhere.

> I have enabled TARGET_USE_MOVT, which should force a large fraction of
> constant materialization to use movw/movt
> rather than pc-relative loads.  However, I am still seeing pc-relative
> loads for the following example case and am looking
> for help from the experts here.
> 
> int a[1000], b[1000], c[1000];
> 
> void foo(int n) {
>   int i;
>   for (i = 0; i < n; ++i) {
>     a[i] = b[i] + c[i];
>   }
> }
> 
> When I compile this I get:
> 
> foo:
>         ...
>         ldr     r3, .L7
>         ldr     r1, .L7+4
>         ldr     r2, .L7+8
>         ...
> .L7:
>         .word   b
>         .word   c
>         .word   a
>         .size   foo, .-foo
>         .comm   c,4000,4
>         .comm   b,4000,4
>         .comm   a,4000,4
> 
> From some investigation, it seems I need to add a define_split to
> convert SYMBOL_REFs to LO_SUM/HIGH pairs.
> There is already a function called arm_split_constant that seems to do
> this, but no rule seems to be firing to cause
> it to get invoked.  Before I dive into writing the define_split, am I
> missing something obvious?
> 

Setting prefer_constant_pool to 0 in a particular processor's tuning
configuration should change the balance -- see the code in arm.[ch].
This is a source-code change -- it's not selectable from the command line.
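
With that flag clear in the tuning struct for your target, the loop
above should come out along the lines of (illustrative, not verified
output):

        movw    r3, #:lower16:b
        movt    r3, #:upper16:b
        movw    r1, #:lower16:c
        movt    r1, #:upper16:c
        movw    r2, #:lower16:a
        movt    r2, #:upper16:a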

R.
