On 27/04/12 21:24, David Sehr wrote:
> Hello All,
>
> We are using gcc trunk as of 4/27/12, and are attempting to add
> support for Native Client to the ARM gcc compiler. We are trying to
> get gcc -march=armv7-a to use movw/movt consistently instead of
> minipools. The motivation is a new target variant where armv7-a is
> the minimum supported architecture and non-code in .text is never
> allowed (per Native Client rules). But the current behavior also
> looks like a generically poor optimization for -march=armv7-a:
> surely memory loads are slower than movw/movt, and in many cases no
> space is saved. This seems to happen only at -O2 or higher; -O1
> generates movw/movt, seemingly because cprop is folding away a
> LO_SUM/HIGH pair. Another data point: "Ubuntu/Linaro 4.5.2-8ubuntu3"
> does produce movw/movt for this test case, but we haven't tried
> stock 4.5.
>
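[For readers unfamiliar with the two strategies being compared, here is an illustrative fragment (hand-written ARM assembly, not actual compiler output; the register choice is arbitrary) showing the two ways the address of `b` can be materialized:]

```asm
@ Literal-pool form: a pc-relative load, with the constant
@ emitted as a data word inside .text (the "minipool").
        ldr     r1, .L7          @ r1 = &b, loaded from the pool
        @ ...
.L7:
        .word   b                @ data word living in .text

@ movw/movt form (ARMv6T2 and later): the address is built from
@ two 16-bit immediates, so no data needs to live in .text.
        movw    r1, #:lower16:b  @ r1 = low 16 bits of &b
        movt    r1, #:upper16:b  @ set the high 16 bits of &b
```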
It very much depends on your processor. MOVW/MOVT are inherently
serial and are unlikely to issue in parallel on a multi-issue
processor, whereas an LDR is likely to hit the cache and, with
appropriate scheduling, take up only one cycle. Furthermore, there is
a chance of the constant being shared between uses, which can never
happen with MOVW/MOVT.

You also need to consider how you'd handle literals larger than 32
bits. It gets quite expensive to synthesize 64-bit and larger values,
especially if you have an FPU or NEON and want to set the value in a
VFP register; but pushing the values out to non-literal-pool memory is
also costly.

None of the above means that avoiding literal pools is impossible,
just that it's not as simple as using MOVW/MOVT everywhere.

> I have enabled TARGET_USE_MOVT, which should force a large fraction
> of constant materialization to use movw/movt rather than pc-relative
> loads. However, I am still seeing pc-relative loads for the
> following example case and am looking for help from the experts
> here.
>
> int a[1000], b[1000], c[1000];
>
> void foo(int n) {
>   int i;
>   for (i = 0; i < n; ++i) {
>     a[i] = b[i] + c[i];
>   }
> }
>
> When I compile this I get:
>
> foo:
>         ...
>         ldr     r3, .L7
>         ldr     r1, .L7+4
>         ldr     r2, .L7+8
>         ...
> .L7:
>         .word   b
>         .word   c
>         .word   a
>         .size   foo, .-foo
>         .comm   c,4000,4
>         .comm   b,4000,4
>         .comm   a,4000,4
>
> From some investigation, it seems I need to add a define_split to
> convert SYMBOL_REFs to LO_SUM/HIGH pairs. There is already a
> function called arm_split_constant that seems to do this, but no
> rule seems to be firing to cause it to get invoked. Before I dive
> into writing the define_split, am I missing something obvious?

Setting prefer_constant_pool to 0 in a particular processor's tuning
configuration should change the balance -- see the code in arm.[ch].
This is a source-code change -- it's not selectable from the command
line.

R.
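[To make the suggestion above concrete, here is a hypothetical sketch of what such a tuning entry might look like in gcc/config/arm/arm.c. The field names and cost functions shown (arm_9e_rtx_costs, ARM_PREFETCH_NOT_BENEFICIAL, arm_default_branch_cost) come from GCC 4.7-era sources, but the exact struct layout varies between versions -- check the `struct tune_params` definition in arm-protos.h for your tree before copying this:]

```c
/* HYPOTHETICAL per-core tuning entry for a v7-a Native Client
   variant.  Field order is illustrative only; match it against
   struct tune_params in gcc/config/arm/arm-protos.h.  */
const struct tune_params arm_nacl_v7a_tune =
{
  arm_9e_rtx_costs,             /* Insn cost function.  */
  NULL,                         /* Sched adjust cost.  */
  1,                            /* Constant limit.  */
  5,                            /* Max cond insns.  */
  ARM_PREFETCH_NOT_BENEFICIAL,  /* Prefetch model.  */
  false,                        /* prefer_constant_pool: avoid minipools.  */
  arm_default_branch_cost       /* Branch cost function.  */
};
```

[The relevant -mcpu/-mtune entry would then point at this table; with prefer_constant_pool false and TARGET_USE_MOVT in effect, the costing in arm.c should favour movw/movt over pc-relative loads.]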