Hello,

Currently gcc (at least version 4.5.0) does a very poor job of generating
single precision floating point code for ARM Cortex-A8.

The source of this problem is the use of VFP instructions, which run on a
slow non-pipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
(flush denormals to zero, disable exceptions) provides only a relatively minor
performance gain.
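
For reference, here is a minimal sketch of how RunFast mode could be turned
on from C. The function names are hypothetical. The bit positions (FZ = bit
24, DN = bit 25, trap enables in bits 8-12 and 15) are taken from the ARM
FPSCR layout as I understand it, and setting the default NaN bit in addition
to flush-to-zero is an assumption on my side. The guarded inline assembly
only does anything when building for ARM with VFP enabled.

```c
#include <stdint.h>

/* FPSCR bits relevant to RunFast mode (assumed from the ARM FPSCR layout). */
#define FPSCR_FZ           (1u << 24)   /* flush denormals to zero */
#define FPSCR_DN           (1u << 25)   /* default NaN (assumption: also needed) */
#define FPSCR_TRAP_ENABLES 0x00009F00u  /* IDE, IXE, UFE, OFE, DZE, IOE */

/* Hypothetical helper: compute the FPSCR value with RunFast settings applied. */
uint32_t runfast_fpscr(uint32_t fpscr)
{
    return (fpscr | FPSCR_FZ | FPSCR_DN) & ~FPSCR_TRAP_ENABLES;
}

/* Hypothetical helper: apply the settings to the current thread.
   This is a no-op when not building for ARM with VFP instructions available. */
void enable_runfast(void)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__SOFTFP__)
    uint32_t v;
    __asm__ volatile("vmrs %0, fpscr" : "=r"(v));
    v = runfast_fpscr(v);
    __asm__ volatile("vmsr fpscr, %0" : : "r"(v));
#endif
}
```

Calling enable_runfast() once at program start should be enough for a
single-threaded test, since FPSCR is a per-thread register.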

The right solution seems to be the use of NEON instructions for doing most of
the single precision calculations.

I wonder if it would be difficult to introduce the following changes to the 
gcc generated code when optimizing for cortex-a8:
1. Allocate single precision variables only to even-numbered (or only to
odd-numbered) s-registers, so that each variable gets a d-register of its own.
2. Instead of using 'fadds s0, s0, s2' or similar instructions, use
'vadd.f32 d0, d0, d1'.
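
To illustrate why the two changes above fit together, here is a small
portable C model of the register aliasing (s0/s1 overlap d0, s2/s3 overlap
d1, and so on). All names in it are hypothetical and it is only a sketch of
the idea, not of anything gcc does internally. It shows that when scalars
live in even-numbered s-registers, 'vadd.f32 d0, d0, d1' leaves the same
value in s0 as 'fadds s0, s0, s2' would, while the odd lanes hold don't-care
data.

```c
/* Hypothetical model of the shared VFP/NEON register file:
   s[2n] is the low lane of d<n>, s[2n+1] is the high lane. */
typedef struct {
    float s[32];
} regfile_t;

/* Model of the VFP scalar add 'fadds s<sd>, s<sn>, s<sm>'. */
void model_fadds(regfile_t *r, int sd, int sn, int sm)
{
    r->s[sd] = r->s[sn] + r->s[sm];
}

/* Model of the NEON add 'vadd.f32 d<dd>, d<dn>, d<dm>': both lanes are added. */
void model_vadd_f32(regfile_t *r, int dd, int dn, int dm)
{
    float lo = r->s[2 * dn] + r->s[2 * dm];
    float hi = r->s[2 * dn + 1] + r->s[2 * dm + 1];
    r->s[2 * dd] = lo;
    r->s[2 * dd + 1] = hi;
}

/* Returns 1 if the NEON form produces the same s0 as the VFP form
   for scalars a and b kept in s0 and s2. */
int neon_matches_vfp(float a, float b)
{
    regfile_t vfp = { { 0 } };
    regfile_t neon;

    vfp.s[0] = a;               /* first scalar in s0 (low lane of d0)  */
    vfp.s[2] = b;               /* second scalar in s2 (low lane of d1) */
    neon = vfp;

    model_fadds(&vfp, 0, 0, 2);     /* fadds s0, s0, s2    */
    model_vadd_f32(&neon, 0, 0, 1); /* vadd.f32 d0, d0, d1 */

    return vfp.s[0] == neon.s[0];
}
```

The model ignores the fact that NEON on Cortex-A8 always flushes denormals to
zero, so on real hardware the two forms can differ for denormal inputs.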

The number of usable single precision floating point registers is effectively
halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
(packing/unpacking of register pairs may be needed to ensure proper parameter
passing to functions). There may also be other problems, such as dealing with
strict IEEE-754 compliance (maybe a special variable attribute for relaxing
compliance requirements could be useful). But this looks like the only
way to fix the poor performance on the ARM Cortex-A8 processor.

In fact, clang 2.7 seems to work exactly this way, and it outperforms
gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
floating point tests that I tried on ARM Cortex-A8.

-- 
Best regards,
Siarhei Siamashka
