On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka <siarhei.siamas...@gmail.com> wrote:
> Hello,
>
> Currently gcc (at least version 4.5.0) does a very poor job generating single
> precision floating point code for ARM Cortex-A8.
>
> The source of this problem is the use of VFP instructions which are run on a
> slow nonpipelined VFP Lite unit in Cortex-A8. Even turning on RunFast mode
> (flush denormals to zero, disable exceptions) just provides a relatively minor
> performance gain.
>
> The right solution seems to be the use of NEON instructions for doing most of
> the single precision calculations.
>
> I wonder if it would be difficult to introduce the following changes to the
> gcc generated code when optimizing for cortex-a8:
> 1. Allocate single precision variables only to evenly or oddly numbered
> s-registers.
> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
> 'vadd.f32 d0, d0, d1' instead.
>
> The number of single precision floating point registers gets effectively
> halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
> (packing/unpacking of register pairs may be needed to ensure proper parameters
> passing to functions). Also there may be other problems, like dealing with
> strict IEEE-754 compliance (maybe a special variable attribute for relaxing
> compliance requirements could be useful). But this looks like the only
> solution to fix poor performance on ARM Cortex-A8 processor.
>
> Actually clang 2.7 seems to be working exactly this way. And it is
> outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single precision
> floating point tests that I tried on ARM Cortex-A8.

On i?86 we have -mfpmath={sse,x87}, I suppose you could add -mfpmath=neon
for arm (properly conflicting with -mfloat-abi=hard and requiring
neon support).

Richard.

> --
> Best regards,
> Siarhei Siamashka
>
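
For reference, here is a minimal C sketch of the kind of scalar single-precision code discussed in this thread. The function name `sum_f32` and the command line in the comment are illustrative assumptions and do not come from the original messages. With VFP code generation, the additions in a loop like this would be expected to become scalar instructions such as 'fadds'; the proposal is to emit 'vadd.f32' on d-registers for the same source. The -mfpmath=neon option is only a suggestion made above, so the sketch does not use it.

```c
#include <stddef.h>

/*
 * Scalar single-precision accumulation: one float add per iteration.
 *
 * Assumed example build for Cortex-A8 (the toolchain name is hypothetical;
 * -mcpu, -mfpu and -mfloat-abi are existing GCC ARM options):
 *
 *   arm-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 -mfpu=neon \
 *       -mfloat-abi=softfp -S sum.c
 *
 * Inspecting the generated assembly shows whether the adds were emitted
 * as VFP ('fadds') or NEON ('vadd.f32') instructions.
 */
float sum_f32(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];        /* the single-precision add in question */
    return acc;
}
```

Note that NEON arithmetic on Cortex-A8 always flushes denormals to zero, which is one reason strict IEEE-754 compliance was raised as a concern for this approach.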