On Thu, Jul 1, 2010 at 5:30 PM, Richard Earnshaw <rearn...@arm.com> wrote:
>
> On Wed, 2010-06-16 at 08:09 -0700, Andrew Pinski wrote:
>>
>> Sent from my iPhone
>>
>> On Jun 16, 2010, at 6:04 AM, Richard Guenther
>> <richard.guent...@gmail.com> wrote:
>>
>> > On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
>> > <siarhei.siamas...@gmail.com> wrote:
>> >> Hello,
>> >>
>> >> Currently gcc (at least version 4.5.0) does a very poor job
>> >> generating single precision floating point code for ARM Cortex-A8.
>> >>
>> >> The source of this problem is the use of VFP instructions which are
>> >> run on a slow nonpipelined VFP Lite unit in Cortex-A8. Even turning
>> >> on RunFast mode (flush denormals to zero, disable exceptions) just
>> >> provides a relatively minor performance gain.
>> >>
>> >> The right solution seems to be the use of NEON instructions for
>> >> doing most of the single precision calculations.
>> >>
>> >> I wonder if it would be difficult to introduce the following
>> >> changes to the gcc generated code when optimizing for cortex-a8:
>> >> 1. Allocate single precision variables only to evenly or oddly
>> >>    numbered s-registers.
>> >> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, do
>> >>    'vadd.f32 d0, d0, d1' instead.
>> >>
>> >> The number of single precision floating point registers gets
>> >> effectively halved this way. Supporting '-mfloat-abi=hard' may be a
>> >> bit tricky (packing/unpacking of register pairs may be needed to
>> >> ensure proper parameters passing to functions). Also there may be
>> >> other problems, like dealing with strict IEEE-754 compliance (maybe
>> >> a special variable attribute for relaxing compliance requirements
>> >> could be useful). But this looks like the only solution to fix poor
>> >> performance on ARM Cortex-A8 processor.
>> >>
>> >> Actually clang 2.7 seems to be working exactly this way. And it is
>> >> outperforming gcc 4.5.0 by up to a factor of 2 or 3 on some single
>> >> precision floating point tests that I tried on ARM Cortex-A8.
>> >
>> > On i?86 we have -mfpmath={sse,x87}, I suppose you could add
>> > -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
>> > and requiring neon support).
>>
>> Except unlike sse, neon does not fully support IEEE support. So this
>> should only be done with -ffast-math :). The point that it is slow is
>> not good enough to change it to be something that is wrong and fast.
>>
>
> We could document -mfpmath=neon as implying fast-math (or at least, no
> denormals and default NaNs). If the user explicitly asks for floating
> point to be done via the Neon unit, it would be somewhat churlish to say
> "I don't believe you know what you're asking for... so I'll ignore you".

It's certainly reasonable to do that when the limitations are documented.
That does not mean enabling -ffast-math, but simply stating the correct
facts for the MODE_HAS_* macros in real.h.

Richard.
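
For readers following the "no denormals" point in the thread, here is a
minimal sketch in C of what flush-to-zero arithmetic means for a single
precision add. It models the behaviour in software so it runs on any host.
The helpers flush_to_zero and add_ftz are hypothetical names made up for
this illustration; they are not part of GCC, real.h, or any NEON API, and
keeping the sign of a flushed value is an assumption of this model.

```c
#include <float.h>

/* Hypothetical helper: model flush-to-zero in software.
   A nonzero value smaller in magnitude than FLT_MIN is subnormal
   and is replaced by a zero of the same sign (sign handling is an
   assumption of this sketch). Normal values, zeros, infinities and
   NaNs pass through unchanged. */
static float flush_to_zero(float x)
{
    if (x != 0.0f && x > -FLT_MIN && x < FLT_MIN)
        return x < 0.0f ? -0.0f : 0.0f;
    return x;
}

/* Hypothetical helper: add two floats the way a flush-to-zero unit
   might, flushing both inputs and then the result. */
static float add_ftz(float a, float b)
{
    return flush_to_zero(flush_to_zero(a) + flush_to_zero(b));
}
```

With IEEE arithmetic, FLT_MIN - 1.5f * FLT_MIN gives a small nonzero
subnormal; with the model above, add_ftz(FLT_MIN, -1.5f * FLT_MIN) gives
zero. That difference is the kind of fact the compiler would need to know
about the mode before it can optimise safely.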