On Thu, Jul 1, 2010 at 5:30 PM, Richard Earnshaw <rearn...@arm.com> wrote:
>
> On Wed, 2010-06-16 at 08:09 -0700, Andrew Pinski wrote:
>>
>> Sent from my iPhone
>>
>> On Jun 16, 2010, at 6:04 AM, Richard Guenther <richard.guent...@gmail.com> wrote:
>>
>> > On Wed, Jun 16, 2010 at 5:52 PM, Siarhei Siamashka
>> > <siarhei.siamas...@gmail.com> wrote:
>> >> Hello,
>> >>
>> >> Currently gcc (at least version 4.5.0) does a very poor job generating
>> >> single precision floating point code for ARM Cortex-A8.
>> >>
>> >> The source of this problem is the use of VFP instructions, which run on
>> >> the slow, non-pipelined VFP Lite unit in Cortex-A8. Even turning on
>> >> RunFast mode (flush denormals to zero, disable exceptions) provides only
>> >> a relatively minor performance gain.
>> >>
>> >> The right solution seems to be to use NEON instructions for most of the
>> >> single precision calculations.
>> >>
>> >> I wonder if it would be difficult to introduce the following changes to
>> >> the gcc generated code when optimizing for cortex-a8:
>> >> 1. Allocate single precision variables only to even-numbered (or only to
>> >>    odd-numbered) s-registers.
>> >> 2. Instead of using 'fadds s0, s0, s2' or similar instructions, use
>> >>    'vadd.f32 d0, d0, d1'.
>> >>
>> >> The number of single precision floating point registers is effectively
>> >> halved this way. Supporting '-mfloat-abi=hard' may be a bit tricky
>> >> (packing/unpacking of register pairs may be needed to pass parameters
>> >> to functions properly). There may also be other problems, such as
>> >> strict IEEE-754 compliance (maybe a special variable attribute for
>> >> relaxing compliance requirements could be useful). But this looks like
>> >> the only way to fix the poor performance on the ARM Cortex-A8 processor.
>> >>
>> >> Actually, clang 2.7 seems to work exactly this way, and it outperforms
>> >> gcc 4.5.0 by up to a factor of 2 or 3 on some single precision floating
>> >> point tests that I tried on ARM Cortex-A8.
>> >
>> > On i?86 we have -mfpmath={sse,x87}; I suppose you could add
>> > -mfpmath=neon for arm (properly conflicting with -mfloat-abi=hard
>> > and requiring neon support).
>>
>> Except that, unlike SSE, NEON does not fully support IEEE 754. So this
>> should only be done with -ffast-math :). The fact that it is slow is
>> not a good enough reason to change it to something that is wrong and fast.
>>
>
> We could document -mfpmath=neon as implying fast-math (or at least, no
> denormals and default NaNs).  If the user explicitly asks for floating
> point to be done via the Neon unit, it would be somewhat churlish to say
> "I don't believe you know what you're asking for... so I'll ignore you".

It's certainly reasonable to do that when the limitations are documented.
Rather than enabling -ffast-math, though, simply state the correct facts via
the MODE_HAS_* macros in real.h.

Richard.
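
For readers following the register renaming proposed at the start of the thread: the s-registers alias the halves of the d-registers, with s(2n) and s(2n+1) forming the low and high lanes of d(n). The sketch below only illustrates that mapping and why 'fadds s0, s0, s2' corresponds to 'vadd.f32 d0, d0, d1'. The helper names s_to_d_lane and rewrite_fadds are hypothetical and are not part of GCC or any other tool.

```python
def s_to_d_lane(s_reg):
    """Map single-precision register s<N> to (d-register index, lane).

    s(2n) is lane 0 (low half) of d(n); s(2n+1) is lane 1 (high half).
    """
    if not 0 <= s_reg <= 31:
        raise ValueError("s-register index must be in 0..31")
    return divmod(s_reg, 2)


def rewrite_fadds(dst, src1, src2):
    """Return the NEON instruction standing in for 'fadds s<dst>, s<src1>, s<src2>'.

    Hypothetical helper: assumes every operand lives in an even-numbered
    s-register, so each value sits in lane 0 and lane 1 is ignored.
    """
    d_regs = []
    for s_reg in (dst, src1, src2):
        d_reg, lane = s_to_d_lane(s_reg)
        if lane != 0:
            raise ValueError("s%d is not an even-numbered register" % s_reg)
        d_regs.append(d_reg)
    return "vadd.f32 d%d, d%d, d%d" % tuple(d_regs)


# The example from the original message.
print(rewrite_fadds(0, 0, 2))
```

Because each scalar occupies a whole d-register under this scheme, only 16 of the 32 s-registers remain usable, which is the halving mentioned in the original message.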

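
To make the IEEE concern in this thread concrete, here is a rough software model of flush-to-zero behaviour for single precision multiplication. It is a simplified sketch, not a description of the actual NEON hardware: it assumes denormal inputs and denormal results are both replaced by a signed zero, checks the result after rounding, and ignores default-NaN handling. The function names are made up for illustration.

```python
import math
import struct

FLT_MIN = 2.0 ** -126  # smallest positive normal binary32 value


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Replace a binary32 denormal with a zero of the same sign."""
    if x != 0.0 and abs(x) < FLT_MIN:
        return math.copysign(0.0, x)
    return x


def mul_ieee(a, b):
    """Single precision multiply with gradual underflow (IEEE 754 style)."""
    return to_f32(to_f32(a) * to_f32(b))


def mul_ftz(a, b):
    """Single precision multiply in the simplified flush-to-zero model."""
    a = flush_to_zero(to_f32(a))
    b = flush_to_zero(to_f32(b))
    return flush_to_zero(to_f32(a * b))


# Halving the smallest normal number: a denormal under IEEE 754 rules,
# but zero in the flush-to-zero model.
print(mul_ieee(FLT_MIN, 0.5), mul_ftz(FLT_MIN, 0.5))
```

The two results differ whenever a denormal appears as an input or as a result, which is the kind of difference the MODE_HAS_* approach would need to describe to the compiler.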