Hi Paul, please understand we know what we're talking about here :D
In summary: we want a new port that uses the hard floating point version of the EABI, where floating point arguments to functions are passed in floating point registers (sN, dN, qN) as the ABI allows. This is to get around the fact that in soft mode (no FPU at all) and softfp mode (can use an FPU) the EABI is defined such that all floating point arguments to a function are passed in integer registers - r0, r1 and so on, and in pairs of registers for double arguments.

In the case where FPU instruction generation is enabled (softfp plus -mfpu=vfpv3, for example - the softfp ABI does not by itself imply an FPU), significant code is inserted by the compiler to move data from integer to floating point registers before the FPU can use it. As Konstantinos explained, this is something on the order of 6 moves from integer to float registers and back again for a relatively simple function like sinf(), which takes one floating point argument and returns one. A vmov rN, sN has a penalty of around 20 cycles during which nothing useful is being done; it stalls the entire pipeline until it completes, you can't schedule around it, and it happens 6 times. (There's a small sketch of this below.)

The basic costs and benefits of each option are:

soft
 * FPU is entirely emulated. FPU work is done in integer registers.

softfp+vfp
 * Actual FPU is used, but FPU argument passing is done in integer registers due to the soft/softfp EABI spec. Your 10x speedup is here, and it comes from using the FPU instead of emulating it.
 * You can use NEON here, but you are still limited to passing float arguments in integer registers per the ABI.
 * Each register transfer between integer and float registers costs about 20 cycles.
 * Boost in performance from using the FPU or NEON instead of emulation.
 * Hidden performance penalty from the register transfers.
 * Compatible with the above - soft and softfp code can be mixed.

hard+vfp
 * The actual FPU is used in the same way; the FPU code itself does not run any faster.
 * Boost in performance from using the FPU or NEON is the same.
 * No hidden performance penalty.
 * Completely incompatible ABI with the two above - no code mixing.

That is what we're proposing, coupled with the benefits of compiling for an improved ISA (ARMv7-A instead of ARMv4) with better, more efficient instructions, potentially a slightly different strategy for scheduling instructions, and removing the need to run emulated FPU library code by specifying VFPv3-D16 as the base level of FPU required.

Using VFPv3-D16 in the base system means not having to deal with Debian multilib just to get FPU code; everything is FPU enabled by default. Debian multilib would be used to enable extra features such as NEON (which is still not in every ARMv7 processor) and the FP16 extension (which isn't present on any Cortex-A8).

That's what justifies the port: the fact that the ABI is incompatible, plus the baseline architecture requirement (it will no longer run on ARMv4 or ARMv5.. ARMv6 is possible if you're lucky), plus the baseline FPU requirement (needs VFPv3-D16 at least).

Why VFPv3-D16? Simply because VCVT and VMOV with an immediate operand offer immediate optimization opportunities. Converting between integer and floating point is a very common need (think floor() and ceil() kind of stuff), and being able to put immediate values in FP registers is the first thing you learn when optimizing for AltiVec (vec_splat is your greatest friend!) because it reduces the need to access memory, which causes pipeline stalls.
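To make the register shuffling concrete, here is a minimal sketch - the function name, build lines and exact instruction sequences are illustrative, not taken from any particular compiler's output - of a one-float-in, one-float-out function under the two ABIs:

    /* Illustrative only.  Assumed build lines:
     *   softfp:  gcc -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp -O2 -c fsquare.c
     *   hard:    gcc -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard   -O2 -c fsquare.c
     */
    float fsquare(float x)
    {
        /* softfp: x arrives in r0 and the result must be returned in r0,
         * so the compiler has to bracket the FPU work with transfers,
         * roughly:
         *     vmov     s0, r0      @ integer -> FP register (pipeline stall)
         *     vmul.f32 s0, s0, s0
         *     vmov     r0, s0      @ FP -> integer register (pipeline stall)
         *
         * hard: x arrives in s0 and the result is returned in s0, so the
         * vmov transfers disappear and only the vmul.f32 remains.
         */
        return x * x;
    }

A caller that feeds the result straight into another float function (sinf() into something else, say) pays the same transfer cost again on the way in, which is how a "simple" call chain ends up with half a dozen of these stalls under softfp.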
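Along the same lines, a small sketch of the VFPv3-D16 point - the function is hypothetical and only exists to put a constant and a conversion in one place:

    /* Illustrative only, assuming -mfloat-abi=hard -mfpu=vfpv3-d16. */
    float round_positive(float x)
    {
        /* Nearest-integer for non-negative x - the floor()/ceil() kind of
         * stuff.  0.5f is one of the constants VFPv3 can encode directly
         * as an immediate (vmov.f32 sN, #0.5), so no literal-pool load
         * from memory is needed, and the float<->int conversions are vcvt
         * instructions that stay in the VFP register file.  With the
         * hard-float ABI, x arrives in s0 and the result leaves in s0,
         * so no integer registers are involved at all.
         */
        return (float)(int)(x + 0.5f);
    }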
Because we're used to using AltiVec, we think we'd absolutely, positively miss that functionality if we were restricted to VFPv2, which does not include it :)

No, we don't need to do anything but change the ABI for the purposes of the port, but given all the multilib mess of 10 different FPU types and slightly better architectures (ARMv5, ARMv6) than the one the port is compiled for, this is an opportunity to clean things up a bit and reduce the workload by standardizing, at the very least, on a common denominator (which just happens to be the Marvell ARMADA 500) while running well on Tegra2 and still getting pretty close to the best possible performance on Snapdragon, i.MX51 and OMAP3.

I am fairly sure you could find (oh, you did!) a contrived benchmark to show that some code is faster on softfp in some cases, but taking a holistic approach I find it hard to believe that, every time a floating point function is called across any of the 20,000 packages possibly running on a Debian system, you would be able to benchmark a softfp+vfp system running faster than a hard+vfp one. The features outlined above in the VFPv3 spec, plus the ability to judge the benefits of VFP vs. NEON without compilers generating special magic in the way, will help people out and make for a "nicer" system.

Any NEON optimizations made on this "performance blend" of Debian for armv7+hard+vfp will backport easily to the "armel" port and work just the same, with the same relative improvement; they just won't have the *base* performance of the port.

Anyway, I think everyone agrees that it should be done, just not on the name..

--
Matt Sealey <m...@genesi-usa.com>
Product Development Analyst, Genesi USA, Inc.