Re: [RFC PATCH 0/2] Enable hardfloat for PPC

BALATON Zoltan Mon, 17 Feb 2020 03:27:51 -0800

On Mon, 17 Feb 2020, Peter Maydell wrote:

On Mon, 17 Feb 2020 at 02:43, BALATON Zoltan <bala...@eik.bme.hu> wrote:

Hello,


This is an RFC series to start exploring the possibility of enabling
hardfloat for the PPC target, something that hasn't progressed in the last two years.
Hopefully we can work out something now. Previously I've explored this
here:

https://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html

where some ad-hoc benchmarks using the lame mp3 encoder are also explained.
The encoder has two versions: one using VMX and another using only FP. Both
are mostly floating-point bound. I've run this test on mac99 under
MorphOS before and after my patches, also verifying that the md5sum of the
resulting mp3 matches (this is no proof of correctness, but it may
show that it did not break too much, at least for the ops used by this
program).

I hope others can contribute to this by doing more testing to find out
what else this would break, or by giving some ideas on how this could be
improved.


I think the ideal would be to test against a reference using
risu to see whether this changes behaviour (FP results should
be bit-for-bit identical; application-level testing is
often not sufficient to detect this). You could test either

Sure, thanks. I did not mean to claim that the simple test I've done was sufficient, but I expect others who have an interest in this and are more experienced in such testing (or are even paid to work on QEMU, which I'm not) to contribute. So I did not test it more thoroughly than needed to show that it could be considerably faster and still work for at least some workloads, which makes it worth working on. I'm surprised that in the two years since hardfloat was merged nobody has even tried this (or those who did dropped the idea before getting any results, without letting us know). So I tried to make a start and explore what it would take to fix this eventually, but I don't want to do that alone. I hope this inspires others to help, e.g. with testing, and that we can reach a solution together.

against real hardware or against the non-hardfloat QEMU.
I'm not sure how comprehensive the coverage for ppc insns
is but there are a fair number of fp insns covered already:
https://git.linaro.org/people/peter.maydell/risu.git/tree/

I don't have real hardware, and testing against QEMU may take longer; I'm also not sure how useful it would be. There could also be preexisting bugs, although some fixes were made to the PPC FP implementation recently. Maybe I'll have a look if I have nothing better to do, but I have other ongoing QEMU-related projects on which I'd also like to make some progress.

It's also worth testing any alternate/non-standard config
modes the FPU might have (eg different default rounding modes,
any flush-to-zero or alternate denormal handling, that kind
of thing), and not just the default how-the-CPU-boots-up mode.

Inexact exceptions are currently expected to break until a better way can be found to handle them, but I think hardfloat is already disabled for rounding modes and FPU settings other than the default, so maybe those should not break. According to:

https://git.qemu.org/?p=qemu.git;a=blob;f=fpu/softfloat.c;h=301ce3b537b6c0eee5dbbc358587b66a3a341d2a;hb=HEAD#l235

static inline bool can_use_fpu(const float_status *s)
{
    if (QEMU_NO_HARDFLOAT) {
        return false;
    }
    return likely(s->float_exception_flags & float_flag_inexact &&
                  s->float_rounding_mode == float_round_nearest_even);
}

and

https://git.qemu.org/?p=qemu.git;a=blob;f=fpu/softfloat.c;h=301ce3b537b6c0eee5dbbc358587b66a3a341d2a;hb=HEAD#l99

/*
 * Hardfloat
 *
 * Fast emulation of guest FP instructions is challenging for two reasons.
 * First, FP instruction semantics are similar but not identical, particularly
 * when handling NaNs. Second, emulating at reasonable speed the guest FP
 * exception flags is not trivial: reading the host's flags register with a
 * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp],
 * and trapping on every FP exception is not fast nor pleasant to work with.
 *
 * We address these challenges by leveraging the host FPU for a subset of the
 * operations. To do this we expand on the idea presented in this paper:
 *
 * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
 * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.
 *
 * The idea is thus to leverage the host FPU to (1) compute FP operations
 * and (2) identify whether FP exceptions occurred while avoiding
 * expensive exception flag register accesses.
 *
 * An important optimization shown in the paper is that given that exception
 * flags are rarely cleared by the guest, we can avoid recomputing some flags.
 * This is particularly useful for the inexact flag, which is very frequently
 * raised in floating-point workloads.
 *
 * We optimize the code further by deferring to soft-fp whenever FP exception
 * detection might get hairy. Two examples: (1) when at least one operand is
 * denormal/inf/NaN; (2) when operands are not guaranteed to lead to a 0 result
 * and the result is < the minimum normal.
 */
#define GEN_INPUT_FLUSH__NOCHECK(name, soft_t)                          \
    static inline void name(soft_t *a, float_status *s)                 \
    {                                                                   \
        if (unlikely(soft_t ## _is_denormal(*a))) {                     \
            *a = soft_t ## _set_sign(soft_t ## _zero,                   \
                                     soft_t ## _is_neg(*a));            \
            s->float_exception_flags |= float_flag_input_denormal;      \
        }                                                               \
    }

Regards,
BALATON Zoltan
