On 5/22/23 15:26, BALATON Zoltan wrote:
On Mon, 22 May 2023, Alex Bennée wrote:
(ajb: add Richard for his compiler-fu)
BALATON Zoltan <bala...@eik.bme.hu> writes:
On Mon, 22 May 2023, Alex Bennée wrote:
BALATON Zoltan <bala...@eik.bme.hu> writes:
The low level extract and deposit functions provided by bitops.h are
used in performance critical places. They crept into target/ppc via
FIELD_EX64 and are also used by softfloat, so PPC code that uses the FPU
heavily, where hardfloat is also disabled, is doubly affected.
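(For context, FIELD_EX64 from hw/registerfields.h expands to an extract64()
call, roughly like this, quoted from memory rather than verbatim:

    #define FIELD_EX64(storage, reg, field)                           \
        extract64((storage), R_ ## reg ## _ ## field ## _SHIFT,       \
                  R_ ## reg ## _ ## field ## _LENGTH)

so every register field read goes through the asserting extract helper.)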
Most of these asserts compile out to nothing if the compiler is able to
verify the constants are in range. For example, examining
the start of float64_add:
<snip>
I don't see any check and abort steps because all the shift and mask
values are known at compile time. The softfloat compilation certainly
does have some assert points though:
readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o |grep assert
136: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND g_assertion_mess[...]
138: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __assert_fail
but the references are for the ISRA segments so it's tricky to know if
they get used or are just there for LTO purposes.
If there are hot-paths that show up the extract/deposit functions I
suspect a better approach would be to implement _nocheck variants (or
maybe _noassert?) and use them where required rather than turning off
the assert checking for these utility functions.
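E.g. something like this (just a sketch; extract64_nocheck is a made-up
name, not an existing bitops.h function):

    /* Same semantics as extract64() minus the range assert, for hot
     * paths where the caller already guarantees 0 <= start, 0 < length
     * and start + length <= 64. */
    static inline uint64_t extract64_nocheck(uint64_t value, int start,
                                             int length)
    {
        return (value >> start) & (~0ULL >> (64 - length));
    }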
Just to clarify again, the asserts are still there when compiled with
--enable-debug. The patch only turns them off for optimised release
builds, which I think makes sense if these asserts are to catch
programming errors.
Well, as Peter said, the general policy is to keep asserts in, but I
appreciate this is a hotpath case.
I think I've also suggested adding noassert
versions of these, but that wasn't a popular idea, and it may also not
be easy to convert all places to use them, like for example the
register field related usage in target/ppc, as that would also affect
other places.
Is code generation or device emulation really on the hot path? Generally
a well-predicted assert is in the noise for those operations.
They aren't in code generation but in helpers, as you can also see in the profile below, and
so they can be on the hot path. Also I've noticed that extract8 and extract16 just call
extract32 after adding another assert of their own on top of the one in extract32,
which is double overhead for really no reason. I'd delete all these asserts as the
likelihood of bugs these could catch is very low anyway (how often do you expect somebody
to call these with out-of-range values that would not be obvious from the results
otherwise?), but leaving them in non-debug builds is totally useless in my opinion.
So this seems to be the simplest and most effective
approach.
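For reference, the current helpers in include/qemu/bitops.h look roughly
like this (from memory, not verbatim):

    static inline uint32_t extract32(uint32_t value, int start, int length)
    {
        assert(start >= 0 && length > 0 && length <= 32 - start);
        return (value >> start) & (~0U >> (32 - length));
    }

    static inline uint8_t extract8(uint8_t value, int start, int length)
    {
        assert(length <= 8 - start);            /* first range check */
        return extract32(value, start, length); /* asserts the range again */
    }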
The softfloat related usage in these tests seems to mostly
come from unpacking and repacking floats in softfloat, which is done
for every operation. E.g. muladd, which mp3 encoding mostly uses, does 3
unpacks and 1 pack for each call, and each unpack is 3 extracts, so even
small overheads add up quickly. Just 1 muladd will result in at least 9
extracts and 2 deposits, plus updating PPC flags for each FPU
op adds a bunch more. I did some profiling with perf to find these.
After some messing about trying to get lame to cross compile to a static
binary I was able to replicate what you've seen:
11.44% qemu-ppc64 qemu-ppc64 [.] unpack_raw64.isra.0
11.03% qemu-ppc64 qemu-ppc64 [.] parts64_uncanon_normal
8.26% qemu-ppc64 qemu-ppc64 [.] helper_compute_fprf_float64
6.75% qemu-ppc64 qemu-ppc64 [.] do_float_check_status
5.34% qemu-ppc64 qemu-ppc64 [.] parts64_muladd
4.75% qemu-ppc64 qemu-ppc64 [.] pack_raw64.isra.0
4.38% qemu-ppc64 qemu-ppc64 [.] parts64_canonicalize
3.62% qemu-ppc64 qemu-ppc64 [.] float64r32_round_pack_canonical
3.32% qemu-ppc64 qemu-ppc64 [.] helper_todouble
2.68% qemu-ppc64 qemu-ppc64 [.] float64_add
2.51% qemu-ppc64 qemu-ppc64 [.] float64_hs_compare
2.30% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd
1.80% qemu-ppc64 qemu-ppc64 [.] float64r32_mul
1.40% qemu-ppc64 qemu-ppc64 [.] float64r32_add
1.34% qemu-ppc64 qemu-ppc64 [.] parts64_mul
1.16% qemu-ppc64 qemu-ppc64 [.] parts64_addsub
1.14% qemu-ppc64 qemu-ppc64 [.] helper_reset_fpstatus
1.06% qemu-ppc64 qemu-ppc64 [.] helper_float_check_status
1.04% qemu-ppc64 qemu-ppc64 [.] float64_muladd
I've run the 32-bit PPC version in qemu-system-ppc so the profile is a bit different (it has more
system related overhead that I plan to look at separately), but this part is similar to the
above. I also wonder what makes helper_compute_fprf_float64 a bottleneck, as that does not
seem to have much extract/deposit, only a call to clz, but it's hard to tell what it really
does due to nested calls and macros. I've also seen this function among the top contenders
in my profiling.
What I find confusing is the fact that the parts extraction and packing
should all use known constants, which should cause the asserts to
disappear. However it looks like the compiler decided to emit a copy
of the whole inline function (ISRA = interprocedural scalar replacement
of aggregates) which obviously can't fold the constants and eliminate
the assert.
Could it be related to the fact that, while the parts size and start are marked const, they are pulled out
of a struct field, so the compiler may not know their actual values until run time?
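I.e. something like this (hand-written illustration, not the actual softfloat
code; get_frac is a made-up name):

    /* If 'fmt' is only known at run time (as in an out-of-line ISRA copy
     * of unpack_raw64), fmt->frac_size is not a compile-time constant and
     * the range assert inside extract64() cannot be folded away. */
    static uint64_t get_frac(const FloatFmt *fmt, uint64_t raw)
    {
        return extract64(raw, 0, fmt->frac_size);
    }

Whereas when the call is inlined with a known pointer such as &float64_params
the field is a known constant and the assert compiles out.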
Richard,
Any idea of why the compiler might decide to do something like this?
Try this.
r~
diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 108f9cb224..42e6c188b4 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -593,27 +593,27 @@ static void unpack_raw64(FloatParts64 *r, const FloatFmt *fmt, uint64_t raw)
};
}
-static inline void float16_unpack_raw(FloatParts64 *p, float16 f)
+static void QEMU_FLATTEN float16_unpack_raw(FloatParts64 *p, float16 f)
{
unpack_raw64(p, &float16_params, f);
}
-static inline void bfloat16_unpack_raw(FloatParts64 *p, bfloat16 f)
+static void QEMU_FLATTEN bfloat16_unpack_raw(FloatParts64 *p, bfloat16 f)
{
unpack_raw64(p, &bfloat16_params, f);
}
-static inline void float32_unpack_raw(FloatParts64 *p, float32 f)
+static void QEMU_FLATTEN float32_unpack_raw(FloatParts64 *p, float32 f)
{
unpack_raw64(p, &float32_params, f);
}
-static inline void float64_unpack_raw(FloatParts64 *p, float64 f)
+static void QEMU_FLATTEN float64_unpack_raw(FloatParts64 *p, float64 f)
{
unpack_raw64(p, &float64_params, f);
}
-static void floatx80_unpack_raw(FloatParts128 *p, floatx80 f)
+static void QEMU_FLATTEN floatx80_unpack_raw(FloatParts128 *p, floatx80 f)
{
*p = (FloatParts128) {
.cls = float_class_unclassified,
@@ -623,7 +623,7 @@ static void floatx80_unpack_raw(FloatParts128 *p, floatx80 f)
};
}
-static void float128_unpack_raw(FloatParts128 *p, float128 f)
+static void QEMU_FLATTEN float128_unpack_raw(FloatParts128 *p, float128 f)
{
const int f_size = float128_params.frac_size - 64;
const int e_size = float128_params.exp_size;
@@ -650,27 +650,27 @@ static uint64_t pack_raw64(const FloatParts64 *p, const FloatFmt *fmt)
return ret;
}
-static inline float16 float16_pack_raw(const FloatParts64 *p)
+static float16 QEMU_FLATTEN float16_pack_raw(const FloatParts64 *p)
{
return make_float16(pack_raw64(p, &float16_params));
}
-static inline bfloat16 bfloat16_pack_raw(const FloatParts64 *p)
+static bfloat16 QEMU_FLATTEN bfloat16_pack_raw(const FloatParts64 *p)
{
return pack_raw64(p, &bfloat16_params);
}
-static inline float32 float32_pack_raw(const FloatParts64 *p)
+static float32 QEMU_FLATTEN float32_pack_raw(const FloatParts64 *p)
{
return make_float32(pack_raw64(p, &float32_params));
}
-static inline float64 float64_pack_raw(const FloatParts64 *p)
+static float64 QEMU_FLATTEN float64_pack_raw(const FloatParts64 *p)
{
return make_float64(pack_raw64(p, &float64_params));
}
-static float128 float128_pack_raw(const FloatParts128 *p)
+static float128 QEMU_FLATTEN float128_pack_raw(const FloatParts128 *p)
{
const int f_size = float128_params.frac_size - 64;
const int e_size = float128_params.exp_size;
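(For reference, QEMU_FLATTEN comes from include/qemu/compiler.h and is, as far
as I remember, just:

    #define QEMU_FLATTEN __attribute__((flatten))

i.e. it asks the compiler to inline every call made inside the annotated
function, so unpack_raw64()/pack_raw64() get inlined into each per-format
wrapper where the FloatFmt constants are known and the asserts in the
extract/deposit helpers can fold away.)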