Richard Henderson <richard.hender...@linaro.org> writes:
> On 5/22/23 15:26, BALATON Zoltan wrote: >> On Mon, 22 May 2023, Alex Bennée wrote: >>> (ajb: add Richard for his compiler-fu) >>> BALATON Zoltan <bala...@eik.bme.hu> writes: >>>> On Mon, 22 May 2023, Alex Bennée wrote: >>>>> BALATON Zoltan <bala...@eik.bme.hu> writes: >>>>> >>>>>> The low level extract and deposit funtions provided by bitops.h are >>>>>> used in performance critical places. It crept into target/ppc via >>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU >>>>>> where hardfloat is also disabled is doubly affected. >>>>> >>>>> Most of these asserts compile out to nothing if the compiler is able to >>>>> verify the constants are in the range. For example examining >>>>> the start of float64_add: >>>>> >>> <snip> >>>>> >>>>> I don't see any check and abort steps because all the shift and mask >>>>> values are known at compile time. The softfloat compilation certainly >>>>> does have some assert points though: >>>>> >>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o |grep assert >>>>> 136: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND >>>>> g_assertion_mess[...] >>>>> 138: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __assert_fail >>>>> >>>>> but the references are for the ISRA segments so its tricky to know if >>>>> they get used or are just there for LTO purposes. >>>>> >>>>> If there are hot-paths that show up the extract/deposit functions I >>>>> suspect a better approach would be to implement _nocheck variants (or >>>>> maybe _noassert?) and use them where required rather than turning off >>>>> the assert checking for these utility functions. >>>> >>>> Just to clarify again, the asserts are still there when compiled with >>>> --enable-debug. The patch only turns them off for optimised release >>>> builds which I think makes sense if these asserts are to catch >>>> programming errors. >>> >>> Well as Peter said the general policy is to keep asserts in but I >>> appreciate this is a hotpath case. >>> >>>> I think I've also suggested adding noassert >>>> versions of these but that wasn't a popular idea and it may also not >>>> be easy to convert all places to use that like for example the >>>> register fields related usage in target/ppc as that would also affect >>>> other places. >>> >>> Is code generation or device emulation really on the hot-path. Generally >>> a well predicted assert is in the noise for those operations. >> They aren't in code generation but in helpers as you can also see in >> the profile below and so they can be on hot path. Also I've noticed >> that extract8 and extract16 just call extract32 after adding another >> assert on their own in addition to the one in extract32 which is >> double overhead for really no reason. I'd delete all these asserts >> as the likelhood of bugs these could catch is very low anyway (how >> often do you expect somebody to call these with out of bound values >> that would not be obvious from the results otherwise?) but leaving >> them in non-debug builds is totally useless in my opinion. >> >>>> So this seems to be the simplest and most effective >>>> approach. >>>> >>>> The softfloat related usage in these tests I've done seem to mostly >>>> come from unpacking and repacking floats in softfloat which is done >>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3 >>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even >>>> small overheads add app quickly. Just 1 muladd will result in 9 >>>> extracts and 2 deposits at least plus updating PPC flags for each FPU >>>> op adds a bunch more. I did some profiling with perf to find these. >>> >>> After some messing about trying to get lame to cross compile to a static >>> binary I was able to replicate what you've seen: >>> >>> 11.44% qemu-ppc64 qemu-ppc64 [.] unpack_raw64.isra.0 >>> 11.03% qemu-ppc64 qemu-ppc64 [.] parts64_uncanon_normal >>> 8.26% qemu-ppc64 qemu-ppc64 [.] >>> helper_compute_fprf_float64 >>> 6.75% qemu-ppc64 qemu-ppc64 [.] do_float_check_status >>> 5.34% qemu-ppc64 qemu-ppc64 [.] parts64_muladd >>> 4.75% qemu-ppc64 qemu-ppc64 [.] pack_raw64.isra.0 >>> 4.38% qemu-ppc64 qemu-ppc64 [.] parts64_canonicalize >>> 3.62% qemu-ppc64 qemu-ppc64 [.] >>> float64r32_round_pack_canonical >>> 3.32% qemu-ppc64 qemu-ppc64 [.] helper_todouble >>> 2.68% qemu-ppc64 qemu-ppc64 [.] float64_add >>> 2.51% qemu-ppc64 qemu-ppc64 [.] float64_hs_compare >>> 2.30% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd >>> 1.80% qemu-ppc64 qemu-ppc64 [.] float64r32_mul >>> 1.40% qemu-ppc64 qemu-ppc64 [.] float64r32_add >>> 1.34% qemu-ppc64 qemu-ppc64 [.] parts64_mul >>> 1.16% qemu-ppc64 qemu-ppc64 [.] parts64_addsub >>> 1.14% qemu-ppc64 qemu-ppc64 [.] helper_reset_fpstatus >>> 1.06% qemu-ppc64 qemu-ppc64 [.] helper_float_check_status >>> 1.04% qemu-ppc64 qemu-ppc64 [.] float64_muladd >> I've run 32 bit PPC version in qemu-system-ppc so the profile is a >> bit different (has more system related overhead that I plan to look >> at separately) but this part is similar to the above. I also wonder >> what makes helper_compute_fprf_float64 a bottleneck as that does not >> seem to have much extract/deposit, only a call to clz but it's hard >> to tell what it really does due to nested calls and macros. I've >> also seen this function among the top contenders in my profiling. >> >>> what I find confusing is the fact the parts extraction and packing >>> should all be known constants which should cause the asserts to >>> disappear. However it looks like the compiler decided to bring in a copy >>> of the whole inline function (ISRA = >interprocedural scalar replacement >>> of aggregates) which obviously can't fold the constants and eliminate >>> the assert. >> Could it be related to that while the parts size and start are >> marked const but pulled out of a struct field so the compiler may >> not know their actual value until run time? >> >>> Richard, >>> >>> Any idea of why the compiler might decide to do something like this? > > Try this. That seems to have done the trick, translated code is now dominating the profile: + 14.12% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000619420 + 13.30% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000616850 + 12.58% 12.19% qemu-ppc64 qemu-ppc64 [.] parts64_uncanon_normal + 10.62% 0.00% qemu-ppc64 [unknown] [.] 0x000000400061bf70 + 9.91% 9.73% qemu-ppc64 qemu-ppc64 [.] helper_compute_fprf_float64 + 7.84% 7.82% qemu-ppc64 qemu-ppc64 [.] do_float_check_status + 6.47% 5.78% qemu-ppc64 qemu-ppc64 [.] parts64_canonicalize.constprop.0 + 6.46% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000620130 + 6.42% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000619400 + 6.17% 6.04% qemu-ppc64 qemu-ppc64 [.] parts64_muladd + 5.85% 0.00% qemu-ppc64 [unknown] [.] 0x00000040006167e0 + 5.74% 0.00% qemu-ppc64 [unknown] [.] 0x0000b693fcffffd3 + 5.45% 4.78% qemu-ppc64 qemu-ppc64 [.] float64r32_round_pack_canonical + 4.79% 0.00% qemu-ppc64 [unknown] [.] 0xfaff203940fe8240 + 4.29% 3.82% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd + 4.20% 3.87% qemu-ppc64 qemu-ppc64 [.] helper_todouble + 3.51% 2.98% qemu-ppc64 qemu-ppc64 [.] float64r32_mul + 3.12% 2.97% qemu-ppc64 qemu-ppc64 [.] float64_hs_compare + 3.06% 2.95% qemu-ppc64 qemu-ppc64 [.] float64_add + 2.88% 2.35% qemu-ppc64 qemu-ppc64 [.] float64r32_add + 2.65% 0.00% qemu-ppc64 [unknown] [.] 0x6c555e9100004039 + 2.55% 1.30% qemu-ppc64 qemu-ppc64 [.] helper_float_check_status eyeballing the float functions and I can't see any asserts expressed. Before: Executed in 721.65 secs fish external usr time 721.38 secs 561.00 micros 721.38 secs sys time 0.20 secs 261.00 micros 0.20 secs After: Executed in 650.38 secs fish external usr time 650.11 secs 0.00 micros 650.11 secs sys time 0.20 secs 989.00 micros 0.20 secs > > > r~ > > [2. text/x-patch; z.patch]... -- Alex Bennée Virtualisation Tech Lead @ Linaro