Richard Henderson <richard.hender...@linaro.org> writes:

> On 5/22/23 15:26, BALATON Zoltan wrote:
>> On Mon, 22 May 2023, Alex Bennée wrote:
>>> (ajb: add Richard for his compiler-fu)
>>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>>> BALATON Zoltan <bala...@eik.bme.hu> writes:
>>>>>
>>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>>> used in performance critical places. It crept into target/ppc via
>>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>>> where hardfloat is also disabled is doubly affected.
>>>>>
>>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>>> verify the constants are in the range. For example examining
>>>>> the start of float64_add:
>>>>>
>>> <snip>
>>>>>
>>>>> I don't see any check and abort steps because all the shift and mask
>>>>> values are known at compile time. The softfloat compilation certainly
>>>>> does have some assert points though:
>>>>>
>>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND 
>>>>> g_assertion_mess[...]
>>>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>>>
>>>>> but the references are for the ISRA segments so its tricky to know if
>>>>> they get used or are just there for LTO purposes.
>>>>>
>>>>> If there are hot-paths that show up the extract/deposit functions I
>>>>> suspect a better approach would be to implement _nocheck variants (or
>>>>> maybe _noassert?) and use them where required rather than turning off
>>>>> the assert checking for these utility functions.
>>>>
>>>> Just to clarify again, the asserts are still there when compiled with
>>>> --enable-debug. The patch only turns them off for optimised release
>>>> builds which I think makes sense if these asserts are to catch
>>>> programming errors.
>>>
>>> Well as Peter said the general policy is to keep asserts in but I
>>> appreciate this is a hotpath case.
>>>
>>>> I think I've also suggested adding noassert
>>>> versions of these but that wasn't a popular idea and it may also not
>>>> be easy to convert all places to use that like for example the
>>>> register fields related usage in target/ppc as that would also affect
>>>> other places.
>>>
>>> Is code generation or device emulation really on the hot-path. Generally
>>> a well predicted assert is in the noise for those operations.
>> They aren't in code generation but in helpers as you can also see in
>> the profile below and so they can be on hot path. Also I've noticed
>> that extract8 and extract16 just call extract32 after adding another
>> assert on their own in addition to the one in extract32 which is
>> double overhead for really no reason. I'd delete all these asserts
>> as the likelhood of bugs these could catch is very low anyway (how
>> often do you expect somebody to call these with out of bound values
>> that would not be obvious from the results otherwise?) but leaving
>> them in non-debug builds is totally useless in my opinion.
>> 
>>>> So this seems to be the simplest and most effective
>>>> approach.
>>>>
>>>> The softfloat related usage in these tests I've done seem to mostly
>>>> come from unpacking and repacking floats in softfloat which is done
>>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>>> small overheads add app quickly. Just 1 muladd will result in 9
>>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>>> op adds a bunch more. I did some profiling with perf to find these.
>>>
>>> After some messing about trying to get lame to cross compile to a static
>>> binary I was able to replicate what you've seen:
>>>
>>>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>>>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>>>   8.26%  qemu-ppc64  qemu-ppc64               [.] 
>>> helper_compute_fprf_float64
>>>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>>>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>>>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>>>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>>>   3.62%  qemu-ppc64  qemu-ppc64               [.] 
>>> float64r32_round_pack_canonical
>>>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>>>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>>>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>>>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>>>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>>>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>>>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>>>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>>>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>>>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>>>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd
>> I've run 32 bit PPC version in qemu-system-ppc so the profile is a
>> bit different (has more system related overhead that I plan to look
>> at separately) but this part is similar to the above. I also wonder
>> what makes helper_compute_fprf_float64 a bottleneck as that does not
>> seem to have much extract/deposit, only a call to clz but it's hard
>> to tell what it really does due to nested calls and macros. I've
>> also seen this function among the top contenders in my profiling.
>> 
>>> what I find confusing is the fact the parts extraction and packing
>>> should all be known constants which should cause the asserts to
>>> disappear. However it looks like the compiler decided to bring in a copy
>>> of the whole inline function (ISRA = >interprocedural scalar replacement
>>> of aggregates) which obviously can't fold the constants and eliminate
>>> the assert.
>> Could it be related to that while the parts size and start are
>> marked const but pulled out of a struct field so the compiler may
>> not know their actual value until run time?
>> 
>>> Richard,
>>>
>>> Any idea of why the compiler might decide to do something like this?
>
> Try this.

That seems to have done the trick, translated code is now dominating the
profile:

+   14.12%     0.00%  qemu-ppc64  [unknown]                [.] 
0x0000004000619420
+   13.30%     0.00%  qemu-ppc64  [unknown]                [.] 
0x0000004000616850
+   12.58%    12.19%  qemu-ppc64  qemu-ppc64               [.] 
parts64_uncanon_normal
+   10.62%     0.00%  qemu-ppc64  [unknown]                [.] 
0x000000400061bf70
+    9.91%     9.73%  qemu-ppc64  qemu-ppc64               [.] 
helper_compute_fprf_float64
+    7.84%     7.82%  qemu-ppc64  qemu-ppc64               [.] 
do_float_check_status
+    6.47%     5.78%  qemu-ppc64  qemu-ppc64               [.] 
parts64_canonicalize.constprop.0
+    6.46%     0.00%  qemu-ppc64  [unknown]                [.] 
0x0000004000620130
+    6.42%     0.00%  qemu-ppc64  [unknown]                [.] 
0x0000004000619400
+    6.17%     6.04%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
+    5.85%     0.00%  qemu-ppc64  [unknown]                [.] 
0x00000040006167e0
+    5.74%     0.00%  qemu-ppc64  [unknown]                [.] 
0x0000b693fcffffd3
+    5.45%     4.78%  qemu-ppc64  qemu-ppc64               [.] 
float64r32_round_pack_canonical
+    4.79%     0.00%  qemu-ppc64  [unknown]                [.] 
0xfaff203940fe8240
+    4.29%     3.82%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
+    4.20%     3.87%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
+    3.51%     2.98%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
+    3.12%     2.97%  qemu-ppc64  qemu-ppc64               [.] 
float64_hs_compare
+    3.06%     2.95%  qemu-ppc64  qemu-ppc64               [.] float64_add
+    2.88%     2.35%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
+    2.65%     0.00%  qemu-ppc64  [unknown]                [.] 
0x6c555e9100004039
+    2.55%     1.30%  qemu-ppc64  qemu-ppc64               [.] 
helper_float_check_status

eyeballing the float functions and I can't see any asserts expressed.

Before:

Executed in  721.65 secs    fish           external
   usr time  721.38 secs  561.00 micros  721.38 secs
   sys time    0.20 secs  261.00 micros    0.20 secs

After:

Executed in  650.38 secs    fish           external
   usr time  650.11 secs    0.00 micros  650.11 secs
   sys time    0.20 secs  989.00 micros    0.20 secs



>
>
> r~
>
> [2. text/x-patch; z.patch]...


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Reply via email to