On Tue, Nov 12, 2024 at 10:42:50PM +0000, Richard Sandiford wrote: > Sorry for the slow review. I think Jeff's much better placed to comment > on this than I am, but here's a stab. Mostly it looks really good to me > FWIW. > > Andrew Carlotti <andrew.carlo...@arm.com> writes: > > This pass is used to optimise assignments to the FPMR register in > > aarch64. I chose to implement this as a middle-end pass because it > > mostly reuses the existing RTL PRE code within gcse.cc. > > > > Compared to RTL PRE, the key difference in this new pass is that we > > insert new writes directly to the destination hardreg, instead of > > writing to a new pseudo-register and copying the result later. This > > requires changes to the analysis portion of the pass, because sets > > cannot be moved before existing instructions that set, use or clobber > > the hardreg, and the value becomes unavailable after any uses of > > clobbers of the hardreg. > > > > This patch would currently break any debug instructions that use the > > value of fpmr in a region of code where that value is changed by this > > pass. I haven't worked out the best way to fix this, but I suspect the > > issue is uncommon and tricky enough that it would be best to just drop > > those debug instructions. > > Yeah, good question, and pass on that :) Will need to think more about it.
I've looked into this a bit more, and there's some interesting quirks in the existing behaviour. It looks like we might always be ok at the moment, but it would be safer to add code to handle this properly. My current idea for a conservative approach to handle this is that if I detect any debug insn using the fpmr register, then I could create a new debug variable to replace it, and create assignments to this debug variable that clone all existing assignments to the fpmr register. Below are some dumps that helped me understand what's happening. A couple of points of interest: 1. I'm surprised that 045t.cddce1 is losing debug information for dead results of intrinsic calls. This is an issue for existing intrinsics as well (e.g. vrndmq_f16). I can vaguely see why this might be happening, but I wonder whether there's anything we can do better here. It's not really relevant to this patch, however, but it did present an extra barrier to getting the debug rtl I was wanting to examine. 2. I think 274r.cse1 is the first pass that can eliminate redundant fpmr assignments, and it looks like this will also create debug variables for the fpmr input to any debug_insns that use an fpmr value. I don't think any of the other passes between cse1 and hardreg-pre can break this, so I believe we get lucky here. However, as I say above, I think it would should check for and handle any debug_insn uses anyway. ------ Source ------ #include <arm_neon.h> float16x8_t bat(float16x8_t a, mfloat8x16_t b, mfloat8x16_t c) { a = vmlalbq_f16_mf8_fpm(a, b, c, 13); float16x8_t zxcx = vmlalbq_f16_mf8_fpm(a, b, c, 13); float16x8_t zxcy = vmlalbq_f16_mf8_fpm(zxcx, b, c, 13); return a; } ------ Built with ------ gcc fp8-debug.cc -S -g -Og -march=armv8-a+fp8 -fdump-tree-all -fdump-rtl-all ------ 043t.mergephi1 ------ <bb 2> : # DEBUG BEGIN_STMT a_6 = vmlalbq_f16_mf8_fpm (a_2(D), b_3(D), c_4(D), 13); # DEBUG a => a_6 # DEBUG BEGIN_STMT zxcx_8 = vmlalbq_f16_mf8_fpm (a_6, b_3(D), c_4(D), 13); # DEBUG zxcx => zxcx_8 # DEBUG BEGIN_STMT zxcy_10 = vmlalbq_f16_mf8_fpm (zxcx_8, b_3(D), c_4(D), 13); # DEBUG zxcy => zxcy_10 # DEBUG BEGIN_STMT return a_6; ------ 045t.cddce1 ------ <bb 2> : # DEBUG BEGIN_STMT a_6 = vmlalbq_f16_mf8_fpm (a_2(D), b_3(D), c_4(D), 13); # DEBUG a => a_6 # DEBUG BEGIN_STMT zxcx_8 = vmlalbq_f16_mf8_fpm (a_6, b_3(D), c_4(D), 13); # DEBUG zxcx => zxcx_8 # DEBUG BEGIN_STMT vmlalbq_f16_mf8_fpm (zxcx_8, b_3(D), c_4(D), 13); # DEBUG BEGIN_STMT return a_6; [Note that we've lost debug information for zxct here.] ------ 270r.into_cfglayout ------ (insn 15 14 16 2 (set (reg:DI 109) (const_int 13 [0xd])) "fp8-debug.cc":6:41 70 {*movdi_aarch64} (nil)) (insn 16 15 17 2 (set (reg:DI 84 fpmr) (reg:DI 109)) "fp8-debug.cc":6:41 70 {*movdi_aarch64} (nil)) (insn 17 16 18 2 (set (reg:V8HF 108) (unspec:V8HF [ (reg/v:V8HF 102 [ <retval> ]) (reg/v:V16QI 104 [ b ]) (reg/v:V16QI 105 [ c ]) (reg:DI 84 fpmr) ] UNSPEC_FP8TEST)) "fp8-debug.cc":6:41 5277 {fp8test} (nil)) (insn 18 17 19 2 (set (reg/v:V8HF 101 [ zxcx ]) (reg:V8HF 108)) "fp8-debug.cc":6:41 1272 {*aarch64_simd_movv8hf} (nil)) (debug_insn 19 18 20 2 (var_location:V8HF zxcx (reg/v:V8HF 101 [ zxcx ])) "fp8-debug.cc":6:41 -1 (nil)) (debug_insn 20 19 21 2 (debug_marker) "fp8-debug.cc":7:3 -1 (nil)) (insn 21 20 22 2 (set (reg:DI 111) (const_int 13 [0xd])) "fp8-debug.cc":7:41 70 {*movdi_aarch64} (nil)) (insn 22 21 23 2 (set (reg:DI 84 fpmr) (reg:DI 111)) "fp8-debug.cc":7:41 70 {*movdi_aarch64} (nil)) (insn 23 22 24 2 (set (reg:V8HF 110) (unspec:V8HF [ (reg/v:V8HF 101 [ zxcx ]) (reg/v:V16QI 104 [ b ]) (reg/v:V16QI 105 [ c ]) (reg:DI 84 fpmr) ] UNSPEC_FP8TEST)) "fp8-debug.cc":7:41 5277 {fp8test} (nil)) ------ 271r.jump = 273r.dfinit ------ (insn 15 14 16 2 (set (reg:DI 109) (const_int 13 [0xd])) "fp8-debug.cc":6:41 70 {*movdi_aarch64} (nil)) (insn 16 15 32 2 (set (reg:DI 84 fpmr) (reg:DI 109)) "fp8-debug.cc":6:41 70 {*movdi_aarch64} (nil)) (debug_insn 32 16 31 2 (var_location:V8HF D#2 (unspec:V8HF [ (reg/v:V8HF 102 [ <retval> ]) (reg/v:V16QI 104 [ b ]) (reg/v:V16QI 105 [ c ]) (reg:DI 84 fpmr) ] UNSPEC_FP8TEST)) -1 (nil)) (debug_insn 31 32 19 2 (var_location:V8HF D#1 (debug_expr:V8HF D#2)) -1 (nil)) (debug_insn 19 31 20 2 (var_location:V8HF zxcx (debug_expr:V8HF D#1)) "fp8-debug.cc":6:41 -1 (nil)) [Last UNSPEC_FP8 instruction was optimised away here, and the previous one converted to a debug_insn. We're temporarily using fpmr directly in a debug_insn.] ------ 274r.cse1 ------ (debug_insn 34 14 33 2 (var_location:DI D#4 (const_int 13 [0xd])) -1 (nil)) (debug_insn 33 34 32 2 (var_location:DI D#3 (debug_expr:DI D#4)) -1 (nil)) (debug_insn 32 33 31 2 (var_location:V8HF D#2 (unspec:V8HF [ (reg:V8HF 106) (debug_expr:V16QI D#6) (debug_expr:V16QI D#5) (debug_expr:DI D#3) ] UNSPEC_FP8TEST)) -1 (nil)) (debug_insn 31 32 19 2 (var_location:V8HF D#1 (debug_expr:V8HF D#2)) -1 (nil)) (debug_insn 19 31 20 2 (var_location:V8HF zxcx (debug_expr:V8HF D#1)) "fp8-debug.cc":6:41 -1 (nil)) [After cse, the fpmr assignment is also optimised away, at which point the debug_insn uses a debug variable for fpmr instead.]