Ping ping. I think there should be enough information in the first test to show that any "set to self" registers become live. Let me know if there's anything I've missed.
Thanks,
Alan.

> On 12 Dec 2017, at 11:11, Alan Hayward <alan.hayw...@arm.com> wrote:
>
> Ping.
>
>> On 30 Nov 2017, at 11:03, Alan Hayward <alan.hayw...@arm.com> wrote:
>>
>>
>>> On 27 Nov 2017, at 17:29, Jeff Law <l...@redhat.com> wrote:
>>>
>>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>>>
>>>>> On 22 Nov 2017, at 17:33, Jeff Law <l...@redhat.com> wrote:
>>>>>
>>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>>>
>>>>>>> On 21 Nov 2017, at 03:13, Jeff Law <l...@redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd
>>>>>>>>> totally forgotten about it. And in fact it seems to come pretty close
>>>>>>>>> to what you need…
>>>>>>>>
>>>>>>>> Yes, some of the code is similar to the way
>>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>>> CLOBBER expr code served as a starting point for writing the patch.
>>>>>>>> The main difference here is that _PART_CLOBBERED applies around all
>>>>>>>> calls and is not tied to a specific instruction; it's part of the
>>>>>>>> calling ABI. Whereas clobber_high is explicitly tied to an expression
>>>>>>>> (tls_desc). That meant there wasn't really any opportunity to reuse
>>>>>>>> any existing code.
>>>>>>> Understood. Though your first patch mentions that you're trying to
>>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>>> represented as normal insns, not call_insn.
>>>>>>>
>>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>>> the register.
>>>>>>>
>>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>>> clobbering the upper part, and at least in theory you can implement it
>>>>>>> in the aarch64 backend and the rest of the compiler should "just work",
>>>>>>> because that's the existing semantics of a subreg store.
>>>>>>>
>>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>>> wrong given the semantics of a partial store.
>>>>>>>
>>>>>>
>>>>>> So, instead of using clobber_high (reg X), use set (reg X, reg X).
>>>>>> It's something we considered, and then dismissed.
>>>>>>
>>>>>> The problem then is that you are now using SET semantics on those
>>>>>> registers, and that would make the register live around the function,
>>>>>> which might not be the case. Whereas clobber semantics will just make
>>>>>> the register dead - which is exactly what we want (but only
>>>>>> conditionally).
>>>>> ?!? A set of the subreg is the *exact* semantics you want. It says the
>>>>> low part is preserved while the upper part is clobbered across the TLS
>>>>> insns.
>>>>>
>>>>> jeff
>>>>
>>>> Consider where the TLS call is inside a loop. The compiler would normally
>>>> want to hoist that out of the loop. By adding a set (x,x) into the
>>>> parallel of the tls_desc we are now making x live across the loop: x is
>>>> dependent on the value from the previous iteration, and the tls_desc can
>>>> no longer be hoisted.
>>> Hmm. I think I see the problem you're trying to point out. Let me
>>> restate it and see if you agree.
>>>
>>> The low subreg set does clearly indicate the upper part of the SVE
>>> register is clobbered. The problem is that, from a liveness standpoint,
>>> the compiler is considering the low part live, even though it's a
>>> self-set.
>>>
>>
>> Yes.
>>
>>> In fact, if that is the case, then a single TLS call (independent of a
>>> loop) would make the low part of the register globally live. This
>>> should be testable. Include one of these low-part self-sets on the
>>> existing TLS calls, compile a little test function, and let's look at
>>> the liveness data.
>>>
>>
>> Ok, so I've put
>>   (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
>>   (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
>> etc. up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all
>> the neon registers.
>>
>> In an ideal world, this change would do nothing, as the neon reg size is TI.
>>
>> First I compiled sve_tls_preserve_1.c (Test1 below).
>>
>> Without the SET changes, gcc does not back up any neon registers before
>> the tlsdesccall (good).
>> With the SET changes, gcc backs up all the neon registers (not ideal).
>>
>> Without the SET changes, the dump logs give:
>>
>> cse1:
>> ;; regs ever live  0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
>>
>> cprop1:
>> ;; lr in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr use  29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr def  0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live gen  0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> reload:
>> ;; regs ever live  0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 66 [cc]
>>
>> With the SET changes:
>>
>> cse1:
>> ;; regs ever live  0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>> cprop1:
>> ;; lr in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr use  29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr def  0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live gen  0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> reload:
>> ;; regs ever live  0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>>
>> Is there any extra dump data you think would be useful?
>>
>>
>> For completeness, I also ran through some more tests…
>>
>>
>> Second, I tried putting the tls call inside a simple for loop (Test2 below).
>>
>> Both before and after the SET changes, gcc hoisted the tlsdesccall out of
>> the loop. This has already happened before the expand pass.
>> (And as before, with the SET changes, gcc backs up all the neon registers
>> before the tlsdesccall.)
>>
>>
>> Third, I made the tls variable an array, and accessed elements from it in
>> the loop (Test3 below).
>>
>> Both before and after the SET changes, gcc fails to hoist the tlsdesccall.
>> That's a missed optimisation opportunity - I'm not sure how tricky or
>> worthwhile that would be to fix (at the tree level the tls load looks just
>> like a MEM base-plus-index access; I guess you'd need something specific
>> in an rtl pass to do the hoist).
>> Anyway, we now have a tlsdesc inside a loop, which is what I wanted.
>> Without the SET changes, nothing is backed up whilst inside the loop.
>> With the SET changes, gcc tries to back up the live neon registers, and
>> ICEs with "error: unable to find a register to spill" in lra-assigns.c.
>>
>>
>> Fourth, I took the original test and made two tls accesses in a row with
>> an asm volatile in between (Test4 below).
>>
>> With the SET changes, gcc ICEs in the same way.
>> Looking at the dump files:
>>
>> Without the SET changes, the dump logs for the final test are similar to
>> the very first test:
>>
>> ;; regs ever live  0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>> ;; lr in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr use  29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr def  0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live gen  0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> With the SET changes:
>>
>> ;; regs ever live  0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
>>
>> ;; lr in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr use  29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr def  0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live in   29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live gen  0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out  29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>>
>>
>> Test1:
>>
>> __thread v4si tx;
>>
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>>   v4si y;
>>
>>   y = a + tx + b + c;
>>
>>   return y + 7;
>> }
>>
>>
>> Test2:
>>
>> __thread v4si tx;
>>
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>>   v4si y;
>>
>>   for (int i = 0; i < MAX; i++)
>>     y += a + tx + b[i] + c[i];
>>
>>   return y + 7;
>> }
>>
>>
>> Test3:
>>
>> __thread v4si tx[MAX];
>>
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>>   v4si y;
>>
>>   for (int i = 0; i < MAX; i++)
>>     y += a + tx[i] + b[i] + c[i];
>>
>>   return y + 7;
>> }
>>
>>
>> Test4:
>>
>> __thread v4si tx;
>> __thread v4si ty;
>>
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>>   v4si y;
>>
>>   y = a + tx + b + c;
>>
>>   asm volatile ("" : : : "memory");
>>
>>   y += a + ty + b + c;
>>
>>   return y + 7;
>> }
>>
>>
>>>
>>> Now it could be the case that various local analyses could sub-optimally
>>> handle things. You mention LICM. I know our original LICM did have a
>>> problem in that if it saw a use of a hard reg in a loop without seeing a
>>> set of that hard reg, it considered the register varying within the loop.
>>> I have no idea if we carried that forward when the loop code was
>>> rewritten (when I looked at this it was circa 1992).
>>>
>>>
>>>>
>>>> Or consider a stream of code containing two tls_desc calls (OK, the
>>>> compiler might optimise one of the tls calls away, but this approach
>>>> should be reusable for other exprs).
>>>> Between the two set (x,x)'s, x is considered live, so the register
>>>> allocator can't use that register.
>>>> Given that we are applying this to all the neon registers, the register
>>>> allocator now throws an ICE because it can't find any free hard neon
>>>> registers to use.
>>> Given your statements it sounds like the liveness infrastructure is
>>> making those neon regs globally live when it sees the low-part subreg
>>> self-set. Let's confirm that one way or the other and see where it
>>> takes us.
>>>
>>> Jeff
>>
>