On Wed, Jun 05, 2013 at 10:06:08PM +0200, Segher Boessenkool wrote:
> >I also wonder whether it would be useful to have 32-bit do the vector
> >logical ops in gprs as well. At the moment, the patches don't allow it
> >(vector types must be done in the altivec/vsx registers, an TImode is
> >done by splitting the operation into 4 separate categories). On the
> >64-bit side, having __int128_t passed in GPRs, means you want to avoid
> >ping-ponging between the GPRs and VSX registers. In addition, the
> >atomic quad word support (patch #7) has to run in GPRs, so we need
> >add/subtract/logical to have versions that run in GPRs.
>
> It might work better if you added a mode V1TI for TI in vector
> regs, and then used plain TI only for GPRs. It certainly will
> make things a lot more regular; whether it actually works better,
> I have no idea.
>
> The way you have things now, only after reload the vector patterns
> are split to GPR patterns; much too late to do most optimisations
> on it. On the other hand, deciding early what register set some
> op should go to isn't too pleasant either; is it always the best
> choice to use the vector regs when possible?

It depends. For example, consider:

	#ifndef TYPE
	#define TYPE __int128_t
	#endif

	TYPE a_and (TYPE p, TYPE q)
	{
	  return p & q;
	}

	void p_and (TYPE *p, TYPE *q, TYPE *r)
	{
	  *p = *q & *r;
	}

In a_and, p and q are passed in GPRs, so you want to use the GPR-based
instructions. In p_and, it is simpler to do the operation in the VSX
registers. This is what my code from patch 4 generates:

	.L.a_and:
		and 3,3,5
		and 4,4,6
		blr

	.L.p_and:
		lxvd2x 12,0,4
		lxvd2x 0,0,5
		xxland 0,12,0
		stxvd2x 0,0,3
		blr

Unfortunately, when I added TImode in VSX registers, I didn't notice
this, and the current code generates:

	.L.a_and:
		addi 9,1,-16
		std 3,0(9)
		std 4,8(9)
		ori 2,2,0
		lxvd2x 12,0,9
		std 5,0(9)
		std 6,8(9)
		ori 2,2,0
		lxvd2x 0,0,9
		xxland 0,12,0
		stxvd2x 0,0,9
		ori 2,2,0
		ld 3,0(9)
		ld 4,8(9)
		blr

	.L.p_and:
		lxvd2x 12,0,4
		lxvd2x 0,0,5
		xxland 0,12,0
		stxvd2x 0,0,3
		blr

Previous versions (and -mno-vsx-timode) generate:

	.L.a_and:
		and 3,3,5
		and 4,4,6
		blr

	.L.p_and:
		ld 10,0(4)
		ld 9,0(5)
		and 9,10,9
		std 9,0(3)
		ld 10,8(4)
		ld 9,8(5)
		and 9,10,9
		std 9,8(3)
		blr

Note that the scheduler does not interleave the loads and the ands;
instead it does ld/ld/and/std for each doubleword.

This bouncing back and forth will get somewhat worse when support for
doing __int128_t add/subtract in the vector registers is added.

We don't want to hard-wire doing all of TImode in vector registers,
because this breaks the 8-byte atomic fetch_and_add functions (without
having to use an UNSPEC to hide the add).

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797