On Tue, Jul 11, 2023 at 5:01 PM Maciej W. Rozycki <ma...@embecosm.com> wrote:
>
> On Fri, 7 Jul 2023, Richard Biener wrote:
>
> > >  The bb-slp-pr95839.c test assumes quad-single float vector support, but
> > > some targets only support pairs of floats, causing this test to fail
> > > with such targets.  Limit this test to targets that support at least
> > > 128-bit vectors then, and add a complementing test that can be run with
> > > targets that have support for 64-bit vectors only.  There is no need to
> > > adjust bb-slp-pr95839-2.c as 128 bits are needed even for the smallest
> > > vector of doubles, so support is implied by the presence of vectors of
> > > doubles.
> >
> > I wonder why you see the testcase FAIL, on x86-64 when doing
> >
> > typedef float __attribute__((vector_size(32))) v4f32;
> >
> > v4f32 f(v4f32 a, v4f32 b)
> > {
> >   /* Check that we vectorize this CTOR without any loads.  */
> >   return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3],
> >                  a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]};
> > }
> >
> > I see we vectorize the add and the "store".  We fail to perform
> > extraction from the incoming vectors (unless you enable AVX),
> > that's a missed optimization.
> >
> > So with paired floats I would expect sth similar?  Maybe
> > x86 is saved by kind-of-presence (but disabled) of V8SFmode vectors.
>
>  I am not familiar enough with this stuff to answer your question.
>
>  As we pass and return V2SF data in FP registers just as with complex
> float data with this hardware the function from my bb-slp-pr95839-v8.c
> expands to a single vector FP add instruction, followed by a function
> return.
>
>  Conversely, the original function from bb-slp-pr95839.c expands to a
> sequence of 22 instructions to extract incoming vector FP data from 4
> 64-bit GPRs into 8 FPRs, add the vectors piecemeal with 4 scalar FP add
> instructions, and then insert outgoing vector FP data from 4 FPRs back to
> 2 64-bit GPRs.  As an experiment I have modified the backend minimally so
> as to pass and return V4SF data in FP registers as well, but that didn't
> make the vectoriser trigger.
>
> > That said, we should handle this better so can you file an
> > enhancement bugreport for this?
>
>  Filed as PR -optimization/110630.

Thanks!

>  I can't publish RISC-V information
> related to the hardware affected, but as a quick check I ran the MIPS
> compiler:
>
> $ mips-linux-gnu-gcc -march=mips64 -mabi=64 -mpaired-single -O2 -S
> bb-slp-pr95839*.c
>
> and got this code for bb-slp-pr95839-v8.c (mind the branch delay slot):
>
>         jr      $31
>         add.ps  $f0,$f12,$f13
>
> vs code for bb-slp-pr95839.c:
>
>         daddiu  $sp,$sp,-64
>         sd      $5,24($sp)
>         sd      $7,40($sp)
>         lwc1    $f0,24($sp)
>         lwc1    $f1,40($sp)
>         sd      $4,16($sp)
>         sd      $6,32($sp)
>         add.s   $f3,$f0,$f1
>         lwc1    $f0,28($sp)
>         lwc1    $f1,44($sp)
>         lwc1    $f4,36($sp)
>         swc1    $f3,56($sp)
>         add.s   $f2,$f0,$f1
>         lwc1    $f0,16($sp)
>         lwc1    $f1,32($sp)
>         swc1    $f2,60($sp)
>         add.s   $f1,$f0,$f1
>         lwc1    $f0,20($sp)
>         ld      $3,56($sp)
>         add.s   $f0,$f0,$f4
>         swc1    $f1,48($sp)
>         swc1    $f0,52($sp)
>         ld      $2,48($sp)
>         jr      $31
>         daddiu  $sp,$sp,64
>
> so this is essentially the same scenario (up to the machine instruction
> count), and therefore it seems backend-agnostic.  I can imagine the latter
> case could expand to something like (instruction reordering surely needed
> for performance omitted for clarity):
>
>         dmtc1   $4,$f0
>         dmtc1   $5,$f1
>         dmtc1   $6,$f2
>         dmtc1   $7,$f3
>         add.ps  $f0,$f0,$f1
>         add.ps  $f2,$f2,$f3
>         dmfc1   $2,$f0
>         jr      $31
>         dmfc1   $3,$f2
>
> saving a lot of cycles, and removing the need for spilling temporaries to
> the stack and for frame creation in the first place.
>
>  Do you agree it still makes sense to include bb-slp-pr95839-v8.c with the
> testsuite?

Sure, more coverage is always nice.

Richard.

>   Maciej
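
For readers following the thread, the 64-bit versus 128-bit distinction can be illustrated with GCC's vector extensions.  The sketch below is only an assumption about what a paired-float analogue of the quoted function might look like; it is not the content of bb-slp-pr95839-v8.c, and the names v2f32, add_v2f32 and add_v4f32 are hypothetical.  It needs a compiler that implements the GCC vector extensions (GCC or Clang).

```c
/* Hypothetical sketch: names are illustrative, not from the GCC testsuite.
   Requires GCC or Clang vector extensions.  */

/* Two floats packed in a 64-bit vector (V2SF).  */
typedef float __attribute__((vector_size(8))) v2f32;

/* Four floats packed in a 128-bit vector (V4SF).  */
typedef float __attribute__((vector_size(16))) v4f32;

/* Build the result from a constructor of element-wise sums, the same
   shape as the function quoted earlier in the thread.  */
v2f32 add_v2f32(v2f32 a, v2f32 b)
{
  return (v2f32){a[0] + b[0], a[1] + b[1]};
}

/* The same pattern for a 128-bit vector; on a target that only has
   64-bit float vectors this covers two vector registers' worth of data.  */
v4f32 add_v4f32(v4f32 a, v4f32 b)
{
  return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
```

Compiling both functions with -O2 -S for the target of interest and comparing the generated assembly, as done with the MIPS compiler above, should show whether the constructor is turned into vector adds or handled piecemeal with scalar adds.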