On Fri, 7 Jul 2023, Richard Biener wrote: > > The bb-slp-pr95839.c test assumes quad-single float vector support, but > > some targets only support pairs of floats, causing this test to fail > > with such targets. Limit this test to targets that support at least > > 128-bit vectors then, and add a complementing test that can be run with > > targets that have support for 64-bit vectors only. There is no need to > > adjust bb-slp-pr95839-2.c as 128 bits are needed even for the smallest > > vector of doubles, so support is implied by the presence of vectors of > > doubles. > > I wonder why you see the testcase FAIL, on x86-64 when doing > > typedef float __attribute__((vector_size(32))) v4f32; > > v4f32 f(v4f32 a, v4f32 b) > { > /* Check that we vectorize this CTOR without any loads. */ > return (v4f32){a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3], > a[4] + b[4], a[5] + b[5], a[6] + b[6], a[7] + b[7]}; > } > > I see we vectorize the add and the "store". We fail to perform > extraction from the incoming vectors (unless you enable AVX), > that's a missed optimization. > > So with paired floats I would expect sth similar? Maybe > x86 is saved by kind-of-presence (but disabled) of V8SFmode vectors.
I am not familiar enough with this stuff to answer your question. As we pass and return V2SF data in FP registers just as with complex float data with this hardware the function from my bb-slp-pr95839-v8.c expands to a single vector FP add instruction, followed by a function return. Conversely, the original function from bb-slp-pr95839.c expands to a sequence of 22 instructions to extract incoming vector FP data from 4 64-bit GPRs into 8 FPRs, add the vectors piecemeal with 4 scalar FP add instructions, and then insert outgoing vector FP data from 4 FPRs back to 2 64-bit GPRs. As an experiment I have modified the backend minimally so as to pass and return V4SF data in FP registers as well, but that didn't make the vectoriser trigger. > That said, we should handle this better so can you file an > enhancement bugreport for this? Filed as PR -optimization/110630. I can't publish RISC-V information related to the hardware affected, but as a quick check I ran the MIPS compiler: $ mips-linux-gnu-gcc -march=mips64 -mabi=64 -mpaired-single -O2 -S bb-slp-pr95839*.c and got this code for bb-slp-pr95839-v8.c (mind the branch delay slot): jr $31 add.ps $f0,$f12,$f13 vs code for bb-slp-pr95839.c: daddiu $sp,$sp,-64 sd $5,24($sp) sd $7,40($sp) lwc1 $f0,24($sp) lwc1 $f1,40($sp) sd $4,16($sp) sd $6,32($sp) add.s $f3,$f0,$f1 lwc1 $f0,28($sp) lwc1 $f1,44($sp) lwc1 $f4,36($sp) swc1 $f3,56($sp) add.s $f2,$f0,$f1 lwc1 $f0,16($sp) lwc1 $f1,32($sp) swc1 $f2,60($sp) add.s $f1,$f0,$f1 lwc1 $f0,20($sp) ld $3,56($sp) add.s $f0,$f0,$f4 swc1 $f1,48($sp) swc1 $f0,52($sp) ld $2,48($sp) jr $31 daddiu $sp,$sp,64 so this is essentially the same scenario (up to the machine instruction count), and therefore it seems backend-agnostic. I can imagine the latter case could expand to something like (instruction reordering surely needed for performance omitted for clarity): dmtc1 $4,$f0 dmtc1 $5,$f1 dmtc1 $6,$f2 dmtc1 $7,$f3 add.ps $f0,$f0,$f1 add.ps $f2,$f2,$f3 dmfc1 $2,$f0 jr $31 dmfc1 $3,$f2 saving a lot of cycles, and removing the need for spilling temporaries to the stack and for frame creation in the first place. Do you agree it still makes sense to include bb-slp-pr95839-v8.c with the testsuite? Maciej