> > +   /* right shift & convert, losing the low bit - must clear
> > +    * high bit because there is no unsigned convert instruction */
> > +   sse2_psrld_imm(p->func, dataXMM, 1);
> > +   sse2_cvtdq2ps(p->func, dataXMM, dataXMM);
> > +
> > +   /* convert low bit to float */
> > +   sse2_pslld_imm(p->func, dataXMM2, 31);
> > +   sse2_psrld_imm(p->func, dataXMM2, 31);
> > +   sse2_cvtdq2ps(p->func, dataXMM2, dataXMM2);
>
> Is this really ideal, wouldn't something like (in horrible pseudo-code
> notation)
> dataXMM2 = dataXMM2 & CONST(0x1 vec)
> dataXMM2 = cvtdq2ps(dataXMM2)
> be faster?
> I guess though your method avoids the constant, so probably not worth
> bothering (I am actually wondering what code llvm generates for its
> UIToFP instruction or what in general the fastest way to do this is).

Well, this whole sequence is somewhat wasteful for a single bit, but it mattered in my application before I realized that it won't work anyway on too many drivers.
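For readers following along, here is a rough scalar C model of one lane of the sequence quoted above, alongside the AND-mask variant from the review. The function names are made up for illustration, and the final recombination as 2*high + low is an assumption, since the quoted hunk does not show how the two halves are added back together.

```c
#include <assert.h>
#include <stdint.h>

/* Low bit via the shift pair from the patch (pslld 31, then psrld 31). */
uint32_t low_bit_shifts(uint32_t x)
{
    return (x << 31) >> 31;
}

/* Low bit via an AND with a 0x1 constant, as suggested in the review. */
uint32_t low_bit_mask(uint32_t x)
{
    return x & 1u;
}

/* Scalar model of one 32-bit lane (hypothetical helper, not Mesa code). */
float u32_to_float_halves(uint32_t x)
{
    /* x >> 1 clears the high bit, so the value is a non-negative
     * int32_t and a signed convert (cvtdq2ps) handles it. */
    float high = (float)(int32_t)(x >> 1);

    /* The low bit is 0 or 1, which also converts safely as signed. */
    float low = (float)(int32_t)low_bit_shifts(x);

    /* ASSUMPTION: the halves are recombined as 2*high + low. */
    return high + high + low;
}

/* Small usage example with inputs whose result is exactly representable
 * or rounds the same way as a direct conversion. */
void u32_to_float_halves_selfcheck(void)
{
    assert(low_bit_shifts(7u) == low_bit_mask(7u));
    assert(low_bit_shifts(8u) == low_bit_mask(8u));
    assert(u32_to_float_halves(3u) == 3.0f);
    assert(u32_to_float_halves(0x80000000u) == 2147483648.0f);
    assert(u32_to_float_halves(0xFFFFFFFFu) == 4294967296.0f);
}
```

Both ways of isolating the low bit give the same value, so the choice is only about register use and instruction count. Under the recombination assumed here, inputs that need more than 24 significant bits are rounded more than once, so the result can differ from a single correctly rounded conversion (33554435, for example).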
I was reluctant to add another register for the 0x1 constant(*); loading it from memory each time seemed like it would take a lot of bandwidth. Alternatively, loading an immediate into one 32-bit (or smaller?) "sub register" and then copying it into the others would amount to about the same instruction count after all. Maybe there are more execution units available for doing it that way, I don't know. Also, I haven't found a way to copy a value from the lowest subregister into all the others in one (fast) instruction.

(*) The code in this file generally uses rather few registers. If adding one to keep the 0x1 is not a problem, that's likely optimal.

FWIW, the best I could get compilers to produce was four times cvtsi2ss xmm0, rax. It looks like the somewhat clever use of rax, with a value range that is not maxed out, is a hardcoded pattern for the conversion, and the compilers have no means to be more "creative" there.

With an x86 target I also saw a code sequence that splits the uint value into two 16-bit values, converts them, and then adds them after multiplying the higher-order bits by 0x10000. Repeated four times... not sure if there is a good reason for that or if it's just a compiler limitation that the code wasn't properly "SIMDed". Every one(!) of those four iterations seemed at least as expensive as the whole sequence in this patch.

Compilers tested were GCC 4.8 -O3 and Clang trunk -O3. If anybody knows an optimal sequence, I'd happily see that used instead.

> Roland
> <snip>

A patch that fixes the accidentally scalar mov follows.
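As a footnote to the compiler output described above, the 16-bit split can be sketched in scalar C roughly as follows. This is a reading of the description in this mail, not the actual compiler output, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of the 16-bit split (hypothetical helper). */
float u32_to_float_split16(uint32_t x)
{
    /* Both halves are in 0..65535, so they fit a signed convert
     * and are exactly representable as float. */
    float high = (float)(int32_t)(x >> 16);
    float low  = (float)(int32_t)(x & 0xFFFFu);

    /* Scaling by 0x10000 is a power-of-two multiply and therefore
     * exact; only the final add rounds. */
    return high * 65536.0f + low;
}

/* Small usage example comparing against the plain C conversion. */
void u32_to_float_split16_selfcheck(void)
{
    assert(u32_to_float_split16(0u) == 0.0f);
    assert(u32_to_float_split16(65536u) == 65536.0f);
    assert(u32_to_float_split16(0xFFFFFFFFu) == (float)0xFFFFFFFFu);
}
```

Since only the last addition rounds, this variant should agree with the direct conversion in the default rounding mode, which may be part of why compilers pick it despite the extra instructions.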