On 15.04.2014 02:52, Andreas Hartmetz wrote:
>>> +   /* right shift & convert, losing the low bit - must clear
>>> +    * high bit because there is no unsigned convert instruction */
>>> +   sse2_psrld_imm(p->func, dataXMM, 1);
>>> +   sse2_cvtdq2ps(p->func, dataXMM, dataXMM);
>>> +
>>> +   /* convert low bit to float */
>>> +   sse2_pslld_imm(p->func, dataXMM2, 31);
>>> +   sse2_psrld_imm(p->func, dataXMM2, 31);
>>> +   sse2_cvtdq2ps(p->func, dataXMM2, dataXMM2);
>>
>> Is this really ideal, wouldn't something like (in horrible pseudo-code
>> notation)
>> dataXMM2 = dataXMM2 & CONST(0x1 vec)
>> dataXMM2 = cvtdq2ps(dataXMM2)
>> be faster?
>> I guess though your method avoids the constant, so probably not worth
>> bothering (I am actually wondering what code llvm generates for its
>> UIToFP instruction or what in general the fastest way to do this is).
>>
> Well, this whole sequence is somewhat wasteful for a single bit, but
> it mattered in my application before I realized that it won't work
> anyway on too many drivers.
>
> I was reluctant to add another register for the 0x1 constant(*);
> loading it from memory each time seemed like it would take a lot of
> bandwidth. Alternatively, loading an immediate into one 32 bit
> (or smaller?) "sub register" and then copying it into the others
> would amount to about the same instruction count after all.
> Maybe there are more execution units available for doing it that way,
> I don't know. Also I haven't found how to actually copy a value from
> the lowest subregister into all others in one (fast) instruction.
>
> (*) The code in this file generally uses rather few registers. If
> adding one to keep the 0x1 is not a problem, that's likely optimal.
>
> FWIW, the best I could get compilers to produce was four times
> cvtsi2ss xmm0, rax - looks like the somewhat clever use of rax
> with a non maxed out value range is a hardcoded pattern for the
> conversion and the compilers have no means to be more "creative"
> there. With x86 target I also saw a code sequence splitting the uint
> value into two 16 bit values, converting them and then adding
> them after multiplying the higher order bits by 0x10000. Repeated
> four times... not sure if there is a good reason for that or if it's
> just a compiler limitation that the code wasn't properly "SIMDed".
> Every one(!) of those four iterations seemed at least equally
> expensive to the whole sequence in this patch.
> Compilers tested were GCC 4.8 -O3 and Clang trunk -O3.
>
> If anybody knows an optimal sequence I'd happily see that used
> instead.

I'm not sure what your code looked like, but compilers usually aren't
very good at auto-vectorization...
I suspect the sequence using two 16-bit values is probably the only
solution if you want to do this correctly and fully vectorized. I missed
this previously, but your code won't quite do the rounding correctly in
all cases (and the C compiler has to follow correct round-to-nearest for
int->float conversion). So a comment saying that this doesn't always give
the most exact possible float value (for values > 2^25) would be nice.
It is really annoying that there are no uint->fp conversion
instructions :-(.
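
To make the rounding point concrete, here is a minimal scalar sketch in C
of what a single lane does in each approach. It is not the Mesa code: the
function names are made up for illustration, and the quoted hunk does not
show how the two converted halves get recombined, so the sketch assumes
2 * (x >> 1) + (x & 1). It also assumes float arithmetic is evaluated in
single precision (as with SSE math on x86-64) and round-to-nearest mode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the patch's approach, one lane at a time.
 * ASSUMPTION: the halves are recombined as 2 * (x >> 1) + (x & 1). */
static float u32_to_float_halved(uint32_t x)
{
    /* Signed convert of the top 31 bits; rounds once x >> 1 >= 2^24. */
    float high = (float)(int32_t)(x >> 1);
    float low = (float)(int32_t)(x & 1u);
    float doubled = high + high;   /* exact: just an exponent bump */
    return doubled + low;          /* possible second rounding */
}

/* Model of the 16-bit split: both halves convert exactly, the scale by
 * 0x10000 is exact, so the final add is the only rounding step. */
static float u32_to_float_split16(uint32_t x)
{
    float high = (float)(int32_t)(x >> 16);
    float low = (float)(int32_t)(x & 0xffffu);
    float scaled = high * 65536.0f;
    return scaled + low;
}

/* Usage example: 2^25 + 3 lies between the floats 2^25 and 2^25 + 4. */
static void check_examples(void)
{
    uint32_t x = 33554435u;                           /* 2^25 + 3 */
    assert(u32_to_float_split16(x) == 33554436.0f);   /* nearest float */
    assert(u32_to_float_halved(x) == 33554432.0f);    /* one step low */
}
```

In the halved version the tie at (x >> 1) = 2^24 + 1 is rounded down
first, and the low bit is then lost in the final add, which is the double
rounding I mean above. The split version only rounds in the final add.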

Roland