On Fri, Oct 24, 2014 at 1:45 AM, Francisco Jerez <curroje...@riseup.net> wrote:
> Matt Turner <matts...@gmail.com> writes:
>
>> When I implemented these built-ins couple of years ago, I thought there
>> must be a neat way to optimize them. I tried a couple of things with the
>> different vector immediates i965 provides, but the V/UV types are too
>> small to represent the appropriate shift values, and shift instructions
>> can't shift by a floating-point source (if using the vector float imm).
>>
>> Curro pointed out that I could actually load the integer shift values
>> with VF immediate just by doing a type converting move. How simple.
>>
>> So anyway, these optimizations are of pretty negligible value, except
>> for maybe demonstrating that VF works. I've had them sitting on a branch
>> for months, so time for them to live somewhere else. At least we can
>> disassemble VF immediates now.
>>
>> I hope to have some more uses of VF immediates soon too.
>
> Hi Matt,
>
> a different approach I had in mind was to write an optimization pass
> that would vectorize immediate moves by using VF where the original
> arguments can be represented exactly as an 8-bit float. That would
> probably help in many more cases than hand-optimizing a couple of
> built-in operations -- That said, I guess it doesn't hurt to do this for
> the time being until we have such an optimization pass.

A pass to emit VF immediates rather than 4x immediate moves seems like
a good plan. Eric tried it (see the vf-immediates branch of his tree)
but he ran into some problems -- specifically what do you do for
constant folding when the result isn't also representable in VF. I
think his approach to emit VF immediates while generating the backend
IR might lend itself to more problems than, say, having an
optimization that runs after everything else and changes some
immediate moves to VF moves.

Unfortunately a pass to do just this wouldn't be anywhere near
sufficient to optimize these built-ins. You'd also need to have a pass
to combine multiple scalar operations into vector operations, which
I've implemented in the GLSL compiler but only for operations on
different components of the same variable. For these built-ins, the
vectorization pass would have to combine multiple scalar operations
operating on /different/ registers, which sounds like a really hard
problem.

I've noticed a place this would help in at least one real vertex
shader -- it did this (with some intervening instructions):

dp3(8) g14<1>.xF g9<4,4,1>.xyzzF g9<4,4,1>.xyzzF
dp3(8) g17<1>.xF g10<4,4,1>.xyzzF g10<4,4,1>.xyzzF
dp3(8) g20<1>.xF g11<4,4,1>.xyzzF g11<4,4,1>.xyzzF
math sqrt(8) g16<1>F g14<4,4,1>.xF null
math sqrt(8) g25<1>F g20<4,4,1>.xF null
math sqrt(8) g19<1>F g17<4,4,1>.xF null

We could have done those (independent!) dp3's into the .xyz channels
of a register and then just done a single sqrt instruction. It would
have cut two instructions, but it also would have added a bunch of
extra dependencies between otherwise independent instructions and
would have lengthened some live ranges.

So what I'm saying is -- yeah, having a pass that optimized what we
have now into what we have after this series in a generic way would be
great! Unfortunately it would also be an immense amount of work that
might not end up being anything more than an open ended research
project. If only I could find my magic wand...
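
For what it's worth, the "representable exactly as an 8-bit float" test
that such a pass would hinge on can be sketched in a few lines of C.
This is only a sketch under stated assumptions: it assumes each VF
component is 1 sign bit, 3 exponent bits with a bias of 3, and 4
mantissa bits, and that component 0 lives in the low byte of the 32-bit
immediate. The names float_to_vf and pack_vf4 are made up for
illustration and are not taken from the tree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. Assumed VF component layout: 1 sign bit,
 * 3 exponent bits (bias 3), 4 mantissa bits.
 * Returns the 8-bit encoding, or -1 if f is not exactly representable. */
static int float_to_vf(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret without aliasing issues */

    uint32_t sign   = bits >> 31;
    uint32_t exp32  = (bits >> 23) & 0xff;
    uint32_t mant32 = bits & 0x7fffff;

    /* +0.0 and -0.0 take the all-zero exponent/mantissa pattern. */
    if (exp32 == 0 && mant32 == 0)
        return (int)(sign << 7);

    /* Only the top 4 mantissa bits survive; the low 19 must be zero. */
    if (mant32 & 0x7ffff)
        return -1;

    /* Rebias from 127 to 3. Denormals, infinities and NaNs fall out here. */
    int exp = (int)exp32 - (127 - 3);
    if (exp < 0 || exp > 7)
        return -1;

    int vf = (int)(sign << 7) | (exp << 4) | (int)(mant32 >> 19);

    /* 0.125 would encode to the same pattern as 0.0, so reject it. */
    if ((vf & 0x7f) == 0)
        return -1;

    return vf;
}

/* Hypothetical helper. Packs four floats into one 32-bit immediate,
 * component 0 in the low byte (assumed ordering).
 * Returns 1 on success, 0 if any component is not representable. */
static int pack_vf4(const float v[4], uint32_t *out)
{
    assert(v != NULL && out != NULL);

    uint32_t packed = 0;
    for (int i = 0; i < 4; i++) {
        int vf = float_to_vf(v[i]);
        if (vf < 0)
            return 0;
        packed |= (uint32_t)vf << (8 * i);
    }
    *out = packed;
    return 1;
}
```

Under those assumptions the shift values 0, 8, 16 and 24 encode to
0x00, 0x60, 0x70 and 0x78, so pack_vf4 gives 0x78706000 for them, while
something like 0.1 is rejected. A pass built on this would fall back to
the ordinary immediate moves whenever pack_vf4 returns 0, which is also
the case the constant folding problem above runs into.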

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev