Three-source instructions on i965 have the annoying property that they cannot use immediate operands. They do have the alluring property that they perform multiple operations in basically the same number of cycles as any other instruction. But when the arguments are immediates, we decided that a MOV+MAD is basically going to be the same as a MUL+ADD (with immediates).
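To make that comparison concrete, here is a rough sketch (not code from the series) that counts instructions for a scalarized imm * x + y under both lowerings. The function names and the textual instruction format are made up for illustration. It assumes the naive case where every MAD gets its own MOV, and that MAD computes src1 * src2 + src0.

```python
# Hypothetical model, not Mesa code: instructions are plain strings.


def lower_mul_add(components, imm):
    """MUL and ADD are two-source, so the immediate is used directly."""
    insts = []
    for c in range(components):
        insts.append(f"mul tmp{c}, x{c}, {imm}F")
        insts.append(f"add dst{c}, tmp{c}, y{c}")
    return insts


def lower_mov_mad(components, imm):
    """MAD is three-source, so the immediate is first moved to a register.

    Assumption: each MAD loads its own copy of the immediate.
    """
    insts = []
    for c in range(components):
        insts.append(f"mov imm{c}, {imm}F")
        # Assumed operand order: dst = src1 * src2 + src0.
        insts.append(f"mad dst{c}, y{c}, x{c}, imm{c}")
    return insts


if __name__ == "__main__":
    for n in (1, 4):
        print(n, len(lower_mul_add(n, 2.0)), len(lower_mov_mad(n, 2.0)))
```

Under these assumptions the two counts are equal for each size, which is the sense in which MOV+MAD was judged no better than MUL+ADD.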
There are two things we didn't consider. First, Gen 7 hardware can co-issue some instructions (ADD, MUL, MAD included) if they're not using immediates, so MOV+MAD is probably better in practice. Second, more often than not immediates are used multiple times. For example, in 2.0 * vec4 + 1.0 we don't actually need to load each constant four times. 2 MOVs + 4 MADs would be better than 4 MULs and 4 ADDs, especially when co-issuing is considered.

This series adds some infrastructure to the control flow graph, including code to create the dominance tree, which I use to figure out where to place MOV immediate instructions. It then adds a pass that runs after optimizations to collect immediates and selectively promote some to registers. The immediates are packed 8x per register. The last patch lets us emit MAD instructions unconditionally, safe in the knowledge that the constant-combining pass will clean things up for us.

The series works and passes piglit. It also cuts more than 3% of instructions in affected programs, including huge reductions in select programs. But there's some work to do before it'll be finished. Since review is so hard to come by these days, I'm hoping people will have managed to take a look by the time I've solved the remaining problems.

The remaining to-do items are:

 - Figure out if MAD instructions still co-issue if operands aren't aligned (e.g., mad dst.0, src0.0, src1.0, src2.3). If they don't, figure out whether packing operands is beneficial at all.
 - Probably a bottom-up instruction scheduling pass to help sink MOV-imm. (Currently losing a bunch of SIMD16 programs, I expect because of this.)
 - Modify instruction scheduler to estimate clock cycles.
 - Make shader-db handle this data.
 - Add a pass to insert destination dependency hints into the FS, now that we're loading constants into the same register using mov(1).
 - Emit 4x constants at once with the :VF type. (:V/:UV can't help us load 8x floats at once, unfortunately.)
 - Probably attempt some other constant loading tricks. I found a shader that loads 0.1, 0.2, ..., 0.8, 0.9. We could load 2.0-9.0 with two VF loads, 0.1 with a mov(1) and then do a mul(8), instead of 9 mov(1).
 - Some opt_algebraic on MADs, now that their arguments can be immediates in the IR.
 - Probably even some code to break MADs into MUL+ADD when many MADs perform the same multiplication.

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev