On Thu, 2019-01-10 at 13:18 -0600, Jason Ekstrand wrote:
> Topi just asked me on IRC what I thought about handling 16-bit
> booleans on Intel hardware in the light of the 1-bit boolean stuff.
> The current state of the driver is that we use the
> nir_lower_bool_to_int32 pass to produce NIR that looks basically
> identical to the NIR we were getting in the back-end before. This
> lets us kick the can down the road a bit but I alluded in the 1-bit
> boolean series to ideas of doing something more Intel-specific.
> Instead of answering on IRC, I thought I'd send a mesa-dev mail so
> that we can have a more universal discussion.
>
> ## The problem:
>
> On Intel hardware, comparison operations generate two results: a flag
> result which goes straight into the flag register and a destination
> result which goes into the GRF pointed to by the CMP instruction's
> destination. The flag result can be thought of as either a 32-bit
> bitfield scalar (in the sense of one for all threads) or as a
> per-thread 1-bit value. The GRF value is a per-thread value whose
> size matches that of the execution size of the instruction. If you're
> comparing two 64-bit integers or floats, it produces a 64-bit value
> (though I believe the top 32 bits are garbage). On a 32, 16, or
> 8-bit comparison, it produces a 32, 16, or 8-bit boolean respectively.
> The only reason why D3D booleans have historically been a good match
> for our hardware is because we've historically only really cared
> about 32-bit values. With 64-bit types, we could just do a
> conversion and write it off as "64-bit is expensive." In the new
> world of 8 and 16-bit types, however, that doesn't make nearly as
> much sense.
>
> ## Solutions:
>
> The real question is what size we should make booleans in the
> back-end. There are many different possible answers to this question
> but whatever happens, it should probably happen in NIR so that we can
> make choices while we're still in SSA.
> I've considered a few different ideas on what we could do:
>
> 1. Make everything 16-bit. 8-bit is clumsy because of the weird
> stride requirements but 32 and 64-bit can trivially be converted to
> 16-bit with a strided integer MOV. For the few places where we need
> an actual 32-bit bool (b2f), a signed integer up-cast will do the
> trick. For that matter, just using a mixed-size AND with W type for
> the bool and D type for the 0x3f800000 might do the trick and keep
> it one instruction.
>
> 2. Use the "native" boolean size for all comparison operations and
> then, whenever we need to combine booleans via iand, bcsel, or a phi,
> you make the result the smallest of the sources and insert
> extract_u16 or extract_u32 to do a down-cast for the larger sources.
> (We want the extract opcodes so we get a strided MOV which the
> back-end can more easily eliminate.)
>
> 3. Don't use comparison destinations at all and treat the flag as a
> 32 or 16-bit value (depending on dispatch width). You can do a
> boolean AND by just ANDing flag results and you have to write into
> the flag at the end in order to use it. This idea is a bit on the
> crazy side but it's interesting to think about.
>
> If idea 1 actually works, it would reduce register pressure a decent
> bit which would be a very good thing. However, I'm not sure how well
> we'll actually be able to optimize with it.
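
To make the b2f trick from idea 1 concrete, here is a minimal Python sketch of the bit arithmetic only. It is not driver code: the function names are made up for illustration, and it assumes the 16-bit boolean is 0x0000 or 0xFFFF and that the W-typed source gets sign-extended to 32 bits before the AND with the D-typed constant.

```python
import struct

ONE_F32_BITS = 0x3F800000  # IEEE-754 bit pattern of the float 1.0


def sign_extend_16_to_32(value16):
    """Sign-extend a 16-bit pattern to 32 bits (models a W -> D up-cast)."""
    value16 &= 0xFFFF
    if value16 & 0x8000:
        return value16 | 0xFFFF0000
    return value16


def b2f_from_bool16(bool16):
    """Hypothetical model of b2f on a 16-bit boolean (0x0000 or 0xFFFF).

    Assumption: the AND sees the boolean sign-extended to 32 bits, so
    true (0xFFFF) becomes 0xFFFFFFFF and keeps every bit of the constant.
    """
    bits = sign_extend_16_to_32(bool16) & ONE_F32_BITS
    # Reinterpret the 32-bit integer pattern as a float.
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    print(b2f_from_bool16(0xFFFF))
    print(b2f_from_bool16(0x0000))
```

Under those assumptions a true boolean selects the bit pattern of 1.0 and a false boolean yields 0.0, which is why a single AND could stand in for an up-cast followed by a conversion.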

I have 1) implemented (I was planning to send that series for review
after we land the 8-bit and 16-bit series). I think it is working quite
well for me, but of course I only have the CTS tests to play with. Here
are some numbers:

VK_KHR_shader_float16_int8 branch:

                                 |  SIMD8  | SIMD16 |
-----------------------------------------------------
spirv_assembly.type.scalar.i8.*  |  19,725 |  2,044 |
spirv_assembly.type.scalar.i16.* |  35,504 |  3,650 |
instruction.graphics.float16.*   | 305,129 | 29,760 |
builtin.precision*.comparison.*  |         |  2,284 |

VK_KHR_shader_float16_int8 + 8-bit/16-bit booleans:

                                 |  SIMD8  | SIMD16 |
-----------------------------------------------------
spirv_assembly.type.scalar.i8.*  |  19,718 |  2,043 |
spirv_assembly.type.scalar.i16.* |  35,369 |  3,645 |
instruction.graphics.float16.*   | 302,764 | 29,627 |
builtin.precision*.comparison.*  |    -    |  2,144 |

I see benefits across the board. It is not a huge improvement, but
there is some.

Getting 8-bit booleans to produce a lower number of instructions took a
bit more work because the hardware doesn't support 8-bit immediates, so
the usual comparisons with 0 that we emit (specifically for bcsel) would
produce worse code since we would need to emit a MOV to a VGRF to handle
the constant argument. I fixed this by using a MOV.NZ to write the flag,
and then I had to patch the CSE pass to work on MOV instructions with a
NULL destination, which should be safe. With that it is about the same
as 32-bit booleans, with maybe 1 or 2 instructions less in a few shaders
I was playing with, so I think it is probably worth a try. I'd
definitely do this for 16-bit booleans at least.

Another thing, the boolean lowering I have is not perfect.
When we find the need to make canonical booleans, or when we have undef
operands, I just take an easy way out, but I think we could probably do
something smarter in some cases to reduce the number of conversions,
which should allow us to produce even better instruction counts.

I've just uploaded a branch here if you want to look at the
implementation I have right now (patches at the tip of the branch, on
top of the float16/int8 implementation):

https://github.com/Igalia/mesa/tree/itoral/VK_KHR_shader_float16_int8_1bit_bool

> Regardless of what we do, we'll need some new NIR instructions. I
> think that was more Topi's direct question. I think the easiest
> thing to do would be to make 16 or 64-bit versions of the comparison
> instructions we have today. We could make the binop_compare helper
> in nir_opcodes.py just generate versions of the opcodes at all the
> bit sizes and call it a day.

I added a bunch of opcodes in my branch, maybe that's enough? I guess
we could auto-generate some of those if we want. I am not sure if we
would need more stuff for GLSL, but for SPIR-V that seems to be all that
I needed going by the existing CTS tests.

> In any case, there's my brain dump of ideas. I hope some of it is
> useful. I've tested none of it in practice. Have fun!
>
> --Jason
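
For the auto-generation idea, a rough standalone sketch of what "generate versions of the opcodes at all the bit sizes" could look like. This does not use the real nir_opcodes.py helpers; the function name, the size-suffix naming scheme, and the list of sizes are all assumptions made for illustration.

```python
# Hypothetical: the set of boolean bit sizes to generate variants for.
BOOL_BIT_SIZES = (8, 16, 32)


def sized_compare_opcode_names(base_names, bit_sizes=BOOL_BIT_SIZES):
    """Return one opcode name per (comparison, boolean bit size) pair.

    The "<name><size>" suffix scheme is an assumption, not necessarily
    the naming used in Mesa or in the branch linked above.
    """
    return [f"{name}{size}" for name in base_names for size in bit_sizes]


if __name__ == "__main__":
    for opcode in sized_compare_opcode_names(["flt", "feq", "ilt", "ieq"]):
        print(opcode)
```

A generator along these lines would keep the per-size opcode lists in sync, at the cost of a larger opcode table than adding variants by hand only where tests need them.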