On Fri, 2019-01-11 at 08:05 +0100, Iago Toral wrote:
> On Thu, 2019-01-10 at 13:18 -0600, Jason Ekstrand wrote:
> > Topi just asked me on IRC what I thought about handling 16-bit
> > booleans on Intel hardware in the light of the 1-bit boolean
> > stuff. The current state of the driver is that we use the
> > nir_lower_bool_to_int32 pass to produce NIR that looks basically
> > identical to the NIR we were getting in the back-end before. This
> > lets us kick the can down the road a bit, but I alluded in the
> > 1-bit boolean series to ideas of doing something more
> > Intel-specific. Instead of answering on IRC, I thought I'd send a
> > mesa-dev mail so that we can have a more universal discussion.
> >
> > ## The problem:
> >
> > On Intel hardware, comparison operations generate two results: a
> > flag result which goes straight into the flag register and a
> > destination result which goes into the GRF pointed to by the CMP
> > instruction's destination. The flag result can be thought of as
> > either a 32-bit bitfield scalar (in the sense of one for all
> > threads) or as a per-thread 1-bit value. The GRF value is a
> > per-thread value whose size matches that of the execution size of
> > the instruction. If you're comparing two 64-bit integers or
> > floats, it produces a 64-bit value (though I believe the top 32
> > bits are garbage). On a 32, 16, or 8-bit comparison, it produces a
> > 32, 16, or 8-bit boolean respectively. The only reason why D3D
> > booleans have historically been a good match for our hardware is
> > because we've historically only really cared about 32-bit values.
> > With 64-bit types, we could just do a conversion and write it off
> > as "64-bit is expensive." In the new world of 8 and 16-bit types,
> > however, that doesn't make nearly as much sense.
> >
> > ## Solutions:
> >
> > The real question is what size we should make booleans in the
> > back-end.
> > There are many different possible answers to this question,
> > but whatever happens, it should probably happen in NIR so that we
> > can make choices while we're still in SSA. I've considered a few
> > different ideas on what we could do:
> >
> > 1. Make everything 16-bit. 8-bit is clumsy because of the weird
> > stride requirements, but 32 and 64-bit can trivially be converted
> > to 16-bit with a strided integer MOV. For the few places where we
> > need an actual 32-bit bool (b2f), a signed integer up-cast will do
> > the trick. For that matter, just using a mixed-size AND with W
> > type for the bool and D type for the 0x3f800000 might do the trick
> > and keep it to one instruction.
> >
> > 2. Use the "native" boolean size for all comparison operations and
> > then, whenever we need to combine booleans via iand, bcsel, or a
> > phi, make the result the smallest of the sources and insert
> > extract_u16 or extract_u32 to do a down-cast for the larger
> > sources. (We want the extract opcodes so we get a strided MOV which
> > the back-end can more easily eliminate.)
> >
> > 3. Don't use comparison destinations at all and treat the flag as
> > a 32 or 16-bit value (depending on dispatch width). You can do a
> > boolean AND by just ANDing flag results and you have to write into
> > the flag at the end in order to use it. This idea is a bit on the
> > crazy side but it's interesting to think about.
> >
> > If idea 1 actually works, it would reduce register pressure a
> > decent bit, which would be a very good thing. However, I'm not sure
> > how well we'll actually be able to optimize with it.
>
> I have 1) implemented (I was planning to send a series for review
> after we land the 8-bit and 16-bit series). I think it is working
> quite well for me, but of course I only have the CTS tests to play
> with.
> Here are some numbers:
>
> VK_KHR_shader_float16_int8 branch:
>
>                                   |   SIMD8 |  SIMD16 |
> -------------------------------------------------------
> spirv_assembly.type.scalar.i8.*   |  19,725 |   2,044 |
> spirv_assembly.type.scalar.i16.*  |  35,504 |   3,650 |
> instruction.graphics.float16.*    | 305,129 |  29,760 |
> builtin.precision*.comparison.*   |       - |   2,284 |
>
> VK_KHR_shader_float16_int8 + 8-bit/16-bit booleans:
>
>                                   |   SIMD8 |  SIMD16 |
> -------------------------------------------------------
> spirv_assembly.type.scalar.i8.*   |  19,718 |   2,043 |
> spirv_assembly.type.scalar.i16.*  |  35,369 |   3,645 |
> instruction.graphics.float16.*    | 302,764 |  29,627 |
> builtin.precision*.comparison.*   |       - |   2,144 |
>
> I see benefits across the board. It is not a huge improvement, but
> there is some. Getting 8-bit booleans to produce a lower number of
> instructions took a bit more work because the hardware doesn't
> support 8-bit immediates, so the usual comparisons with 0 that we
> emit (specifically for bcsel) would produce worse code since we
> would need to emit a MOV to a VGRF to handle the constant argument. I
> fixed this by using a MOV.NZ to write the flag, and then I had to
> patch the CSE pass to work on MOV instructions with a NULL
> destination, which should be safe. With that it is about the same as
> 32-bit booleans, with maybe 1 or 2 instructions less in a few shaders
> I was playing with, so I think it is probably worth a try. I'd
> definitely do this for 16-bit booleans at least.

One more thing about 8-bit booleans. Due to the hardware restrictions
affecting Byte types, we end up having to align them to Word all the
time, so in practice I think they do not really bring an advantage, and
emitting 16-bit booleans for them might be a better solution unless we
have reason to believe that Byte instructions have better ALU
throughput to compensate for the extra hassle.

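To make the 16-bit boolean arithmetic from idea 1 in the quoted mail a
bit more concrete, here is a minimal C sketch. The helper names are
hypothetical and this is not code from the branch. It assumes booleans
are 0 for false and all-ones for true, and it only models the bit
manipulation, not the instructions the back-end would actually emit.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only, not Mesa code.
 * Assumption: a boolean is 0 (false) or all-ones (true) at any size. */

/* Narrow a 32-bit boolean to 16 bits by keeping the low word, which
 * is what a strided integer MOV would read. */
static uint16_t bool32_to_bool16(uint32_t b)
{
    return (uint16_t)b;
}

/* Widen an 8-bit boolean to 16 bits with a signed up-cast, so that
 * 0xff becomes 0xffff. */
static uint16_t bool8_to_bool16(uint8_t b)
{
    return (uint16_t)(int16_t)(int8_t)b;
}

/* b2f: sign-extend the 16-bit boolean to 32 bits and AND it with
 * 0x3f800000, the bit pattern of 1.0f. */
static float bool16_to_float(uint16_t b)
{
    uint32_t bits = (uint32_t)(int32_t)(int16_t)b & 0x3f800000u;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under those assumptions a true boolean (0xffff) maps to 1.0f and a false
one (0) maps to 0.0f, and both the 8-bit and 32-bit forms end up as the
same 16-bit value.
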
> Another thing: the boolean lowering I have is not perfect. When we
> find the need to make canonical booleans, or when we have undef
> operands, I just take an easy way out, but I think we could probably
> do something smarter in some cases to reduce the number of
> conversions, which should allow us to produce even better instruction
> counts.
>
> I've just uploaded a branch here if you want to look at the
> implementation of this I have right now (patches at the tip of the
> branch, on top of the float16/int8 implementation):
>
> https://github.com/Igalia/mesa/tree/itoral/VK_KHR_shader_float16_int8_1bit_bool
>
> > Regardless of what we do, we'll need some new NIR instructions. I
> > think that was more Topi's direct question. I think the easiest
> > thing to do would be to make 16 or 64-bit versions of the
> > comparison instructions we have today. We could make the
> > binop_compare helper in nir_opcodes.py just generate versions of
> > the opcodes at all the bit sizes and call it a day.
>
> I added a bunch of opcodes in my branch, maybe that's enough? I guess
> we could auto-generate some of those if we want. I am not sure if we
> would need more stuff for GLSL, but for SPIR-V that seems to be all
> that I needed going by the existing CTS tests.
>
> > In any case, there's my brain dump of ideas. I hope some of it is
> > useful. I've tested none of it in practice. Have fun!
> >
> > --Jason
>
> _______________________________________________
> mesa-dev mailing list
> mesa-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev