Jason Ekstrand <ja...@jlekstrand.net> writes: > On Thu, Jul 23, 2015 at 3:55 AM, Francisco Jerez <curroje...@riseup.net> > wrote: >> Francisco Jerez <curroje...@riseup.net> writes: >> >>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>> >>>> On Wed, Jul 22, 2015 at 10:05 AM, Francisco Jerez <curroje...@riseup.net> >>>> wrote: >>>>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>>>> >>>>>> On Wed, Jul 22, 2015 at 12:31 AM, Francisco Jerez >>>>>> <curroje...@riseup.net> wrote: >>>>>>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>>>>>> >>>>>>>> A few comments below. Mostly just asking for explanation. >>>>>>>> >>>>>>>> 1-3 are >>>>>>>> >>>>>>>> Reviewed-by: Jason Ekstrand <jason.ekstr...@intel.com> >>>>>>>> >>>>>>>> Obviously, don't merge 4/4 until it actually has users. >>>>>>>> --Jason >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>>> >>>>>>>> On Thu, Jul 16, 2015 at 8:35 AM, Francisco Jerez >>>>>>>> <curroje...@riseup.net> wrote: >>>>>>>>> This lowering pass implements an algorithm to expand SIMDN >>>>>>>>> instructions into a sequence of SIMDM instructions in cases where the >>>>>>>>> hardware doesn't support the original execution size natively for some >>>>>>>>> particular instruction. The most important use-cases are: >>>>>>>>> >>>>>>>>> - Lowering send message instructions that don't support SIMD16 >>>>>>>>> natively into SIMD8 (several texturing, framebuffer write and typed >>>>>>>>> surface operations). >>>>>>>>> >>>>>>>>> - Lowering messages that don't support SIMD8 natively into SIMD16 >>>>>>>>> (*cough*gen4*cough*). >>>>>>>>> >>>>>>>>> - 64-bit precision operations (e.g. FP64 and 64-bit integer >>>>>>>>> multiplication). >>>>>>>>> >>>>>>>>> - SIMD32. >>>>>>>>> >>>>>>>>> The algorithm works by splitting the sources of the original >>>>>>>>> instruction into chunks of width appropriate for the lowered >>>>>>>>> instructions, and then interleaving the results component-wise into >>>>>>>>> the destination of the original instruction. 
The pass is controlled >>>>>>>>> by the get_lowered_simd_width() function that currently just returns >>>>>>>>> the original execution size making the whole pass a no-op for the >>>>>>>>> moment until some user is introduced. >>>>>>>>> --- >>>>>>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 142 >>>>>>>>> +++++++++++++++++++++++++++++++++++ >>>>>>>>> src/mesa/drivers/dri/i965/brw_fs.h | 1 + >>>>>>>>> 2 files changed, 143 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>>>> index d031352..eeb6938 100644 >>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>>>> @@ -3204,6 +3204,147 @@ fs_visitor::lower_logical_sends() >>>>>>>>> return progress; >>>>>>>>> } >>>>>>>>> >>>>>>>>> +/** >>>>>>>>> + * Get the closest native SIMD width supported by the hardware for >>>>>>>>> instruction >>>>>>>>> + * \p inst. The instruction will be left untouched by >>>>>>>>> + * fs_visitor::lower_simd_width() if the returned value is equal to >>>>>>>>> the >>>>>>>>> + * original execution size. >>>>>>>>> + */ >>>>>>>>> +static unsigned >>>>>>>>> +get_lowered_simd_width(const struct brw_device_info *devinfo, >>>>>>>>> + const fs_inst *inst) >>>>>>>>> +{ >>>>>>>>> + switch (inst->opcode) { >>>>>>>>> + default: >>>>>>>>> + return inst->exec_size; >>>>>>>>> + } >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> +/** >>>>>>>>> + * The \p rows array of registers represents a \p num_rows by \p >>>>>>>>> num_columns >>>>>>>>> + * matrix in row-major order, write it in column-major order into >>>>>>>>> the register >>>>>>>>> + * passed as destination. \p stride gives the separation between >>>>>>>>> matrix >>>>>>>>> + * elements in the input in fs_builder::dispatch_width() units. 
>>>>>>>>> + */ >>>>>>>>> +static void >>>>>>>>> +emit_transpose(const fs_builder &bld, >>>>>>>>> + const fs_reg &dst, const fs_reg *rows, >>>>>>>>> + unsigned num_rows, unsigned num_columns, unsigned >>>>>>>>> stride) >>>>>>>> >>>>>>>> I'm not sure what I think about calling this "emit_transpose". I >>>>>>>> guess it is kind of a transpose, but maybe it's more of a "gather". >>>>>>>> I'm not going to quibble about it though. >>>>>>> >>>>>>> *Shrug*, it doesn't only gather the vectors of the rows array (that >>>>>>> would have been emit_collect :P), it copies them out in vertical, just >>>>>>> like a matrix transpose -- assuming you're not horrified by the idea of >>>>>>> considering the argument a matrix. >>>>>> >>>>>> That's fine. Feel free to ignore that suggestion. >>>>>> >>>>>>>> >>>>>>>>> +{ >>>>>>>>> + fs_reg *const components = new fs_reg[num_rows * num_columns]; >>>>>>>>> + >>>>>>>>> + for (unsigned i = 0; i < num_columns; ++i) { >>>>>>>>> + for (unsigned j = 0; j < num_rows; ++j) >>>>>>>>> + components[num_rows * i + j] = offset(rows[j], bld, stride >>>>>>>>> * i); >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + bld.LOAD_PAYLOAD(dst, components, num_rows * num_columns, 0); >>>>>>>>> + >>>>>>>>> + delete[] components; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> +bool >>>>>>>>> +fs_visitor::lower_simd_width() >>>>>>>>> +{ >>>>>>>>> + bool progress = false; >>>>>>>>> + >>>>>>>>> + foreach_block_and_inst_safe(block, fs_inst, inst, cfg) { >>>>>>>>> + const unsigned lower_width = get_lowered_simd_width(devinfo, >>>>>>>>> inst); >>>>>>>>> + >>>>>>>>> + if (lower_width != inst->exec_size) { >>>>>>>>> + /* Builder matching the original instruction. 
*/ >>>>>>>>> + const fs_builder ibld = bld.at(block, inst) >>>>>>>>> + >>>>>>>>> .exec_all(inst->force_writemask_all) >>>>>>>>> + .group(inst->exec_size, >>>>>>>>> inst->force_sechalf); >>>>>>>>> + >>>>>>>>> + /* Split the copies in chunks of the execution width of >>>>>>>>> either the >>>>>>>>> + * original or the lowered instruction, whichever is lower. >>>>>>>>> + */ >>>>>>>>> + const unsigned copy_width = MIN2(lower_width, >>>>>>>>> inst->exec_size); >>>>>>>>> + const unsigned n = inst->exec_size / copy_width; >>>>>>>>> + const unsigned dst_size = inst->regs_written * REG_SIZE / >>>>>>>>> + inst->dst.component_size(inst->exec_size); >>>>>>>>> + fs_reg dsts[4]; >>>>>>>>> + >>>>>>>>> + assert(n > 0 && n <= ARRAY_SIZE(dsts) && >>>>>>>>> + !inst->writes_accumulator && !inst->mlen); >>>>>>>>> + >>>>>>>>> + for (unsigned i = 0; i < n; i++) { >>>>>>>>> + /* Emit a copy of the original instruction with the >>>>>>>>> lowered width. >>>>>>>>> + * If the EOT flag was set throw it away except for the >>>>>>>>> last >>>>>>>>> + * instruction to avoid killing the thread prematurely. >>>>>>>>> + */ >>>>>>>>> + fs_inst tmp_inst = *inst; >>>>>>>>> + tmp_inst.exec_size = lower_width; >>>>>>>>> + tmp_inst.eot = inst->eot && i == n - 1; >>>>>>>>> + >>>>>>>>> + /* Set exec_all if the lowered width is higher than the >>>>>>>>> original >>>>>>>>> + * to avoid breaking the compiler invariant that no >>>>>>>>> control >>>>>>>>> + * flow-masked instruction is wider than the shader's >>>>>>>>> + * dispatch_width. Then emit the lowered instruction. >>>>>> >>>>>> I don't think this is a standard invariant. The standard invariant is >>>>>> "set exec_all if you're doing something that doesn't match the natural >>>>>> exec mask". In fact, in the gen4 texturing code (which is what this >>>>>> is primarily for), we leave the writemask alone. 
>>>>> >>>>> The gen4 texturing code would have likely caused problems with >>>>> optimization passes that use the builder to emit new instructions based >>>>> on the channel enables of already existing instructions, precisely >>>>> because it used to break the invariant you mention (do something with an >>>>> exec mask higher than the natural without force_writemask_all). >>>> >>>> What do you mean by "higher than the natural"? >>> >>> I was trying to use your own words :P. The Gen4 texturing code was >>> using an exec mask larger than the shader's dispatch_width-wide one, so >>> I assume it doesn't fall into the category of instructions matching the >>> "natural" exec mask. >>> >>>> What is an optimization pass doing with exec_size that would get >>>> messed up? >>>> >>>> The optimization passes and most of the compiler beyond the initial >>>> NIR -> FS pass shouldn't care about dispatch_width. It should only >>>> ever care about the exec_sizes of particular instructions. Hey, look, >>>> a SIMD16 instruction! No big deal. I really don't know what these >>>> mythical optimization problems would be. >>>> >>> The problem is that optimization passes start off with a >>> dispatch_width-wide fs_builder, which isn't usable to emit non-exec_all >>> SIMD16 instructions in a SIMD8 shader, so they're going to hit an >>> assertion if they accidentally try to apply some transformation or emit >>> new instructions based on the exec_size of the original texturing >>> instruction. >>> >>> A (somewhat disturbing) solution would be not to hand the original >>> dispatch_width-wide fs_builder to optimization passes, but a, say, >>> SIMD32 one, so they would be able to emit non-exec_all instructions of >>> arbitrary exec_size, with potentially undefined results. I can do that >>> as PATCH 3.5 if you insist on getting rid of the exec_all() call in this >>> case. >> >> Heh... 
While doing this I noticed a few cases during >> lowering/optimization in which we would emit instructions without >> initializing the execution size correctly. Probably pre-existing bugs. >> I'll fix them in a separate series that this one will depend on, and drop the >> exec_all() call from this patch. > > For the sake of getting some patches landed, I'm ok with landing this > one as-is and then fixing it up once exec_all isn't needed.
Sounds good to me, I'll do it as a follow-up. > --Jason > >>> >>>>>> If you do a SIMD16 instruction in SIMD8 mode without exec_all, you >>>>>> get garbage in the other 8 channels but it's otherwise fine. For >>>>>> instructions we need to "expand", this should be fine since we're >>>>>> throwing away the extra channels. >>>>>> >>>>>> On the other hand, we don't know what the other half of the writemask >>>>>> will be (experimentation indicates that it's *not* a duplicate of the >>>>>> first half) so not setting exec_all doesn't really gain us anything. >>>>>> Unless, of course, we could maybe save some memory bandwidth in >>>>>> texturing instructions if some of the channels are disabled. I doubt >>>>>> that matters much. >>>>>> >>>>> Yeah, in any case it can only have an effect on Gen4, and it will change >>>>> from writing garbage nondeterministically to some channels of the second >>>>> half to writing garbage deterministically. >>>>> >>>>>> My inclination would be to leave the exec_all alone since it's not >>>>>> needed, but I guess I'm not going to fight it too hard. >>>>>> >>>>> I wouldn't feel comfortable with doing that, it's surely going to lead to >>>>> assertion failures in the future which will only be reproducible on Gen4 >>>>> and will therefore not be obvious for the casual contributor unless it's >>>>> being tested on that specifically. I guess the alternative would be to >>>>> relax the invariant, but it's proven to catch legitimate mistakes and >>>>> what the Gen4 code was doing had rather dubious semantics anyway, so I >>>>> doubt it's justified to change it... >>>>> >>>>>>>>> + */ >>>>>>>>> + const fs_builder lbld = ibld.exec_all(lower_width > >>>>>>>>> inst->exec_size) >>>>>>>>> + .group(lower_width, i); >>>>>>>>> + fs_inst *split_inst = lbld.emit(tmp_inst); >>>>>>>> >>>>>>>> Why are you emitting the split instruction before the source >>>>>>>> transposes? Don't they need to go first? Maybe I'm missing >>>>>>>> something.
>>>>>>>> >>>>>>> I guess it doesn't really matter in which order they are inserted as >>>>>>> long as they end up in the correct order inside the program (and as you >>>>>>> can see below there is an at() call making sure that the builder used to >>>>>>> emit the transposes is pointing before the split instruction). >>>>>>> >>>>>>> No important reason really, other than to be able to assign the >>>>>>> temporaries allocated below to the split instruction sources directly. >>>>>> >>>>>> That's fine. It would be nice if you left a comment to that effect. >>>>>> Otherwise, it looks like you're emitting the send and then the moves. >>>>>> It's really easy when reading the code to miss the at() call below. >>>>>> >>>>> Sure. It would also be easy to swap the order of the transposes and the >>>>> split_inst emission if you find it easier to understand that way. I'll >>>>> just go do that instead of the comment if you have no objection. >>>>> >>>>>>>>> + >>>>>>>>> + for (unsigned j = 0; j < inst->sources; j++) { >>>>>>>>> + if (inst->src[j].file != BAD_FILE && >>>>>>>>> + !is_uniform(inst->src[j])) { >>>>>>>>> + /* Get the i-th copy_width-wide chunk of the >>>>>>>>> source. */ >>>>>>>>> + const fs_reg src = horiz_offset(inst->src[j], >>>>>>>>> copy_width * i); >>>>>>>>> + const unsigned src_size = inst->regs_read(j) * >>>>>>>>> REG_SIZE / >>>>>>>>> + inst->src[j].component_size(inst->exec_size); >>>>>>>>> + >>>>>>>>> + /* Use a trivial transposition to copy one every n >>>>>>>>> + * copy_width-wide components of the register into >>>>>>>>> a >>>>>>>>> + * temporary passed as source to the lowered >>>>>>>>> instruction. >>>>>>>>> + */ >>>>>>>>> + split_inst->src[j] = lbld.vgrf(inst->src[j].type, >>>>>>>>> src_size); >>>>>>>>> + emit_transpose(lbld.group(copy_width, 0).at(block, >>>>>>>>> split_inst), >>>>>>>>> + split_inst->src[j], &src, 1, >>>>>>>>> src_size, n); >>>>>>>> >>>>>>>> Shouldn't this be group(copy_width, i)? 
>>>>>>>> >>>>>>> >>>>>>> I'm already choosing the i-th channel group in the definition of lbld >>>>>>> while emitting the split instruction, this group() call is just to make >>>>>>> sure that the transposes don't copy data into the garbage channels of >>>>>>> the temporary in cases where copy_width is less than lower_width. >>>>>> >>>>>> Right. That makes sense. >>>>>> >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if (inst->regs_written) { >>>>>>>>> + /* Allocate enough space to hold the result of the >>>>>>>>> lowered >>>>>>>>> + * instruction and fix up the number of registers >>>>>>>>> written. >>>>>>>>> + */ >>>>>>>>> + split_inst->dst = dsts[i] = >>>>>>>>> + lbld.vgrf(inst->dst.type, dst_size); >>>>>>>>> + split_inst->regs_written = >>>>>>>>> + DIV_ROUND_UP(inst->regs_written * lower_width, >>>>>>>>> + inst->exec_size); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if (inst->regs_written) { >>>>>>>>> + /* Distance between useful channels in the temporaries, >>>>>>>>> skipping >>>>>>>>> + * garbage if the lowered instruction is wider than the >>>>>>>>> original. >>>>>>>>> + */ >>>>>>>>> + const unsigned m = lower_width / copy_width; >>>>>>>>> + >>>>>>>>> + /* Interleave the components of the result from the >>>>>>>>> lowered >>>>>>>>> + * instructions. We need to set exec_all() when copying >>>>>>>>> more than >>>>>>>>> + * one half per component, because LOAD_PAYLOAD (in >>>>>>>>> terms of which >>>>>>>>> + * emit_transpose is implemented) can only use the same >>>>>>>>> channel >>>>>>>>> + * enable signals for all of its non-header sources. >>>>>>>>> + */ >>>>>>>>> + emit_transpose(ibld.exec_all(inst->exec_size > >>>>>>>>> copy_width) >>>>>>>>> + .group(copy_width, 0), >>>>>>>>> + inst->dst, dsts, n, dst_size, m); >>>>>> >>>>>> While we're on the exec_all topic, I don't think it's needed here >>>>>> either. In the case where exec_size > copy_width, we're throwing away >>>>>> the second half. 
Therefore, the channels with the wrong exec mask >>>>>> will either not get emitted at all or dead-code will delete them (I'm >>>>>> not sure which without thinking about it harder). In either case, the >>>>>> MOV's that do matter will have the right exec mask and exec_all >>>>>> doesn't gain us anything. >>>>> >>>>> Nope, the "exec_size > copy_width" case is the usual (non-gen4) case. >>>>> Nothing can be thrown away in that case because both halves have to be >>>>> copied interleaved into the destination register, so whatever execution >>>>> mask LOAD_PAYLOAD uses it has to work for both halves, which is only >>>>> possible if force_writemask_all is set. >>>> >>>> Right... Not a huge fan of exec_all'd LOAD_PAYLOAD, but we don't have >>>> a ZIP instruction so I won't complain too much. >>>> >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + inst->remove(block); >>>>>>>>> + progress = true; >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> + if (progress) >>>>>>>>> + invalidate_live_intervals(); >>>>>>>>> + >>>>>>>>> + return progress; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> void >>>>>>>>> fs_visitor::dump_instructions() >>>>>>>>> { >>>>>>>>> @@ -3655,6 +3796,7 @@ fs_visitor::optimize() >>>>>>>>> int iteration = 0; >>>>>>>>> int pass_num = 0; >>>>>>>>> >>>>>>>>> + OPT(lower_simd_width); >>>>>>>>> OPT(lower_logical_sends); >>>>>>>>> >>>>>>>>> do { >>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h >>>>>>>>> b/src/mesa/drivers/dri/i965/brw_fs.h >>>>>>>>> index f3850d1..9582648 100644 >>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.h >>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h >>>>>>>>> @@ -184,6 +184,7 @@ public: >>>>>>>>> bool lower_load_payload(); >>>>>>>>> bool lower_logical_sends(); >>>>>>>>> bool lower_integer_multiplication(); >>>>>>>>> + bool lower_simd_width(); >>>>>>>>> bool opt_combine_constants(); >>>>>>>>> >>>>>>>>> void emit_dummy_fs(); >>>>>>>>> -- >>>>>>>>> 2.4.3 >>>>>>>>>
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev