On Fri, May 27, 2016 at 7:05 PM, Francisco Jerez <curroje...@riseup.net> wrote:
> Skipping the temporary allocation and copy instructions is easy (just > return dst), but the conditions used to find out whether the copy can > be optimized out safely without breaking the program are rather > complex: The destination must be exactly one component of at most the > execution width of the lowered instruction, and all source regions of > the instruction must be either fully disjoint from the destination or > be aligned with it group by group. > --- > src/mesa/drivers/dri/i965/brw_fs.cpp | 81 > ++++++++++++++++++++++++++++++++++++ > 1 file changed, 81 insertions(+) > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp > b/src/mesa/drivers/dri/i965/brw_fs.cpp > index 8eff27e..5d26c68 100644 > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp > @@ -5054,6 +5054,78 @@ emit_unzip(const fs_builder &lbld, bblock_t *block, > fs_inst *inst, > } > > /** > + * Return true if splitting out the group of channels of instruction \p > inst > + * given by lbld.group() requires allocating a temporary for the > destination > + * of the lowered instruction and copying the data back to the original > + * destination region. > + * > + * Note that the result of this function may be different for different > + * channel groups in cases where the overlap between a source and > destination > + * region is partial. > + */ > +static inline bool > +needs_dst_copy(const fs_builder &lbld, const fs_inst *inst) > +{ > + /* If the instruction writes more than one component we'll have to > shuffle > + * the results of multiple lowered instructions in order to make sure > that > + * they end up arranged correctly in the original destination region. > + */ > + if (inst->regs_written * REG_SIZE > > + inst->dst.component_size(inst->exec_size)) > + return true; > + > + /* If the lowered execution size is larger than the original the > result of > + * the instruction won't fit in the original destination, so we'll > have to > + * allocate a temporary in any case. > + */ > + if (lbld.dispatch_width() > inst->exec_size) > + return true; > + > + for (unsigned i = 0; i < inst->sources; i++) { > + /* Index of this lowered instruction and total number of lowered > + * instructions. > + */ > + const unsigned j = lbld.group() / lbld.dispatch_width(); > + const unsigned n = DIV_ROUND_UP(inst->exec_size, > lbld.dispatch_width()); > + > + /* Specified channels from the source and destination regions. */ > + const fs_reg src = horiz_offset(inst->src[i], lbld.group()); > + const fs_reg dst = horiz_offset(inst->dst, lbld.group()); > + > + /* Lowered component size of the source and destination regions. */ > + const unsigned dsrc = > inst->src[i].component_size(lbld.dispatch_width()); > + const unsigned ddst = > inst->dst.component_size(lbld.dispatch_width()); > + > + /* If we already made a copy of the source for other reasons there > won't > + * be any overlap with the destination. > + */ > + if (!is_periodic(inst->src[i], lbld.dispatch_width()) && > + inst->components_read(i) != 1) > might be good to have a needs_src_copy function to keep the logic in one place. > + continue; > + > + /* We need a copy if the specified group of channels of the > destination > + * overlaps either the previous or subsequent groups of the > source. We > + * don't care whether the destination of a given lowered instruction > + * overlaps its own source because there's arguably no > + * source/destination hazard for this opcode if the source and > + * destination of the unlowered instruction were already > overlapping. > + * > + * Note that it might be possible to avoid emitting a copy in more > cases > + * by making assumptions about the ordering in which the lowered > + * instructions will be emitted (ugh!), but this has the advantage > that > + * the lowered instructions will be fully independent from each > other > + * regardless of the form of overlap so they can be scheduled > freely. > + */ > + if (regions_overlap(dst, ddst, inst->src[i], dsrc * j) || > + regions_overlap(dst, ddst, horiz_offset(src, > lbld.dispatch_width()), > + dsrc * (n - j - 1))) > + return true; > Wow, this is annoyingly complicated. It also appears to be correct. Would it be sufficient to simply use the following? if (regions_overlap(inst->dst, inst->regs_written * REG_SIZE, inst->src[i], inst->regs_read() * REG_SIZE) && !inst->dst.equals(inst->src[i])) return true; It's much simpler and covers the two cases we really care about. If that's not sufficient, what you have above does look correct. --Jason > + } > + > + return false; > +} > + > +/** > * Insert data from a packed temporary into the channel group given by > * lbld.group() of the destination region of instruction \p inst and > return > * the temporary as result. If any copy instructions are required they > will > @@ -5073,6 +5145,8 @@ emit_zip(const fs_builder &lbld, bblock_t *block, > fs_inst *inst) > const fs_reg dst = horiz_offset(inst->dst, lbld.group()); > const unsigned dst_size = inst->regs_written * REG_SIZE / > inst->dst.component_size(inst->exec_size); > + > + if (needs_dst_copy(lbld, inst)) { > const fs_reg tmp = lbld.vgrf(inst->dst.type, dst_size); > > if (inst->predicate) { > @@ -5090,6 +5164,13 @@ emit_zip(const fs_builder &lbld, bblock_t *block, > fs_inst *inst) > .MOV(offset(dst, inst->exec_size, k), offset(tmp, lbld, k)); > > return tmp; > + > + } else { > + /* No need to allocate a temporary for the lowered instruction, just > + * take the right group of channels from the original region. > + */ > + return dst; > + } > } > > bool > -- > 2.7.3 > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev >
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev