Chema Casanova <jmcasan...@igalia.com> writes: > El 20/07/18 a las 00:34, Francisco Jerez escribió: >> Chema Casanova <jmcasan...@igalia.com> writes: >> >>> El 14/07/18 a las 00:14, Francisco Jerez escribió: >>>> Jose Maria Casanova Crespo <jmcasan...@igalia.com> writes: >>>> >>>>> For a register source/destination of an instruction the function returns >>>>> the read/write byte pattern of a 32-byte registers as a unsigned int. >>>>> >>>>> The returned pattern takes into account the exec_size of the instruction, >>>>> the type bitsize, the stride and if the register is source or destination. >>>>> >>>>> The objective of the functions if to help to know the read/written bytes >>>>> of the instructions to improve the liveness analysis for partial >>>>> read/writes. >>>>> >>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL >>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize >>>>> parameter they have a different read pattern. >>>>> --- >>>>> src/intel/compiler/brw_fs.cpp | 183 +++++++++++++++++++++++++++++++++ >>>>> src/intel/compiler/brw_ir_fs.h | 1 + >>>>> 2 files changed, 184 insertions(+) >>>>> >>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp >>>>> index 2b8363ca362..f3045c4ff6c 100644 >>>>> --- a/src/intel/compiler/brw_fs.cpp >>>>> +++ b/src/intel/compiler/brw_fs.cpp >>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const >>>>> this->dst.offset % REG_SIZE != 0); >>>>> } >>>>> >>>>> +/** >>>>> + * Returns a 32-bit uint whose bits represent if the associated register >>>>> byte >>>>> + * has been read/written by the instruction. The returned pattern takes >>>>> into >>>>> + * account the exec_size of the instruction, the type bitsize and the >>>>> register >>>>> + * stride and the register is source or destination for the instruction. >>>>> + * >>>>> + * The objective of this function is to identify which parts of the >>>>> register >>>>> + * are read or written for operations that don't read/write a full >>>>> register. >>>>> + * So we can identify in live range variable analysis if a partial write >>>>> has >>>>> + * completelly defined the part of the register used by a partial read. >>>>> So we >>>>> + * avoid extending the liveness range because all data read was already >>>>> + * defined although the wasn't completely written. >>>>> + */ >>>>> +unsigned >>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const >>>>> +{ >>>>> + if (is_dst) { >>> >>>> Please split into two functions (like fs_inst::src_read and >>>> ::src_written) since that would make the call-sites of this method more >>>> self-documenting than a boolean parameter. You should be able to share >>>> code by refactoring the common logic into a separate function (see below >>>> for some suggestions on how that could be achieved). >>> >>> Sure, it would improve readability and simplifies the logic, I've chosen >>> dst_write_pattern and src_read_pattern. >>> >>>> >>>>> + /* We don't know what is written so we return the worts case */ >>>> >>>> "worst" >>> >>> Fixed. >>> >>>>> + if (this->predicate && this->opcode != BRW_OPCODE_SEL) >>>>> + return 0; >>>>> + /* We assume that send destinations are completely written */ >>>>> + if (this->is_send_from_grf()) >>>>> + return ~0u; >>>> >>>> Some send-like instructions won't be caught by this condition, you >>>> should check for this->mlen != 0 in addition. >>> >>> Would it be enough to check for (this->mlen > 0) and forget about >>> is_send_from_grf? I am using this approach in v2 I am sending. >>> >> >> I don't think the mlen > 0 condition would catch all cases either... >> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC. You probably need both >> conditions. Sucks... > > That is true, so now we have the: > (this->is_send_from_grf() || this->mlen != 0) > >>>>> + } else { >>>>> + /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned >>>>> + * so the read pattern depends on the bitsize stored at src[4] >>>>> + */ >>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL && >>>>> + this->src[1].nr == r.nr) { >>> >>>> I feel uncomfortable about attempting to guess the source the caller is >>>> referring to by comparing the registers for equality. E.g. you could >>>> potentially end up with two sources that compare equal but have >>>> different semantics (e.g. as a result of CSE) which might cause it to >>>> get the wrong answer. It would probably be better to pass a source >>>> index and a byte offset as argument instead of an fs_reg. >>> >>> I've didn't thought about CSE, I'm now receiving the number of source >>> and the reg_offset. I'm using reg_offset instead of byte offsets as it >>> simplifies the logic. Now we are using always the base src register to >>> do all the calculation >>>>> + switch (this->src[4].ud) { >>>>> + case 32: >>>>> + return ~0u; >>>>> + case 16: >>>>> + return 0x33333333; >>>>> + case 8: >>>>> + return 0x11111111; >>>>> + default: >>>>> + unreachable("Unsupported bitsize at >>>>> byte_scattered_write_logical"); >>>>> + } >>>> >>>> Replace the above switch statement with a call to "periodic_mask(8, 4, >>>> this->src[4].ud / 8)" (see below for the definition). >>> >>> Ok. >>> >>>>> + } >>>>> + /* As for byte_scattered_write_logical but we need to take into >>>>> account >>>>> + * that data written are in the payload offset 32 with SIMD8 and >>>>> offset >>>>> + * 64 with SIMD16. >>>>> + */ >>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE && >>>>> + this->src[0].nr == r.nr) { >>>>> + fs_reg payload = this->src[0]; >>>>> + payload.offset = REG_SIZE * this->exec_size / 8; >>>> >>>> byte_offset() is your friend. >>> >>> I've removed the overlap logic, and I'm just checking if we are in the >>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16. >>> >>>>> + if (regions_overlap(r, REG_SIZE, >>>>> + payload, REG_SIZE * this->exec_size / 8)) { >>>>> + switch (this->src[2].ud) { >>>>> + case 32: >>>>> + return ~0u; >>>>> + case 16: >>>>> + return 0x33333333; >>>>> + case 8: >>>>> + return 0x11111111; >>>>> + default: >>>>> + unreachable("Unsupported bitsize at >>>>> byte_scattered_write"); >>>>> + } >>>> >>>> Replace the above switch statement with a call to "periodic_mask(8, 4, >>>> this->src[2].ud / 8)". >>> >>> Ok. >>> >>>>> + } else { >>>>> + return ~0u; >>>>> + } >>>>> + } >>>>> + } >>>>> + >>>>> + /* We define the most conservative value in order to calculate >>>>> liveness >>>>> + * range. If it is a destination nothing is defined and if is a source >>>>> + * all the bytes of the register could be read. So for release builds >>>>> + * the unreachables would have always safe return value. */ >>>>> + unsigned pattern = is_dst ? 0 : ~0u; >>>>> + >>>>> + /* In the general case we calculate the pattern for a specific >>>>> register >>>>> + * on base of the type_size and stride. We calculate the SIMD8 pattern >>>>> + * and then we adjust the patter if needed for different exec_sizes >>>>> + * and offset >>>>> + */ >>>>> + switch (type_sz(r.type)){ >>>>> + case 1: >>>>> + switch (r.stride) { >>>>> + case 0: >>>>> + pattern = 0X1; >>>>> + break; >>>>> + case 1: >>>>> + pattern = 0xff; >>>>> + break; >>>>> + case 2: >>>>> + pattern = 0x5555; >>>>> + break; >>>>> + case 4: >>>>> + pattern = 0x11111111; >>>>> + break; >>>>> + case 8: >>>>> + pattern = 0x01010101; >>>>> + break; >>>>> + default: >>>>> + unreachable("Unknown pattern unsupported 8-bit stride"); >>>>> + } >>>>> + break; >>>>> + case 2: >>>>> + switch (r.stride) { >>>>> + case 0: >>>>> + pattern = 0X3; >>>>> + break; >>>>> + case 1: >>>>> + pattern = 0xffff; >>>>> + break; >>>>> + case 2: >>>>> + pattern = 0x33333333; >>>>> + break; >>>>> + case 4: >>>>> + pattern = 0x03030303; >>>>> + break; >>>>> + case 8: >>>>> + pattern = 0x00030003; >>>>> + break; >>>>> + default: >>>>> + unreachable("Unknown pattern unsupported 16-bit stride"); >>>>> + } >>>>> + break; >>>>> + case 4: >>>>> + switch (r.stride) { >>>>> + case 0: >>>>> + pattern = 0Xf; >>>>> + break; >>>>> + case 1: >>>>> + pattern = ~0u; >>>>> + break; >>>>> + case 2: >>>>> + pattern = 0x0f0f0f0f; >>>>> + break; >>>>> + case 4: >>>>> + pattern = 0x000f000f; >>>>> + break; >>>>> + default: >>>>> + unreachable("Unknown pattern unsupported 32-bit stride"); >>>>> + } >>>>> + break; >>>>> + case 8: >>>>> + switch (r.stride) { >>>>> + case 0: >>>>> + pattern = 0Xff; >>>>> + break; >>>>> + case 1: >>>>> + pattern = ~0u; >>>>> + break; >>>>> + case 2: >>>>> + pattern = 0x00ff00ff; >>>>> + break; >>>>> + case 4: >>>>> + pattern = 0xff; >>>>> + break; >>>>> + default: >>>>> + unreachable("Unknown pattern unsupported 64-bit stride"); >>>>> + } >>>>> + break; >>>>> + default: >>>>> + unreachable("Unknown pattern for unsupported bitsize "); >>>>> + } >>>>> + >>>>> + if (this->exec_size > 8 && r.stride * type_sz(r.type) * 8 < >>>>> REG_SIZE) { >>>>> + /* For exec_size greater than SIMD8 we repeat the pattern until it >>>>> + * represents a full register already represent a full register */ >>>>> + pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type))); >>>>> + if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < >>>>> REG_SIZE) >>>>> + pattern = pattern | (pattern << (16 * r.stride * >>>>> type_sz(r.type))); >>>>> + } else if (this->exec_size < 8 && >>>>> + r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) { >>>>> + /* For exec_size smaller than SIMD8 we reduce the pattern if its >>>>> size >>>>> + * is smaller than a full register. */ >>>>> + pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * >>>>> r.stride) - >>>>> + this->exec_size * type_sz(r.type) * >>>>> r.stride); >>>>> + } >>>>> + >>> >>>> This seems really mad, no clue whether it's correct. Why not replace >>>> the above ~110 lines with a call to the following (fully >>>> untested) 5-LOC function: >>> >>> Your suggestion seems to work perfectly fine, my original approach was >>> trying to avoid the loop of creating the read/write pattern but after >>> testing my v2 I wasn't able to notice any performance difference running >>> shader-db and having the same results. I was originally trying that >>> SIMD8 patterns were already constants. Sorry for the added complexity. >>> >> >> The loop runs for a logarithmic number of iterations though, so it has >> the exact same run-time complexity as your original patch, roughly the >> same amount of branches at run-time (but possibly less indirect >> branches!), and it should be compiled into a substantially lower number >> of instructions, which may actually cause it to perform better due to >> more favourable caching. It's hard to tell though which one will >> perform better in practice without benchmarking them, and as you >> probably realized this is so far from being a bottleneck that whatever >> the difference was is likely lost in the noise. So it really doesn't >> matter which one performs better... >> >>>> >>>> | uint32_t >>>> | periodic_mask(unsigned count, unsigned step, unsigned bits) >>>> | { >>>> | uint32_t m = (count ? (1 << bits) - 1 : 0); >>>> | const unsigined max = MIN2(count * step, sizeof(m) * CHAR_BITS); >>>> | >>>> | for (unsigned shift = step; shift < max; shift *= 2) >>>> | m |= m << shift; >>>> | >>>> | return m; >>>> | } >>> >>> I've used your function just changing the sizeof(m) * CHAR_BITS for the >>> REG_SIZE, to not include the limits.h. >> >> My intention was to make the function as agnostic to IR details as >> possible: the only reason there is a limit of 32 bits is because that's >> the size of the type used to hold the return value. Using sizeof makes >> sure that e.g. extending the code to 64 bits is as simple as changing >> the datatype to uint64_t. >> >>> And I've also included an offset parameter that allows us to shift the >>> bits of the pattern when we have an offset inside the register. >>> >> >> That sounds fine. >> >>>>> + /* We adjust the pattern to the byte_offset of the register */ >>>>> + pattern = pattern << (r.offset % REG_SIZE); >>>>> + >>>> >>>> This doesn't really work except for r equal to the first GRF read by the >>>> source. Regions with non-zero sub-GRF offset that straddle multiple >>>> GRFs are not really very common at the IR level, but they're allowed by >>>> the hardware with some restrictions, so it would make sense to at least >>>> handle them conservatively in order to keep things from breaking silently. >>> >>> >>> I included an assert in the periodic_mask function to detect if the >>> combination of offset and mask goes over the REG_SIZE, so we would >>> detect an straddle at this level of the IR. >>> >>> assert(offset + max - (step - bits) <= REG_SIZE); > >> Uhm... What's the definition of "offset" here? You seem to be passing >> the offset relative to the start of the VGRF modulo REG_SIZE but that >> doesn't really make sense to me whenever "reg_offset" is non-zero. > > Yes I agree on removing this assert in the periodic_mask as it a generic > function. I was to focus on seen the integer as the register :) > > The motivation was because of the following cases that happen with > 8/16-bit more usually: > > (1) mov(16) g12<2>HF g1<16,8,2>HF { align1 1H }; > (2) mov(16) g12.1<2>HF g2<16,8,2>HF { align1 1H }; > > In the dst_write_pattern, at point of calling the general periodic_mask > case we know that instruction is not expected to be a send message > destination. So the pattern is the same for reg_offset = 0 and > reg_offset = 1. >
In this specific example, yes, but in general the pattern will not be the same for multiple registers for arbitrary fs_reg::offset values. > 0x3333333 for case (1) and 0xccccccc for case (2). But without repeating > the pattern offset we would get 0xcccccccc for reg_offset=0 but > 0x333333333 for reg_offset=1 witch would be incorrect. > >> I think you want to pass "src[i].offset % REG_SIZE - reg_offset * >> REG_SIZE" as a signed integer in order to get the offset of the first >> byte actually written by the instruction relative to the first byte of >> the GRF window of the pattern. You don't really need to assert-fail >> when the offset is greater or equal to 32 (which shouldn't actually >> happen in practice), "return 0" gives you the correct behavior. For >> negative offsets (which means the pattern starts after the first byte >> written by the instruction) you can just return ~0u conservatively >> whenever the current logic wouldn't work [assuming you don't feel like >> implementing the code to handle that case accurately ;)]. > > Maybe you feel more comfortable with the following approach for > src_read_pattern.: > > For read sources we also assume that SENDs sources are completely > read so return ~0u except the byte_scattered_write source exceptions. > > if (this->is_send_from_grf() || this->mlen != 0) > return ~0u; > > So after this I think that we can assume that the following condition > should be always correct where source operand must reside within two > adjacent 256-bit registers so the pattern would be periodic for > all reg_offsets and we can use the (this->src[i].offset % REG_SIZE). > That's not a particularly futureproof assumption. I'm okay with handling non-reg_offset-periodic cases inaccurately for the moment for the sake of simplicity, but there is no reason to have the code blow up in such cases if you could just return ~0u. > assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ? > this->src[i].stride * type_sz(this->src[i].type) * this->exec_size : > type_sz(this->src[i].type), REG_SIZE)); > I don't think it's useful to assert-fail on this condition, since it doesn't handle all cases where the code below will be broken for a certain valid IR, and the condition it does protect against can be handled trivially in periodic_mask(), by returning zero when the 32-byte pattern window falls outside the region read or written by the instruction -- But currently you can't even know whether that's the case, because periodic_mask() is unaware where the 32-byte window starts relative to the source region, which is *all* it cares about, instead you're passing an offset to it that is equal to that in some special cases -- In all other cases the result of the function will be bogus. > So we could call in this scenario. > > return periodic_mask(this->exec_size, > this->src[i].stride * type_sz(this->src[i].type), > type_sz(this->src[i].type), > this->src[i].offset % REG_SIZE); > > The only issue I could think about that could generate issues would a > case that I didn't found in our current code where a source is defined > like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element > register region with interleaved rows (r5.0<1;8,2>:w)". > You couldn't really handle that even if you wanted to, because there is no way to represent such a thing at the IR level, since the horizontal and vertical strides cannot be controlled independently, only the stride member is meaningful for VGRFs. > What do you think? > > Chema > > >>> Thanks for the review, I think that v2 has better shape. >>> >> >> No problem. >> >>> Chema >>> >>>>> + assert(pattern); >>>>> + >>>>> + return pattern; >>>>> +} >>>>> + >>>>> + >>>>> unsigned >>>>> fs_inst::components_read(unsigned i) const >>>>> { >>>>> diff --git a/src/intel/compiler/brw_ir_fs.h >>>>> b/src/intel/compiler/brw_ir_fs.h >>>>> index 92dad269a34..5ea6294b8ad 100644 >>>>> --- a/src/intel/compiler/brw_ir_fs.h >>>>> +++ b/src/intel/compiler/brw_ir_fs.h >>>>> @@ -350,6 +350,7 @@ public: >>>>> bool equals(fs_inst *inst) const; >>>>> bool is_send_from_grf() const; >>>>> bool is_partial_write() const; >>>>> + unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) >>>>> const; >>>>> bool is_copy_payload(const brw::simple_allocator &grf_alloc) const; >>>>> unsigned components_read(unsigned i) const; >>>>> unsigned size_read(int arg) const; >>>>> -- >>>>> 2.17.1 >>>>> >>>>> _______________________________________________ >>>>> mesa-dev mailing list >>>>> mesa-dev@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>> >>>>> >>>>> _______________________________________________ >>>>> mesa-dev mailing list >>>>> mesa-dev@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev