Jason Ekstrand <ja...@jlekstrand.net> writes:

> I haven't said much about this series up until now. I've mostly sat
> and watched and focused my time on other things. As I said in the
> meeting today, I think that part of the problem here is that there are
> at least 3 "refactors" in here besides the new feature. By
> "refactors", I mean "new ways of solving problem X". Part of the
> problem with the discussion is that all of this has been conflated and
> we aren't actually talking about how to solve each of those problems.
> I'm going to try and break it out here to try and aid in the
> discussion.
>
Thanks a lot, Jason, for giving a more constructive turn to this
discussion. :)
> 1) The builder. I think *everyone* likes the builder. The argument
> here is not over whether or not we want it. People have expressed
> concern that adding the builder now without actually doing the
> refactor will result in duplicated code that will get out-of-sync. I
> think the best solution here is to go ahead and do the builder stuff
> immediately after the branch point.
>
Agreed.

> 2) SIMD8 instruction splitting. As I said in the call, we've done
> this a variety of ways in the past usually by emitting the split
> instructions in the visitor and manually using half() and
> inst->force_sechalf. We also have a few places that I think still
> split the code in the generator. What you're doing is a somewhat more
> elegant version of the "emit it in the visitor" stuff that we've done
> in the past. You've got your zip/unzip functions etc. to handle
> splitting and combining and then you have a loop to actually emit the
> sends. Really, it's fairly nice and concise. The question is whether
> or not it's really what we want to do (more on this in a bit).
>
There is another closely related problem that this infrastructure
solves which may not be obvious from this series (it was in my original
branch though [1]), but I think it's worth keeping in mind; I'll call
it 2.1: Some of the surface messages lacked SIMD4x2 variants, so a
strided copy is required to transpose the original AoS vector into SoA
form. This turns out to be algorithmically identical to the strided
copy required to "unzip" a SIMD16 vector, so it can be accomplished
without additional effort (a toy sketch of this follows below).

These transformations are largely transparent for the functions
building typed/untyped surface messages; they are hidden behind an
interface that doesn't seem to have received any attention yet:

- struct vector_layout: This represents the layout of a vector in a
  payload. It's initialized based on the restrictions of the hardware
  for the specific message we want to send, and the current code
  generation parameters (i.e. SIMD mode).

- emit_insert(): This takes a vector with the layout native to the EU
  (e.g. an fs_reg) and performs the required transformations to turn it
  into a form suitable for the shared unit according to a vector_layout
  struct (which may involve unzipping, a strided copy, or no work at
  all).

- emit_extract(): The converse of emit_insert(). This takes a payload
  fragment (array_reg) and transforms it into a vector of native
  layout, according to a vector_layout struct.

My intention was to come up with an interface more general than SIMD16
vector splitting and SIMD4x2 to SIMD8 conversion alone; I was also
thinking of texturing instruction payload quirks (which might require
padding or different permutations of the same values depending on the
generation even if the SIMD mode is natively supported -- let's call
this 2.2) and framebuffer writes (2.3).
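Going back to 2.1 for a moment, here is a toy, self-contained sketch --
not the actual code from the series, which works on register strides
and offsets rather than scalar loops -- of why the SIMD16 "unzip" and
the AoS-to-SoA transpose boil down to the same strided copy with
different parameters:

/* Toy model: dst[c * width + i] =
 *            src[offset + c * comp_stride + i * chan_stride] */
#include <vector>

static void
strided_copy(std::vector<float> &dst, const std::vector<float> &src,
             unsigned components, unsigned width,
             unsigned offset, unsigned comp_stride, unsigned chan_stride)
{
   dst.resize(components * width);
   for (unsigned c = 0; c < components; c++)
      for (unsigned i = 0; i < width; i++)
         dst[c * width + i] = src[offset + c * comp_stride + i * chan_stride];
}

/* Second SIMD8 half of a SIMD16 vec4 (components stored 16 channels apart):
 *    strided_copy(dst, simd16_vec4, 4, 8, 8, 16, 1);
 *
 * SIMD4x2 AoS vec4 (x0 y0 z0 w0 x1 y1 z1 w1) transposed into SoA form:
 *    strided_copy(dst, aos_vec4, 4, 2, 0, 1, 4);
 */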
> 3) Array reg. This is something of a helper to (2).
>
In fact the primary motivation for array_reg wasn't instruction
splitting, but emit_collect() -- more on that later.

> I agree that it's nice to be able to have access to the size of a
> register without having to look at the allocator. The problem here
> (and why we don't have a size in fs_reg) is that we copy the register
> around. Every register does have some sort of "underlying storage"
> that it shares with other registers with the same number. But that
> storage is represented by the size in the allocator and nothing else.
> Is that bad? Maybe. Should we have a way of passing the size around
> with it? Maybe. In any case, I think it's best to table this
> discussion until we've talked a bit about (2) because I think the
> resolution there will make (3) more clear.
>
I think the mechanism we use to determine the size of a register hardly
ever involves the allocator, and there are good reasons not to rely on
it: It only works for VGRFs, can easily break with register coalescing,
and it makes "slicing" and other zero-copy transformations of registers
difficult (e.g. taking an individual element out of a VGRF array
without copying it into a one-component register). Typically the size
of a register is implicitly assumed by the instruction using it, and
where it's not (because the instruction expects a variable length
payload) it is provided separately from the register via "mlen".
Neither of these two options works here, as will be obvious shortly.

> One note on 3 before we table it. People have been pulling
> emit_collect out as a straw-man and beating on it but I really do like
> it in principle. There are piles of times where you want a payload
> and you have to allocate an array, do stuff, and then put it into the
> payload and it's really annoying. It would be much easier if we had a
> helper that just did it all in one function and that's what
> emit_collect tries to do. *Thank you* for finally trying to solve
> that problem and make payloads less painful! However, I do wish it
> were implemented a bit differently but I'll make those comments on the
> patch itself.
>
Yeah. I think it was Matt who mentioned in the call that emit_collect()
is really complex. I completely agree with that statement. It needs to
take care of a bunch of things to build the payload correctly:

- Allocate a register that will hold the result, calculating the
  correct total size based on the amount of data we want to
  concatenate.

- Calculate the correct header size.

- Allocate and release memory of the correct size to hold the array of
  LOAD_PAYLOAD sources, also based on the amount of data we want to
  concatenate.

- Split up the arguments to be passed to LOAD_PAYLOAD as sources into
  scalars of the correct width (which depends on whether the argument
  is part of the header or not, and what the dispatch width is), and
  initialize the array.

- Create a LOAD_PAYLOAD instruction with the result of the previous
  steps.

Each one of these tasks is non-trivial, repetitive, can conceivably go
wrong, and until now was duplicated wherever we had to build a payload
-- obscuring the actually meaningful work being done. This is why I
think that the assertion that emit_collect() makes code difficult to
understand is backwards. Thanks to emit_collect() the task of building
a payload is tremendously simplified.

For emit_collect() to be possible there has to be some way to represent
a register+size pair, because it needs to know the size of each
argument, and because its result is itself a register+size pair -- the
caller cannot do anything useful with a payload of unknown size, unless
it re-calculates the size of the result, which would defeat much of the
purpose of emit_collect().

The biggest benefit from being able to represent a register+size pair
as a value is that emit_collect() can easily be composed with other
functions (including with itself), so the construction of specific
chunks of the payload can easily be decoupled and re-used.
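To make the composability point concrete, here's a tiny self-contained
model -- deliberately not the real fs backend types or signatures, just
an illustration of how carrying the size along with the register lets
one helper's result feed straight into another:

#include <cstdio>
#include <initializer_list>

struct toy_array_reg {
   unsigned nr;     /* stand-in for the VGRF number */
   unsigned size;   /* number of registers backing it */
};

static unsigned next_vgrf = 0;

/* Concatenate a number of register+size pairs into a freshly allocated
 * one.  In the real helper this is where the destination allocation and
 * the LOAD_PAYLOAD sources would be set up; here we only track sizes. */
static toy_array_reg
collect(std::initializer_list<toy_array_reg> srcs)
{
   unsigned size = 0;
   for (const toy_array_reg &s : srcs)
      size += s.size;
   return { next_vgrf++, size };
}

int
main()
{
   toy_array_reg header = { next_vgrf++, 1 };  /* e.g. a shared message header */
   toy_array_reg addr   = { next_vgrf++, 2 };  /* e.g. a SIMD16 address, unzipped */
   toy_array_reg data   = { next_vgrf++, 4 };

   /* Fragments compose: the result of one collect() is a valid source of
    * another, and the caller always knows the final payload size (mlen). */
   toy_array_reg payload = collect({ header, collect({ addr, data }) });
   printf("payload: vgrf %u, %u regs\n", payload.nr, payload.size);
   return 0;
}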
My code takes advantage of this feature extensively in order to share
code that constructs surface message headers or transforms some value
into the form expected by the shared unit (doing vector splitting or a
strided copy).

> ### SIMD16 Instruction Splitting ###
>
> SIMD16 instruction splitting is an unfortunate fact of our hardware.
> There are a variety of times we have to do it including dual-source FB
> writes, some texturing, math ops on older gens and maybe another place
> or two. Classically, this has been done in one of two places: The
> visitor as we emit the code, or the generator. The problem with doing
> it in the generator is that we can't schedule it and, if it involves a
> payload, it's not really possible. The result is that we usually do
> it in the visitor these days.
>
> Unfortunately, even in the visitor, it's gen-specific and annoying.
> It gets even worse when you're working with something such as the
> untyped surface read/write messages that work with multi-component
> values that have to be zipped/unzipped to use Curro's terminology.
> Curro came up with some helper functions to make it substantially less
> annoying but it still involves nasty looping.
>
> At some point in the past I proposed a completely different solution
> to this problem. Unfortunately, while I've talked to Matt &
> Ken about it, it's never really been discussed all that publicly so
> Curro may not be aware of it. I'm going to lay it out here for
> Curro's sake as well as the sake of public record.
>
> The solution involves first changing the way we handle sends into a
> two step process. First, we emit a logical instruction that contains
> all of the data needed for the actual instruction. Then, we convert
> from the logical to the actual in a lowering pass. Take, for example,
> FB writes with which I am fairly familiar. We would first emit a
> logical FS_FB_WRITE_# instruction that has separate sources for color,
> depth, replicated alpha, etc. Then, in the lower_fb_writes pass
> (which we would have to implement), we would construct the payload
> from the sources provided on the logical instruction and emit the
> actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes
> function would then get called before the optimization loop so that
> the rest of the optimization code would never see it.
>
> Second, we add a split_instruction helper that would take a SIMD16
> instruction and naively split it into two SIMD8 instructions. Such a
> helper really shouldn't be that hard to write. It would have to know
> how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's but that
> shouldn't be bad. Any of these new logical send instructions would
> have their values as separate sources so they should be safe to split.
>
> Third, we add a lower_simd16_to_simd8 pass that walks the
> instructions, picks out the ones that need splitting, and calls
> split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16
> knowledge would be contained in this one pass. This pass would happen
> between actually emitting code and running lower_fb_writes (or
> whatever other send lowering passes we have).
>
> Finally, and this is the icing on the cake, we write a
> lower_simd32_to_simd16 pass that goes through and lowers all SIMD32
> instructions (that act on full-sized data-types) to SIMD16
> instructions. Once the rest of the work is done, we get this pass,
> and with it SIMD32 mode, almost for free.
> I know this approach looks like more work and, to be honest, it may
> be. However, I think it makes a lot of things far more
> straightforward. In particular, it means that people working on the
> visitor code don't have to think about whether or not an instruction
> needs splitting. You also don't have to deal with the complexity of
> zipping/unzipping sources every time. Instead, we put all that code
> in one place and get to stop thinking about it. Also, if we *ever*
> want to get SIMD32, we will need some sort of automatic instruction
> splitting and this seems like a reasonable way to do it.
>
> I've talked to Ken about this approach and he's 100% on-board. I
> don't remember what Matt thinks of it. If we like the approach, then
> we should just split the tasks up and make it happen. It's a bit of
> refactoring but it shouldn't be terrible. If we wanted to demo it, I
> would probably suggest starting with FB writes as those are fairly
> complex but yet self-contained. They also have a case where we do
> split an instruction so it would be interesting to see what the code
> looks like before and after.
>
I generally like your proposal. I guess the question we need to answer
is whether we want this complexity to be in a lowering pass or in a
helper function used to build the send-like instruction -- in either
case we need code to handle zipping and unzipping of SIMD16 vectors;
the question is just whether this code is called from a lowering pass
or higher up in the visitor.

I can think of several benefits of the approach you propose over mine:

- It's more transparent for the visitor code emitting the message -- I
  completely agree with you that the explicit loops are rather ugly.

- Instructions with explicit separate sources are likely to be more
  suitable for certain optimization passes. Pull constant loads use a
  similar approach with an expression-style opcode which is at some
  point lowered to a load payload and send message. This may not be
  terribly important at this point because of the optimizations already
  performed in GLSL IR and NIR and due to the nature of the majority of
  opcodes that don't support SIMD16, but it still seems appealing.

Some disadvantages come to my mind too:

- It increases the amount of work required to add a new send-like
  instruction, because you need lowered and unlowered variants for
  each.

- It seems tricky to get right when splitting an instruction in half
  involves changing the actual contents of the payload beyond zipping
  and unzipping its arguments -- this might not seem like a big deal
  right now, but it will be a problem when we implement SIMD32. The
  surface messages that take a sample mask as argument are a good
  example, because they only have 16 bits of space for it, so you
  actually need to provide different values depending on the "slot
  group" the message is meant for. This can be worked around easily in
  the visitor by shifting the sample mask register, but it seems harder
  to fix up later (see the toy sketch after this list).

- My intention was to handle vector layout peculiarities beyond
  SIMDN-to-SIMD8 splitting under a consistent framework, leaving it to
  the helper function that builds the message to decide which quirks to
  apply. With the lowering pass approach you'd either need
  message-specific lowering passes, message-specific code in a general
  lowering pass, or some way for the builder code to pass
  vector_layout-like metadata to the lowering pass -- in the first two
  cases the policy on how to format each source in the payload would be
  "delocalized" between the visitor and lowering pass, which I don't
  particularly like.
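For reference, here is the kind of generic splitting helper I
understand the proposal to involve, as a deliberately simplified,
self-contained toy model -- made-up types and names, not actual fs
backend code. The comment at the end is where the sample-mask caveat
from the second point above shows up:

#include <utility>
#include <vector>

struct toy_reg {
   unsigned nr;    /* register number */
   unsigned half;  /* which group of 8 channels to read, like half() */
};

struct toy_inst {
   int opcode;
   unsigned exec_size;       /* 8 or 16 */
   unsigned group;           /* first channel covered by the instruction */
   std::vector<toy_reg> src; /* logical sends keep their sources separate */
};

/* Naively split a SIMD16 instruction into two SIMD8 ones by pointing the
 * second copy at the second half of each source.  A lowering pass would
 * call this for every instruction the current generation can't run in
 * SIMD16, before the message-specific lowering builds the payloads. */
static std::pair<toy_inst, toy_inst>
split_instruction(const toy_inst &inst)
{
   toy_inst lo = inst, hi = inst;
   lo.exec_size = hi.exec_size = 8;
   hi.group = inst.group + 8;
   for (toy_reg &s : hi.src)
      s.half += 1;
   /* Anything that has to differ between the halves beyond source offsets
    * (e.g. a sample mask with only 16 bits of space) still needs
    * message-specific knowledge somewhere. */
   return { lo, hi };
}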
> Thoughts?
>
> Question for Curro: Supposing for the moment that we decide to do
> SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff
> would still be needed for image_load_store?
>
I think I've already answered this question, but I'll answer it again
here for clarity -- it would still help with building and passing
around payloads concisely using emit_collect(), regardless of how we
implement SIMD16 instruction splitting.

> I hope this e-mail provides us with some good talking points and a way
> forward.
> --Jason

Thanks Jason!

[1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store