Jason Ekstrand <ja...@jlekstrand.net> writes:

> On Fri, May 15, 2015 at 9:51 AM, Francisco Jerez <curroje...@riseup.net> wrote:
>> Jason Ekstrand <ja...@jlekstrand.net> writes:
>>
>>> I haven't said much about this series up until now. I've mostly sat and watched and focused my time on other things. As I said in the meeting today, I think that part of the problem here is that there are at least 3 "refactors" in here besides the new feature. By "refactors", I mean "new ways of solving problem X". Part of the problem with the discussion is that all of this has been conflated and we aren't actually talking about how to solve each of those problems. I'm going to try and break it out here to try and aid in the discussion.
>>>
>> Thanks a lot Jason for giving a more constructive turn to this discussion. :)
>>
>>> 1) The builder. I think *everyone* likes the builder. The argument here is not over whether or not we want it. People have expressed concern that adding the builder now without actually doing the refactor will result in duplicated code that will get out of sync. I think the best solution here is to go ahead and do the builder stuff immediately after the branch point.
>>>
>> Agreed.
>>
>>> 2) SIMD8 instruction splitting. As I said in the call, we've done this a variety of ways in the past, usually by emitting the split instructions in the visitor and manually using half() and inst->force_sechalf. We also have a few places that I think still split the code in the generator. What you're doing is a somewhat more elegant version of the "emit it in the visitor" stuff that we've done in the past. You've got your zip/unzip functions etc. to handle splitting and combining, and then you have a loop to actually emit the sends. Really, it's fairly nice and concise. The question is whether or not it's really what we want to do (more on this in a bit).
>>>
>> There is another closely related problem that this infrastructure solves which may not be obvious in this series (it was in my original branch though [1]), but I think it's worth keeping in mind; I'll call it 2.1: Some of the surface messages lacked SIMD4x2 variants, so a strided copy is required to transpose the original AoS vector into SoA form. This turns out to be algorithmically identical to the strided copy required to "unzip" a SIMD16 vector, so it can be accomplished without additional effort.
>>
>> These transformations are largely transparent to the functions building typed/untyped surface messages; they are hidden behind an interface that doesn't seem to have received any attention yet:
>>
>> - struct vector_layout: This represents the layout of a vector in a payload. It's initialized based on the restrictions of the hardware for the specific message we want to send, and the current code generation parameters (i.e. SIMD mode).
>>
>> - emit_insert(): This takes a vector with the layout native to the EU (e.g. an fs_reg) and performs the required transformations to turn it into a form suitable for the shared unit according to a vector_layout struct (which may involve unzipping, a strided copy or no work at all).
>>
>> - emit_extract(): The converse of emit_insert(). This takes a payload fragment (array_reg) and transforms it into a vector of native layout, according to a vector_layout struct.
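Stripped of details, the interface amounts to something like this (placeholder signatures only, not the actual declarations from the branch):

/* Placeholder outline only -- not the real declarations, just the shape of
 * the interface described above. */

struct vector_layout;   /* How a vector has to be laid out in the payload:
                         * chosen based on the hardware restrictions of the
                         * message and the current SIMD mode. */

/* Rearrange an EU-native vector (fs_reg) into the form the shared unit
 * expects -- an unzip, a strided AoS-to-SoA copy, or no work at all,
 * depending on the layout. */
array_reg emit_insert(const vector_layout &layout,
                      const fs_reg &src, unsigned components);

/* The converse: turn a payload fragment (array_reg) returned by the shared
 * unit back into a vector of EU-native layout. */
fs_reg emit_extract(const vector_layout &layout,
                    const array_reg &src, unsigned components);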
>>
>> My intention was to come up with an interface more general than SIMD16 vector splitting and SIMD4x2 to SIMD8 conversion alone; I was also thinking of texturing instruction payload quirks (which might require padding or different permutations of the same values depending on the generation even if the SIMD mode is natively supported -- let's call this 2.2) and framebuffer writes (2.3).
>
> Thank you for pointing out what else you were trying to do with it. I understand that you're trying to solve a dozen different problems and that the infrastructure you've created is one that you think solves all of them at the same time. It helps to know what all problems you were trying to solve. However, since we are putting SIMD4x2 aside for now, in the interest of keeping us from getting side-tracked, let's table this for now. It's a problem that may need solving, but it's a *different* problem.
>
If it can be solved with the same approach, it is a closely related problem, isn't it? And wouldn't it be a bit silly to knowingly implement a less general solution when the more general solution involves *less* work, and when we know we'll want to address the remaining problems solved by the more general solution eventually? ;)
>>> 3) Array reg. This is something of a helper to (2).
>>
>> In fact the primary motivation for array_reg wasn't instruction splitting, but emit_collect() -- more on that later.
>>
>>> I agree that it's nice to be able to have access to the size of a register without having to look at the allocator. The problem here (and why we don't have a size in fs_reg) is that we copy the register around. Every register does have some sort of "underlying storage" that it shares with other registers with the same number. But that storage is represented by the size in the allocator and nothing else. Is that bad? Maybe. Should we have a way of passing the size around with it? Maybe. In any case, I think it's best to table this discussion until we've talked a bit about (2) because I think the resolution there will make (3) more clear.
>>>
>> I think the mechanism we use to determine the size of a register hardly ever involves the allocator, and there are good reasons not to rely on it: It only works for VGRFs, can easily break with register coalescing, and it makes "slicing" and other zero-copy transformations of registers difficult (e.g. taking an individual element out of a VGRF array without copying it into a one-component register). Typically the size of a register is implicitly assumed by the instruction using it, and where it's not (because the instruction expects a variable-length payload) it is provided separately from the register via "mlen". Neither of these two options works here, as will be obvious shortly.
>
> Understood.
>
>>> One note on 3 before we table it. People have been pulling emit_collect out as a straw man and beating on it, but I really do like it in principle. There are piles of times where you want a payload and you have to allocate an array, do stuff, and then put it into the payload, and it's really annoying. It would be much easier if we had a helper that just did it all in one function, and that's what emit_collect tries to do. *Thank you* for finally trying to solve that problem and make payloads less painful! However, I do wish it were implemented a bit differently, but I'll make those comments on the patch itself.
>>>
>> Yeah. I think it was Matt who mentioned in the call that emit_collect() is really complex. I completely agree with that statement. It needs to take care of a bunch of things to build the payload correctly:
>>
>> - Allocate a register that will hold the result, calculating the correct total size based on the amount of data we want to concatenate.
>>
>> - Calculate the correct header size.
>>
>> - Allocate and release memory of the correct size to hold the array of LOAD_PAYLOAD sources, also based on the amount of data we want to concatenate.
>>
>> - Split up the arguments to be passed to LOAD_PAYLOAD as sources into scalars of the correct width (which depends on whether the argument is part of the header or not, and what the dispatch width is), and initialize the array.
>>
>> - Create a LOAD_PAYLOAD instruction with the result of the previous steps.
>>
>> Each one of these tasks is non-trivial, repetitive, can conceivably go wrong, and was until now duplicated wherever we had to build a payload -- obscuring the actually meaningful work being done. This is why I think that the assertion that emit_collect() makes code difficult to understand is backwards. Thanks to emit_collect() the task of building a payload is tremendously simplified.
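To give an idea of what that buys at the call site, a typical surface write ends up looking roughly like this (sketch only -- the argument list of emit_collect() and the surrounding helper and variable names are made up for the example):

/* Rough sketch -- emit_surface_header() and the exact emit_collect()
 * signature are made up for illustration. */

/* Concatenate the message header, the surface coordinates and the data to
 * be written into a single payload in one call. */
const array_reg payload =
   emit_collect(emit_surface_header(sample_mask),   /* header */
                addr,                               /* coordinates */
                src);                               /* data to write */

/* The result carries its own size, so the caller can set mlen (or chain the
 * result into another emit_collect()) without recomputing anything. */
fs_inst *inst = emit(SHADER_OPCODE_UNTYPED_SURFACE_WRITE, dst, payload, surface);
inst->mlen = payload.size;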
> Yes. I really don't think people are disagreeing with you too badly about emit_collect(). I also don't think it's as complicated as you make it sound.
>
Heh, maybe "tremendously" wasn't the word I was looking for, but it's appreciably simplified. :P

>> For emit_collect() to be possible there has to be some way to represent a register+size pair, because it needs to know the size of each argument, and because its result is itself a register+size pair -- the caller cannot do anything useful with a payload of unknown size, unless it re-calculates the size of the result, which would defeat much of the purpose of emit_collect().
>>
>> The biggest benefit of being able to represent a register+size pair as a value is that emit_collect() can easily be composed with other functions (including with itself), so the construction of specific chunks of the payload can easily be decoupled and re-used. My code takes advantage of this feature extensively in order to share code that constructs surface message headers or transforms some value into the form expected by the shared unit (doing vector splitting or a strided copy).
>
> Yes, I understand that it's useful for that. Hence the suggestion of either a reg/size pair struct or passing the size as an argument. The struct has the advantage, as you said, of allowing chaining.
>
>>> ### SIMD16 Instruction Splitting ###
>>>
>>> SIMD16 instruction splitting is an unfortunate fact of our hardware. There are a variety of times we have to do it, including dual-source FB writes, some texturing, math ops on older gens, and maybe another place or two. Classically, this has been done in one of two places: the visitor as we emit the code, or the generator. The problem with doing it in the generator is that we can't schedule it and, if it involves a payload, it's not really possible. The result is that we usually do it in the visitor these days.
>>>
>>> Unfortunately, even in the visitor, it's gen-specific and annoying. It gets even worse when you're working with something such as the untyped surface read/write messages that work with multi-component values that have to be zipped/unzipped, to use Curro's terminology. Curro came up with some helper functions to make it substantially less annoying, but it still involves nasty looping.
>>>
>>> At some point in the past I proposed a completely different and more general solution to this problem. Unfortunately, while I've talked to Matt & Ken about it, it's never really been discussed all that publicly, so Curro may not be aware of it. I'm going to lay it out here for Curro's sake as well as for the sake of public record.
>>>
>>> The solution involves first changing the way we handle sends into a two-step process. First, we emit a logical instruction that contains all of the data needed for the actual instruction. Then, we convert from the logical to the actual in a lowering pass. Take, for example, FB writes, with which I am fairly familiar. We would first emit a logical FS_FB_WRITE_# instruction that has separate sources for color, depth, replicated alpha, etc. Then, in the lower_fb_writes pass (which we would have to implement), we would construct the payload from the sources provided on the logical instruction and emit the actual LOAD_PAYLOAD and FB_WRITE instructions. This lower_fb_writes function would then get called before the optimization loop so that the rest of the optimization code would never see it.
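(Just to check that I'm following, the lowering you describe would look roughly like this? Sketch only, the logical opcode and the helpers below are made up:)

/* Hypothetical sketch -- FS_OPCODE_FB_WRITE_LOGICAL and the helpers used
 * below don't exist, this is just the shape of the pass as I understand
 * it. */
void
fs_visitor::lower_fb_writes()
{
   foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
      if (inst->opcode != FS_OPCODE_FB_WRITE_LOGICAL)
         continue;

      /* Gather the separate logical sources (color, depth, replicated
       * alpha, ...) into an array of LOAD_PAYLOAD sources, adding the
       * gen-specific message header where required. */
      fs_reg sources[MAX_FB_WRITE_SOURCES];
      const unsigned n = setup_fb_write_sources(inst, sources);

      /* Build the actual payload in front of the instruction... */
      const fs_reg payload = build_load_payload(block, inst, sources, n);

      /* ...and replace the logical instruction with the real send. */
      fs_inst *write = emit_fb_write(block, inst, payload, n);
      write->eot = inst->eot;
      inst->remove(block);
   }

   invalidate_live_intervals();
}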
>>>
>>> Second, we add a split_instruction helper that would take a SIMD16 instruction and naively split it into two SIMD8 instructions. Such a helper really shouldn't be that hard to write. It would have to know how to take a SIMD16 vec4 and unzip it into two SIMD8 vec4's, but that shouldn't be bad. Any of these new logical send instructions would have their values as separate sources, so they should be safe to split.
>>>
>>> Third, we add a lower_simd16_to_simd8 pass that walks the instructions, picks out the ones that need splitting, and calls split_instruction on them. All of the gen-specific SIMD8 vs. SIMD16 knowledge would be contained in this one pass. This pass would happen between actually emitting code and running lower_fb_writes (or whatever other send lowering passes we have).
>>>
>>> Finally, and this is the icing on the cake, we write a lower_simd32_to_simd16 pass that goes through and lowers all SIMD32 instructions (that act on full-sized data types) to SIMD16 instructions. Once the rest of the work is done, we get this pass, and with it SIMD32 mode, almost for free.
>>>
>>> I know this approach looks like more work and, to be honest, it may be. However, I think it makes a lot of things far more straightforward. In particular, it means that people working on the visitor code don't have to think about whether or not an instruction needs splitting. You also don't have to deal with the complexity of zipping/unzipping sources every time. Instead, we put all that code in one place and get to stop thinking about it. Also, if we *ever* want to get SIMD32, we will need some sort of automatic instruction splitting, and this seems like a reasonable way to do it.
>>>
>>> I've talked to Ken about this approach and he's 100% on board. I don't remember what Matt thinks of it. If we like the approach, then we should just split the tasks up and make it happen. It's a bit of refactoring but it shouldn't be terrible. If we wanted to demo it, I would probably suggest starting with FB writes, as those are fairly complex yet self-contained. They also have a case where we do split an instruction, so it would be interesting to see what the code looks like before and after.
>>>
>> I generally like your proposal. I guess the question we need to answer is whether we want this complexity to be in a lowering pass or in a helper function used to build the send-like instruction -- in either case we need code to handle zipping and unzipping of SIMD16 vectors; it's just a question of whether this code is called by a lowering pass or higher up in the visitor.
>>
>> I can think of several benefits of the approach you propose over mine:
>>
>> - It's more transparent for the visitor code emitting the message -- I completely agree with you that the explicit loops are rather ugly.
>>
>> - Instructions with explicit separate sources are likely to be more suitable for certain optimization passes. Pull constant loads use a similar approach with an expression-style opcode which is at some point lowered to a load payload and send message. This may not be terribly important at this point because of the optimizations already performed in GLSL IR and NIR and due to the nature of the majority of opcodes that don't support SIMD16, but it still seems appealing.
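(And for what it's worth, I'd expect the SIMD16 lowering pass itself to be little more than the following -- again just a sketch, the predicate and the splitting helper are hypothetical:)

/* Hypothetical sketch of the pass structure -- needs_simd8_splitting() and
 * split_instruction() are made up. */
bool
fs_visitor::lower_simd16_to_simd8()
{
   bool progress = false;

   foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
      /* All of the gen-specific SIMD8-vs-SIMD16 knowledge would live in
       * this one predicate. */
      if (inst->exec_size == 16 && needs_simd8_splitting(devinfo, inst)) {
         /* Emit two SIMD8 copies of the instruction, unzipping any vec4
          * sources and zipping the two destination halves back together. */
         split_instruction(block, inst, 2);
         inst->remove(block);
         progress = true;
      }
   }

   if (progress)
      invalidate_live_intervals();

   return progress;
}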
> There's one more really big one that you missed:
>
> It scales! We can't afford to have a for loop for every ADD and MUL instruction. Sure, we might be able to afford it on sends, but not for everything.
>
Well, I doubt you'd want to implement your proposal to the letter for non-send instructions: For those we don't need separate lowered and non-lowered instructions, because there's no payload to assemble, so we can just make do with a single opcode with different execution widths. When you take payloads out of the picture, all the disadvantages of the lowering pass approach mentioned below no longer apply, so I totally agree with you that we want a general lowering pass to lower instructions that expect their arguments as separate sources in their final form. In any case send-like instructions need a somewhat different (and less scalable) treatment.

>> Some disadvantages come to my mind too:
>>
>> - It increases the amount of work required to add a new send-like instruction, because you need lowered and unlowered variants for each.
>>
>> - It seems tricky to get right when splitting an instruction in half involves changing the actual contents of the payload beyond zipping and unzipping its arguments -- this might not seem like a big deal right now, but it will be a problem when we implement SIMD32. The surface messages that take a sample mask as argument are a good example, because they only have 16 bits of space for it, so you actually need to provide different values depending on the "slot group" the message is meant for. This can be worked around easily in the visitor by shifting the sample mask register, but it seems harder to fix up later.
>
> Why do sample masks need to be part of the logical instruction? Can't we figure that out when we lower from logical to physical based on the quarter control?
>
You can surely do anything you want during the logical-to-physical conversion, including rewriting the header; the problem is that it forces you to have a pile of message-specific handling code in the lowering pass. How are you planning to address that? With a separate lowering pass for each message opcode, or a general one with message-specific knowledge?

>> - My intention was to handle vector layout peculiarities beyond SIMDN-to-SIMD8 splitting under a consistent framework, leaving the decision of which quirks to apply under the control of the helper function that builds the message. With the lowering pass approach you'd either need message-specific lowering passes, message-specific code in a general lowering pass, or some way for the builder code to pass vector_layout-like metadata to the lowering pass -- in the first two cases the policy on how to format each source in the payload would be "delocalized" between the visitor and the lowering pass, which I don't particularly like.
>
> We already do that and we would continue to. The logical instruction wouldn't be much more than a serialized version of the helper function.
>
> Thanks for your reply.
> --Jason
>
>>> Thoughts?
>>>
>>> Question for Curro: Supposing for the moment that we decide to do SIMD16 -> SIMD8 splitting this way, how much of the array_reg stuff would still be needed for image_load_store?
>>>
>> I think I've already answered this question, but I'll answer it again here for clarity -- it would still help with building and passing around payloads concisely using emit_collect(), regardless of how we implement SIMD16 instruction splitting.
>>
>>> I hope this e-mail provides us with some good talking points and a way forward.
>>> --Jason
>>
>> Thanks Jason!
>>
>> [1] http://cgit.freedesktop.org/~currojerez/mesa/log/?h=image-load-store