On Mon, Apr 9, 2018 at 5:35 AM, Rob Clark <robdcl...@gmail.com> wrote:
> On Mon, Apr 9, 2018 at 1:35 AM, Jason Ekstrand <ja...@jlekstrand.net> wrote:
> > Rather lively discussion we've got going here...
> >
> > On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <robdcl...@gmail.com> wrote:
> >> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
> >> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <robdcl...@gmail.com> wrote:
> >> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
> >> >>>>>>>>>>> +
> >> >>>>>>>>>>> +   /** The mode of the underlying variable */
> >> >>>>>>>>>>> +   nir_variable_mode mode;
> >> >>>>>>>>>>
> >> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of nir_print and nir_validate.. for logical addressing we can get the mode from the deref_var->var at the start of the chain, and deref->mode has no meaning for physical addressing (where the mode comes from the pointer).
> >> >>>>>>>>>>
> >> >>>>>>>>>> So maybe just drop deref->mode?
> >> >>>>>>>>>
> >> >>>>>>>>> Isn't it still useful with logical addressing in case a var is not immediately available? (think VK_KHR_variable_pointers)
> >
> > Yes, the idea here is to basically be the NIR equivalent of storage classes. For cases where you can chase it all the way back to the variable, this is pointless. For cases where you can't, this can be very useful.
> >
> >> >>>>>>>> not sure, maybe this should just also use fat-pointers like physical addressing does??
> >
> > I *think* (without knowing the future with perfect accuracy) that I'd like us to start moving away from fat pointers in SPIR-V -> NIR. They've been working ok thus far but we already have had complaints from the radv guys that they don't want fat pointers for SLM and I'm not sure they're that great of a fit for other things. The big reason is that it's actually fairly painful to change from one fat pointer scheme to another if whatever the spirv_to_nir authors picked doesn't really fit your hardware.
> >
> > I'm not 100% sure what to do, but the idea I've got at the moment is something like this:
> >
> > 1) For every mode (or storage class), the driver provides to spirv_to_nir a glsl_type which is the type it wants the pointer represented as.
> > 2) Add a deref_to_pointer intrinsic to convert the deref to that type and use casts to convert the fat pointer to a deref again.
> >
> > Why the deref_to_pointer intrinsic? Because right now some annoying details about nir_intrinsic_info force us to have all derefs have a single component. If we didn't, then nir_intrinsic_store_var would have two variable-size sources which aren't the same size. We could modify the rules, add a concept of a "deref source" to nir_intrinsic_info, and say that a deref source can be any size. That would also work and may actually be a better plan. I'm open to other suggestions as well.
> >
> > The key point, however, is (1), where we just let the driver choose the storage type of pointers and then they can have whatever lowering pass they want. If that's a "standard" fat-pointers pass in nir_lower_io, so be it. If they want to translate directly into LLVM derefs, they can do that too, I suppose.
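For concreteness, a rough sketch of what (1) could look like from the driver's side; the struct and field names below are made up for illustration and are not an existing spirv_to_nir interface:

    /* Hypothetical per-mode pointer-type options a driver might hand to
     * spirv_to_nir.  Only the shape of the idea matters; none of these
     * fields exist today. */
    struct glsl_type;

    struct hypothetical_spirv_pointer_options {
       const struct glsl_type *ubo_ptr_type;      /* e.g. a uvec2: (block index, offset) */
       const struct glsl_type *ssbo_ptr_type;     /* e.g. a uvec2: (block index, offset) */
       const struct glsl_type *shared_ptr_type;   /* e.g. a single 32-bit offset */
       const struct glsl_type *global_ptr_type;   /* e.g. a 64-bit address */
    };

A deref_to_pointer-style intrinsic would then produce a value of the chosen type from a deref, and a cast would turn such a value back into a deref, so the concrete pointer representation stays entirely in the driver's hands.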
> >
> >> >>>>>>>>> Also I could see this being useful in physical addressing too, to avoid all passes working with derefs needing to do the constant folding?
> >
> > Constant folding is cheap, so I'm not too worried about that. The bigger thing that actual derefs gain us is the ability of the compiler to reason about derefs. Suppose, for instance, that you had the following:
> >
> > struct S {
> >    float a;
> >    float b;
> > };
> >
> > layout(set = 0, binding = 0) Block {
> >    S arr[10];
> > } foo;
> >
> > void
> > my_func(int i)
> > {
> >    foo.arr[i].a = make_a_value();
> >    foo.arr[i].b = make_another_value();
> >    use_value(foo.arr[i].a);
> > }
> >
> > If everything is lowered to fat pointers, the compiler can't very easily tell that the second line of my_func() didn't stomp foo.arr[i].a and so it has to re-load the value in the third line. With derefs, we can easily see that the second line didn't stomp it and just re-use the result of the make_a_value() call. Yes, you may be able to tell the difference between foo.arr[i].a and foo.arr[i].b with some fancy analysis and arithmetic tracking, but it'd be very painful to do. I've given this substantial thought for almost two years now and haven't actually come up with a good way to do it without the information provided by derefs.
> >
> > The biggest thing that I think we will gain from deref instructions isn't fancy syntax sugar and better nir_printability of pointers. It also isn't the removal of explicit fat pointers in spirv_to_nir (though that is also nice). The biggest thing I'm going for is improving NIR's ability to optimize UBO, SSBO, and SLM access. We're pretty good today for local variables (though there are a couple of known places for improvements) but we're utterly rubbish for anything that leaves the current shader invocation. If we want competent compute performance (which we do!) we need to solve that problem.
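To spell out the reasoning that derefs make cheap, here is a minimal stand-alone sketch (hypothetical types, not the real NIR API): two access paths rooted at the same variable that select different constant struct fields can never alias, even when the array index between them is unknown, which is exactly why the store to foo.arr[i].b above provably does not touch foo.arr[i].a.

    #include <stdbool.h>

    /* One step in an access path rooted at a known variable. */
    enum step_kind { STEP_STRUCT_FIELD, STEP_ARRAY_UNKNOWN_INDEX };

    struct path_step {
       enum step_kind kind;
       unsigned field;              /* only meaningful for STEP_STRUCT_FIELD */
    };

    /* Assumes both paths start at the same variable. */
    static bool
    paths_may_alias(const struct path_step *a, unsigned a_len,
                    const struct path_step *b, unsigned b_len)
    {
       unsigned n = a_len < b_len ? a_len : b_len;
       for (unsigned i = 0; i < n; i++) {
          /* Differing constant field selections prove disjoint storage; an
           * unknown array index proves nothing, so keep walking the chain. */
          if (a[i].kind == STEP_STRUCT_FIELD && b[i].kind == STEP_STRUCT_FIELD &&
              a[i].field != b[i].field)
             return false;
       }
       return true;                 /* otherwise conservatively assume aliasing */
    }

Once the access has been flattened to raw pointer arithmetic, recovering the same fact means reasoning about additions and multiplications instead of doing a short structural walk.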
> >
> >> >>>>>>>> The problem is that you don't necessarily know the type at compile time (and in the case where you do, you need to do constant folding to figure it out)
> >> >>>>>>>
> >> >>>>>>> So I have two considerations here:
> >> >>>>>>>
> >> >>>>>>> 1) for vulkan you always know the mode, even when you don't know the var.
> >> >>>>>>> 2) In CL the mode can still get annotated in the source program (CL C non-generic pointers) in cases in which we cannot reasonably figure it out with just constant folding. In those cases the mode is extra information that you really lose.
> >> >>>>>>
> >> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ? global_ptr : local_ptr)'.. depending on how much we inline all the things, that might not get CF'd away.
> >> >>>
> >> >>> How does this even work btw? somefxn has a definition, and the definition specifies a mode for the argument right? (which is implicitly __private if the app does not specify anything?)
> >> >>
> >> >> iirc, the cl spec has an example something along these lines..
> >> >>
> >> >> it doesn't require *physical* storage for anything where you don't know what the ptr type is, however.. so fat ptrs in ssa space work out
> >> >>>>>
> >> >>>>> But something like
> >> >>>>>    __constant int *ptr_value = ...;
> >> >>>>>    store ptr in complex data structure.
> >> >>>>>    __constant int *ptr2 = load from complex data structure.
> >> >>>>>
> >> >>>>> Without explicitly annotating ptr2 it is unlikely that constant folding would find that ptr2 is pointing to the __constant address space. Hence removing the modes loses valuable information that you cannot get back by constant folding. However, if you have a pointer with unknown mode, we could have a special mode (or mode_all?) and you can use the uvec2 representation in that case?
> >
> > I've thought about using a mode of 0 for this or maybe letting you OR all the possible modes together since nir_variable_mode is, after all, a bitfield. I don't have any huge preference at the moment.
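As a purely illustrative model of the representation being debated here (the names and bit values are made up, not the actual nir_variable_mode enum), the address lives in .x, a mode bitfield lives in .y, and an unknown mode is simply all the candidate modes ORed together:

    #include <stdint.h>

    /* Stand-in for a nir_variable_mode-style bitfield. */
    enum mode_bits {
       MODE_GLOBAL  = 1 << 0,
       MODE_SHARED  = 1 << 1,
       MODE_PRIVATE = 1 << 2,
    };
    #define MODE_UNKNOWN (MODE_GLOBAL | MODE_SHARED | MODE_PRIVATE)

    struct fat_ptr {
       uint64_t addr;   /* .x: address or offset within the mode's space */
       uint32_t mode;   /* .y: mode bits; MODE_UNKNOWN when it can't be proven */
    };

    /* 'cond ? global_ptr : local_ptr' selects whole fat pointers, so the mode
     * information survives even when the condition never constant-folds. */
    static inline struct fat_ptr
    select_ptr(int cond, struct fat_ptr a, struct fat_ptr b)
    {
       return cond ? a : b;
    }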
> >
> >> >>>> hmm, I'm not really getting how deref->mode could magically have information that fatptr.y doesn't have.. if the mode is known, vtn could stash it in fatptr.y and everyone is happy? If vtn doesn't know this, then I don't see how deref->mode helps..
> >> >>>
> >> >>> You mean insert it into the fatptr every time deref_cast is called?
> >> >>>
> >> >>> Wouldn't that blow up the IR size significantly for very little benefit?
> >> >>
> >> >> in an easy to clean up way, so meh?
> >> >
> >> > We can't clean it up if we want to keep the information. Also nir is pretty slow to compile already, so I'd like not to add a significant number of instructions for very little benefit.
> >
> > Really? I mean, I can believe it, but do you have any actual numbers to back this up? It's considerably faster than some other IRs we have in mesa, though they are known to be pretty big pigs if we're honest. I'm very open to any suggestions on how to improve compile times. If NIR is fundamentally a pig, we should fix that.
> >
> > I don't think compile time should be the deciding factor in whether or not we use fat pointers as I doubt it will have a huge effect. However, how much work it takes to get at information is a reasonable argument.
> >>
> >> I guess I'm failing to see what information you'd lose.. or how there would be a problem..
> >>
> >> I'm not strictly against having nir_var_unknown as a possible value for deref->mode, but I'm not really seeing how that helps anything, other than adding extra complexity to the IR..
> >>
> >> >>>>>> I think I'm leaning towards using fat ptrs for the vk case, since I guess that is a case where you could always expect nir_src_as_const_value() to work, to get the variable mode. If for no other reason than I guess these derefs, if the var is not known, start w/ deref_cast, and it would be ugly for deref_cast to have to work differently for compute vs vk. But maybe Jason already has some thoughts about it?
> >
> > I don't see why they would be different. Vulkan and CL are starting to converge. In vulkan, you already can't always chase the pointer back to the variable when VK_KHR_variable_pointers is used. I don't think that CL is as special as you think it is. The only thing special about CL is the fact that global and local are in the same address space. Beyond that, all the problems we have to solve are fundamentally the same.
> >
> >> >>>>> I'd like to avoid fat pointers altogether on AMD since we would not use it even for CL. A generic pointer is just a uint64_t for us, with no bitfield in there for the address space.
> >
> > I tend to agree with this. (See above discussion).
> >
> >> >>>>> I think we may need to think a bit more about representation however, as e.g. for AMD a pointer is typically 64-bits (but we can do e.g. 32-bits for known workgroup pointers), the current deref instructions return 32-bit, and you want something like a uvec2 as pointer representation?
> >> >>>>
> >> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a single global address space.. But I guess that is an easy subset of the harder case where drivers need to use different instructions.. so a pretty simple lowering pass run before lower_io could remap things that use fatptrs into something that ignores fatptr.y. Then opt passes make fatptr.y go away. So both AMD and hw that doesn't have a flat address space are happy.
> >> >>>
> >> >>> But then you run into other issues, like how are you going to stuff a 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for Physical64 addressing? Also this means we have to track the sources back to the cast/var any time we want to do anything at all with any deref, which seems less efficient to me than just stuffing the mode in the deref.
> >> >>
> >> >> so fat ptrs only have to exist in ssa space, not be stored to something with a physically defined size..
> >> >
> >> > how does storing __generic pointers work then? those still need the fat bit for your hw right?
> >>
> >> for hw that can't map everything into a single flat address space, you *can't* store a fat pointer.. oh well, it doesn't mean that that hw can't implement lesser ocl versions (since this is defn not required for ocl 1.x and I don't *think* it is required for 2.x, but I haven't checked the spec.. I guess if intel supports 2.x then it isn't required..)
> >
> > You can still carve up the address space and put local memory at some absurdly high address and do
> >
> >    if (ptr > 0xffffffffffff0000)
> >       store_local(ptr & 0xffff)
> >    else
> >       store_global(ptr)
>
> That might be an option if we need to physically store a fat pointer (but I'm not convinced we need to). But as a general solution it is a horrible idea. You either lose information (since when was TMI a problem for compilers?) or you get to teach constant folding that even though it doesn't know the value of 'a' it does know the value of 'a & 0xffffffffffff0000', which is insanity.
>
> And simply not optimizing away the global/local/private turns every load/store into something that looks like:
>
>    cmps.f.lt p0.x, c4.x, r0.z
>    ; delay 6 instructions so branch can see the value in p0.x
>    (rpt5)nop
>    br !p0.x, #3
>    ldg.u32 r0.w, r0.y, 1
>    jump #11
>    (jp)cmps.f.lt p0.x, c4.y, r0.z
>    (rpt5)nop             ; delay 6 instruction slots
>    br !p0.x, #8
>    ldl.u32 r0.w, r0.y, 1
>    jump #3
>    ; private is really just global with a driver provided buffer
>    ; and compiler computed offset
>    (jp)add.s r0.y, c10.x ; add base address of buffer
>    (rpt3)nop             ; alu results not immediately avail and no
>                          ; other instr to fill the empty slots
>    add.s r0.y, ...       ; offset calculated from local_id
>    (rpt5)nop             ; 6 slots for result from alu avail to mem instructions
>    ldg.u32 r0.w, r0.y, 1
>    (jp)(sy)...           ; and now we have a result..
>
> which is simply not an option! And the result is actually worse since we end up with 64b pointer math lowered to 32b instructions!
>
> tbh, I *really* don't see the problem with fat pointers, but if someone else doesn't want deref_cast to use fat pointers, then I must insist on a different intrinsic which does, and a compiler option for vtn to choose which to emit.

I'm sorry. That comment was NOT intended as a suggestion for a universal alternative to fat pointers. It was merely to demonstrate how an architecture like Intel's, which uses different messages for local vs. global, could implement physical 64-bit (uint64_t) generic pointers. Someone (looks like Rob from the indentation) made a comment about difficulties implementing OpenCL 2.x and I was just trying to show that it could be done. Of course, stuffing the extra information in a vector component is easier. This was a side comment to what I thought was a side comment in the discussion.
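For reference, the carve-up idea spelled out as plain C (the split point, the 64 KiB window, and the helper names are invented for illustration; this is not a description of any real hardware layout) amounts to a range check in front of the mode-specific store:

    #include <stdint.h>

    #define LOCAL_WINDOW_BASE 0xffffffffffff0000ull   /* top 64 KiB reserved for local */

    /* Hypothetical mode-specific stores standing in for the hardware's
     * separate local/global messages. */
    static void store_local(uint32_t offset, uint32_t value)  { (void)offset; (void)value; }
    static void store_global(uint64_t addr, uint32_t value)   { (void)addr;   (void)value; }

    static void
    store_generic(uint64_t ptr, uint32_t value)
    {
       if (ptr > LOCAL_WINDOW_BASE)
          store_local((uint32_t)(ptr & 0xffff), value);   /* offset within the window */
       else
          store_global(ptr, value);
    }

The ir3 listing quoted above is exactly what this branching costs when it cannot be optimized away.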
> >>
> >> >> As far as tracking things to the head of the chain of deref instructions, that is a matter of a simple helper or two. Not like the chain of derefs is going to be 1000's of instructions..
> >
> > Yes and no (mostly no) once you start throwing in phis, selects, and loading pointers from variables. The first two you can explain away by just saying "follow the phi/select sources", but the moment you load a pointer out of a variable, you have to do something. Sure, you could just drop .y and stomp it to the known storage class, but that hardly seems ideal.
> >
> >> >>> Also, what would the something which ignores fatptr.y be? I'd assume that would be the normal deref based stuff, but requiring fatptr contradicts that?
> >> >>
> >> >> if you have a flat address space, maybe a pass (or option for lower_io) to just convert everything to load/store_global (since essentially what these GPUs are doing is remapping shared/private into the global address space)
> >> >
> >> > but I'd like to only do that if we really don't know the mode statically since it is somewhat slower/less flexible. (also radv is unlikely to use nir_lower_io for a lot of this stuff since we can lower derefs into LLVM GEPs directly)
> >>
> >> I mean, we control all the src code, we can add more options to lower_io.
> >>
> >> > Hence if we want the cases where we know the mode statically we need to not lower the fatptr, but then we have the whole fatptr mess.
> >>
> >> well, fatptr mess is unavoidable.. I mostly just fail to see how a more general solution (ie. storing the variable mode in ssa) is worse than having to deal with both cases (in ssa or in instr). In the ssa case the mode is easy to recover since anything that does pointer math has to split out the .y component and fuse it back in after the pointer math. (And if you have a flat address space, I guess I'm failing to see why you care about the mode in the first place.)
> >
> > Yes, fat pointers are inevitable for some hardware at some stage in the compile. That does not, however, mean that we want fat pointers straight out of SPIR-V.
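For reference, the "split out .y, do the math on .x, fuse it back" pattern described above looks roughly like this in a C model (types and names are illustrative, not actual NIR):

    #include <stdint.h>

    struct fat_ptr {
       uint64_t addr;   /* .x: the component pointer math operates on */
       uint32_t mode;   /* .y: the variable mode, never touched by the math */
    };

    static struct fat_ptr
    fat_ptr_add(struct fat_ptr p, int64_t byte_offset)
    {
       p.addr += (uint64_t)byte_offset;   /* arithmetic happens on .x only */
       return p;                          /* .y is "fused back" unchanged */
    }

Since nothing ever writes .y, a later pass that wants the mode only has to look through the arithmetic to the component's original value.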
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev