On Mon, Apr 9, 2018 at 10:25 AM, Jason Ekstrand <ja...@jlekstrand.net> wrote:
> On Mon, Apr 9, 2018 at 5:35 AM, Rob Clark <robdcl...@gmail.com> wrote:
>>
>> On Mon, Apr 9, 2018 at 1:35 AM, Jason Ekstrand <ja...@jlekstrand.net> wrote:
>> > Rather lively discussion we've got going here...
>> >
>> > On Sun, Apr 8, 2018 at 3:23 PM, Rob Clark <robdcl...@gmail.com> wrote:
>> >>
>> >> On Sun, Apr 8, 2018 at 5:54 PM, Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
>> >> > On Sun, Apr 8, 2018 at 11:40 PM, Rob Clark <robdcl...@gmail.com> wrote:
>> >> >> On Sun, Apr 8, 2018 at 5:20 PM, Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl> wrote:
>> >> >>>>>>>>>>> +
>> >> >>>>>>>>>>> +   /** The mode of the underlying variable */
>> >> >>>>>>>>>>> +   nir_variable_mode mode;
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> In fact, it seems like deref->mode is unused outside of nir_print and nir_validate.. for logical addressing we can get the mode from the deref_var->var at the start of the chain, and deref->mode has no meaning for physical addressing (where the mode comes from the pointer).
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> So maybe just drop deref->mode?
>> >> >>>>>>>>>
>> >> >>>>>>>>> Isn't it still useful with logical addressing in case a var is not immediately available? (think VK_KHR_variable_pointers)
>> >
>> > Yes, the idea here is to basically be the NIR equivalent of storage classes. For cases where you can chase it all the way back to the variable, this is pointless. For cases where you can't, this can be very useful.
>> >
>> >> >>>>>>>> not sure, maybe this should just also use fat-pointers like physical addressing does??
>> >
>> > I *think* (without knowing the future with perfect accuracy) that I'd like us to start moving away from fat pointers in SPIR-V -> NIR. They've been working ok thus far, but we have already had complaints from the radv guys that they don't want fat pointers for SLM, and I'm not sure they're that great of a fit for other things. The big reason is that it's actually fairly painful to change from one fat pointer scheme to another if whatever the spirv_to_nir authors picked doesn't really fit your hardware.
>> >
>> > I'm not 100% sure what to do, but the idea I've got at the moment is something like this:
>> >
>> > 1) For every mode (or storage class), the driver provides to spirv_to_nir a glsl_type which is the type it wants the pointer represented as.
>> > 2) Add a deref_to_pointer intrinsic to convert the deref to that type and use casts to convert the fat pointer to a deref again.
>> >
>> > Why the deref_to_pointer intrinsic? Because right now some annoying details about nir_intrinsic_info force us to have all derefs have a single component. If we didn't, then nir_intrinsic_store_var would have two variable-size sources which aren't the same size. We could modify the rules, add a concept of a "deref source" to nir_intrinsic_info, and say that a deref source can be any size. That would also work and may actually be a better plan. I'm open to other suggestions as well.
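Just to check that I'm reading (1) right -- you mean the driver hands vtn something like a per-mode hook that picks the storage type, roughly along these lines? (completely made-up names, this is not an actual spirv_to_nir interface, just to illustrate the shape of it):

   #include "nir.h"   /* nir_variable_mode + the glsl_type helpers from nir_types.h */

   /* Hypothetical driver callback: for each variable mode, return the
    * glsl_type the driver wants pointers of that mode represented as.
    */
   static const struct glsl_type *
   example_pointer_type_for_mode(nir_variable_mode mode)
   {
      switch (mode) {
      case nir_var_shader_storage:
         /* e.g. radv: an SSBO pointer is just a 64-bit address */
         return glsl_uint64_t_type();
      case nir_var_shared:
         /* SLM could be a 32-bit offset from a driver-known base */
         return glsl_uint_type();
      default:
         /* or fall back to a uvec2 "fat pointer": .x = offset, .y = mode */
         return glsl_vector_type(GLSL_TYPE_UINT, 2);
      }
   }

and then deref_to_pointer / the casts just convert between derefs and whatever type that hook returned?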
>> >
>> > The key point, however, is (1) where we just let the driver choose the storage type of pointers, and then they can have whatever lowering pass they want. If that's a "standard" fat-pointers pass in nir_lower_io, so be it. If they want to translate directly into LLVM derefs, they can do that too, I suppose.
>> >
>> >> >>>>>>>>> Also I could see this being useful in physical addressing too, to avoid all passes working with derefs needing to do the constant folding?
>> >
>> > Constant folding is cheap, so I'm not too worried about that. The bigger thing that actual derefs gain us is the ability of the compiler to reason about derefs. Suppose, for instance, that you had the following:
>> >
>> > struct S {
>> >    float a;
>> >    float b;
>> > };
>> >
>> > layout(set = 0, binding = 0) Block {
>> >    S arr[10];
>> > } foo;
>> >
>> > void
>> > my_func(int i)
>> > {
>> >    foo.arr[i].a = make_a_value();
>> >    foo.arr[i].b = make_another_value();
>> >    use_value(foo.arr[i].a);
>> > }
>> >
>> > If everything is lowered to fat pointers, the compiler can't very easily tell that the second line of my_func() didn't stomp foo.arr[i].a and so it has to re-load the value in the third line. With derefs, we can easily see that the second line didn't stomp it and just re-use the result of the make_a_value() call. Yes, you may be able to tell the difference between foo.arr[i].a and foo.arr[i].b with some fancy analysis and arithmetic tracking, but it'd be very painful to do. I've given this substantial thought for almost two years now and haven't actually come up with a good way to do it without the information provided by derefs.
>> >
>> > The biggest thing that I think we will gain from deref instructions isn't fancy syntax sugar and better nir_printability of pointers. It also isn't the removal of explicit fat pointers in spirv_to_nir (though that is also nice). The biggest thing I'm going for is improving NIR's ability to optimize UBO, SSBO, and SLM access. We're pretty good today for local variables (though there are a couple of known places for improvements) but we're utterly rubbish for anything that leaves the current shader invocation. If we want competent compute performance (which we do!) we need to solve that problem.
>> >
>> >> >>>>>>>> The problem is that you don't necessarily know the type at compile time (and in the case where you do, you need to do constant folding to figure it out)
>> >> >>>>>>>
>> >> >>>>>>> So I have two considerations here:
>> >> >>>>>>>
>> >> >>>>>>> 1) for vulkan you always know the mode, even when you don't know the var.
>> >> >>>>>>> 2) In CL the mode can still get annotated in the source program (CL C non-generic pointers) in cases in which we cannot reasonably figure it out with just constant folding. In those cases the mode is extra information that you really lose.
>> >> >>>>>>
>> >> >>>>>> so, even in cl 1.x, you could do things like 'somefxn(foo ? global_ptr : local_ptr)'.. depending on how much we inline all the things, that might not get CF'd away.
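fwiw, the easier version of this to write down is with CL 2.x generic pointers -- purely illustrative, not an example from the spec:

   /* Which memory 'p' points at inside accumulate() depends on a runtime
    * condition, so unless everything gets inlined and the select constant
    * folded, the compiler can't statically pick a global vs. local load
    * for p[i].
    */
   float accumulate(const int *p, int n)   /* generic pointer in CL 2.x */
   {
      float sum = 0.0f;
      for (int i = 0; i < n; i++)
         sum += p[i];
      return sum;
   }

   kernel void k(global int *g, local int *l, int use_local, int n,
                 global float *out)
   {
      *out = accumulate(use_local ? (const int *)l : (const int *)g, n);
   }

Unless the compiler can prove which one it is, every access through 'p' needs either a hw flat/generic load or a runtime branch.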
>> >> >>>
>> >> >>> How does this even work btw? somefxn has a definition, and the definition specifies a mode for the argument right? (which is implicitly __private if the app does not specify anything?)
>> >> >>
>> >> >> iirc, the cl spec has an example something along these lines..
>> >> >>
>> >> >> it doesn't require *physical* storage for anything where you don't know what the ptr type is, however.. so fat ptrs in ssa space works out
>> >> >>
>> >> >>>>>
>> >> >>>>> But something like
>> >> >>>>>    __constant int *ptr_value = ...;
>> >> >>>>>    store ptr in complex data structure.
>> >> >>>>>    __constant int *ptr2 = load from complex data structure.
>> >> >>>>>
>> >> >>>>> Without explicitly annotating ptr2 it is unlikely that constant folding would find that ptr2 is pointing to the __constant address space. Hence removing the modes loses valuable information that you cannot get back by constant folding. However, if you have a pointer with unknown mode, we could have a special mode (or mode_all?) and you can use the uvec2 representation in that case?
>> >
>> > I've thought about using a mode of 0 for this, or maybe letting you OR all the possible modes together since nir_variable_mode is, after all, a bitfield. I don't have any huge preference at the moment.
>> >
>> >> >>>> hmm, I'm not really getting how deref->mode could magically have information that fatptr.y doesn't have.. if the mode is known, vtn could stash it in fatptr.y and everyone is happy? If vtn doesn't know this, then I don't see how deref->mode helps..
>> >> >>>
>> >> >>> You mean insert it into the fatptr every time deref_cast is called?
>> >> >>>
>> >> >>> Wouldn't that blow up the IR size significantly for very little benefit?
>> >> >>
>> >> >> in an easy to clean up way, so meh?
>> >> >
>> >> > We can't clean it up if we want to keep the information. Also nir is pretty slow to compile already, so I'd like not to add a significant number of instructions for very little benefit.
>> >
>> > Really? I mean, I can believe it, but do you have any actual numbers to back this up? It's considerably faster than some other IRs we have in mesa, though they are known to be pretty big pigs if we're honest. I'm very open to any suggestions on how to improve compile times. If NIR is fundamentally a pig, we should fix that.
>> >
>> > I don't think compile time should be the deciding factor in whether or not we use fat pointers, as I doubt it will have a huge effect. However, how much work it takes to get at information is a reasonable argument.
>> >>
>> >> I guess I'm failing to see what information you'd lose.. or how there would be a problem..
>> >>
>> >> I'm not strictly against having nir_var_unknown as a possible value for deref->mode, but I'm not really seeing how that helps anything, other than adding extra complexity to the IR..
>> >>
>> >> >>>>>>
>> >> >>>>>> I think I'm leaning towards using fat ptrs for the vk case, since I guess that is a case where you could always expect nir_src_as_const_value() to work, to get the variable mode.
>> >> >>>>>> If for no other reason than I guess these derefs, if the var is not known, start w/ deref_cast, and it would be ugly for deref_cast to have to work differently for compute vs vk. But maybe Jason already has some thoughts about it?
>> >
>> > I don't see why they would be different. Vulkan and CL are starting to converge. In vulkan, you already can't always chase the pointer back to the variable when VK_KHR_variable_pointers is used. I don't think that CL is as special as you think it is. The only thing special about CL is the fact that global and local are in the same address space. Beyond that, all the problems we have to solve are fundamentally the same.
>> >
>> >> >>>>> I'd like to avoid fat pointers altogether on AMD since we would not use it even for CL. A generic pointer is just a uint64_t for us, with no bitfield in there for the address space.
>> >
>> > I tend to agree with this. (See above discussion).
>> >
>> >> >>>>> I think we may need to think a bit more about representation however, as e.g. for AMD a pointer is typically 64-bits (but we can do e.g. 32-bits for known workgroup pointers), the current deref instructions return 32-bit, and you want something like a uvec2 as pointer representation?
>> >> >>>>
>> >> >>>> afaiu, newer AMD (and NV) hw can remap shared/private into a single global address space.. But I guess that is an easy subset of the harder case where drivers need to use different instructions.. so a pretty simple lowering pass run before lower_io could remap things that use fatptrs into something that ignores fatptr.y. Then opt passes make fatptr.y go away. So both AMD and hw that doesn't have a flat address space are happy.
>> >> >>>
>> >> >>> But then you run into other issues, like how are you going to stuff a 64-bit fatptr.x + a ?-bit fatptr.y into a 64-bit value for Physical64 addressing? Also this means we have to track the sources back to the cast/var any time we want to do anything at all with any deref, which seems less efficient to me than just stuffing the deref in there.
>> >> >>
>> >> >> so fat ptrs only have to exist in ssa space, not be stored to something with a physically defined size..
>> >> >
>> >> > how does storing __generic pointers work then? those still need the fat bit for your hw right?
>> >>
>> >> for hw that can't map everything into a single flat address space, you *can't* store a fat pointer.. oh well, it doesn't mean that that hw can't implement lesser ocl versions (since this is defn not required for ocl 1.x and I don't *think* it is required for 2.x, but I haven't checked the spec.. I guess if intel supports 2.x then it isn't required..)
>> >
>> > You can still carve up the address space and put local memory at some absurdly high address and do
>> >
>> >    if (ptr > 0xffffffffffff0000)
>> >       store_local(ptr & 0xffff)
>> >    else
>> >       store_global(ptr)
>> >
>> That might be an option if we need to physically store a fat pointer (but I'm not convinced we need to). But as a general solution it is a horrible idea. You either lose information (since when was TMI a problem for compilers?) or you get to teach constant folding that even though it doesn't know the value of 'a' it does know the value of 'a & 0xffffffffffff0000', which is insanity.
>>
>> And simply not optimizing away the global/local/private turns every load/store into something that looks like:
>>
>>    cmps.f.lt p0.x, c4.x, r0.z
>>    ; delay 6 instructions so branch can see the value in p0.x
>>    (rpt5)nop
>>    br !p0.x, #3
>>    ldg.u32 r0.w, r0.y, 1
>>    jump #11
>>    (jp)cmps.f.lt p0.x, c4.y, r0.z
>>    (rpt5)nop               ; delay 6 instruction slots
>>    br !p0.x, #8
>>    ldl.u32 r0.w, r0.y, 1
>>    jump #3
>>    ; private is really just global with a driver provided buffer
>>    ; and compiler computed offset
>>    (jp)add.s r0.y, c10.x   ; add base address of buffer
>>    (rpt3)nop               ; alu results not immediately avail and no other instr to fill the empty slots
>>    add.s r0.y, ...         ; offset calculated from local_id
>>    (rpt5)nop               ; 6 slots for result from alu avail to mem instructions
>>    ldg.u32 r0.w, r0.y, 1
>>    (jp)(sy)... and now we have a result..
>>
>> which is simply not an option! And the result is actually worse since we end up with 64b pointer math lowered to 32b instructions!
>>
>> tbh, I *really* don't see the problem with fat pointers, but if someone else doesn't want deref_cast to use fat pointers, then I must insist on a different intrinsic which does, and a compiler option for vtn to choose which to emit.
>
> I'm sorry. That comment was NOT intended as a suggestion for a universal alternative to fat pointers. It was merely to demonstrate how an architecture like Intel's, which uses different messages for local vs. global, could implement physical 64-bit (uint64_t) generic pointers. Someone (looks like Rob from the indentation) made a comment about difficulties implementing OpenCL 2.x and I was just trying to show that it could be done. Of course, stuffing the extra information in a vector component is easier.
>
> This was a side comment to what I thought was a side comment in the discussion.
>
oh, my bad, I misinterpreted ;-)  yeah, this makes sense for the special case where you need storage, at least for the 64b memory model.

BR,
-R