On Fri, Aug 9, 2013 at 3:48 PM, Christian König <deathsim...@vodafone.de> wrote: > Am 09.08.2013 15:29, schrieb Marek Olšák: >> >> On Fri, Aug 9, 2013 at 10:34 AM, Christian König >> <deathsim...@vodafone.de> wrote: >>> >>> Am 08.08.2013 21:38, schrieb Alex Deucher: >>> >>>> On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák <mar...@gmail.com> wrote: >>>>> >>>>> On Thu, Aug 8, 2013 at 6:57 PM, Christian König >>>>> <deathsim...@vodafone.de> >>>>> wrote: >>>>>> >>>>>> Am 08.08.2013 16:33, schrieb Marek Olšák: >>>>>>> >>>>>>> On Thu, Aug 8, 2013 at 3:09 PM, Christian König >>>>>>> <deathsim...@vodafone.de> >>>>>>> wrote: >>>>>>>> >>>>>>>> Am 08.08.2013 14:38, schrieb Marek Olšák: >>>>>>>> >>>>>>>>> .On Thu, Aug 8, 2013 at 9:47 AM, Christian König >>>>>>>>> <deathsim...@vodafone.de> wrote: >>>>>>>>>> >>>>>>>>>> Am 08.08.2013 02:20, schrieb Marek Olšák: >>>>>>>>>> >>>>>>>>>>> FMASK is bound as a separate texture. For every texture, there >>>>>>>>>>> can >>>>>>>>>>> be >>>>>>>>>>> an FMASK. Therefore a separate array of resource slots has to be >>>>>>>>>>> added. >>>>>>>>>>> >>>>>>>>>>> This adds a new mechanism for emitting resource descriptors, its >>>>>>>>>>> features >>>>>>>>>>> are: >>>>>>>>>>> - resource descriptors are stored in an ordinary buffer (not in a >>>>>>>>>>> CS) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Having resource descriptors outside of the CS has two problems >>>>>>>>>> that >>>>>>>>>> we >>>>>>>>>> need >>>>>>>>>> to solve first: >>>>>>>>>> >>>>>>>>>> 1. Fine grained descriptor updates doesn't work, I already tried >>>>>>>>>> that. >>>>>>>>>> The >>>>>>>>>> problem is that unlike previous asics descriptors are now a memory >>>>>>>>>> block, >>>>>>>>>> so >>>>>>>>>> no longer part of the CP context. So when we (for example) have a >>>>>>>>>> draw >>>>>>>>>> command executing and the next draw command is using new resources >>>>>>>>>> for >>>>>>>>>> a >>>>>>>>>> specific slot we would either block until the first draw command >>>>>>>>>> is >>>>>>>>>> finished >>>>>>>>>> (which is bad for performance) or change the descriptors while >>>>>>>>>> they >>>>>>>>>> are >>>>>>>>>> still in use (which results in VM faults). >>>>>>>>> >>>>>>>>> So what would the proper solution be here? Do I need to flush some >>>>>>>>> caches or would moving the descriptor updates to the constant IB >>>>>>>>> fix >>>>>>>>> that? >>>>>>>> >>>>>>>> >>>>>>>> Actually the current implementation worked better than anything else >>>>>>>> I >>>>>>>> tried. >>>>>>>> >>>>>>>> When you really need the resource descriptors in a separate buffer >>>>>>>> you >>>>>>>> need >>>>>>>> to use one buffer for each draw call and always write the full >>>>>>>> buffer >>>>>>>> contents (no partial updates). Flushing anything won't really help >>>>>>>> either.. >>>>>>>> >>>>>>>> The only solution I see using one buffer is to block until the last >>>>>>>> draw >>>>>>>> call is finished with WAIT_REG_MEM, but that would be quite >>>>>>>> disastrous >>>>>>>> for >>>>>>>> performance. >>>>>>>> >>>>>>>> >>>>>>>>>> 2. If my understand is correct when they are embedded the >>>>>>>>>> descriptors >>>>>>>>>> are >>>>>>>>>> preloaded into the caches while executing the IB, so to archive >>>>>>>>>> the >>>>>>>>>> same >>>>>>>>>> speed with descriptors outside of the IB you need to add >>>>>>>>>> additional >>>>>>>>>> commands >>>>>>>>>> to the constant IB which is new to SI and we currently doesn't >>>>>>>>>> support >>>>>>>>>> in >>>>>>>>>> the CS interface. >>>>>>>>> >>>>>>>>> There seems to be support for the constant IB. The CS ioctl chunk >>>>>>>>> ID >>>>>>>>> is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in >>>>>>>>> si_vm_packet3_ce_check. Is there anything missing? >>>>>>>> >>>>>>>> >>>>>>>> The userspace side seems to be missing and except for throwing NOP >>>>>>>> packets >>>>>>>> into it we never tested it. I know from the closed source side that >>>>>>>> it >>>>>>>> actually was quite tricky for them to get working. >>>>>>>> >>>>>>>> Additional to that please note that I'm not 100% sure that just >>>>>>>> putting >>>>>>>> the >>>>>>>> descriptors into the IB is really helping here. It was just the most >>>>>>>> simplest solution to avoid allocating a new buffer on each draw >>>>>>>> call.. >>>>>>> >>>>>>> I understand. I don't really need to have resource descriptors in a >>>>>>> separate buffer, all I need is these 3 basic features a gallium >>>>>>> driver >>>>>>> should support: >>>>>>> - fine-grained resource updates (mainly for performance, see below) >>>>>>> - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0) >>>>>>> - no GPU crash if a shader is using SAMPLER[15] but there are no >>>>>>> samplers >>>>>>> bound >>>>>>> >>>>>>> FYI, partial sampler view and sampler state updates are coming to >>>>>>> gallium, Brian Paul already has some patches, it's just a matter of >>>>>>> time now. Vertex and constant buffer states already support partial >>>>>>> updates. >>>>>> >>>>>> >>>>>> That shouldn't be to much off a problem. >>>>>> >>>>>> Just allocate a state at startup and initialize it with the proper pm4 >>>>>> commands for 16 samplers, then update the resource descriptors in that >>>>>> state >>>>>> when we change the bound textures/samplers/views/constants/whatever. >>>>>> All >>>>>> we >>>>>> need to do then is setting the emitted state to NULL so that it gets >>>>>> re-emitted in the next draw command. >>>>> >>>>> That would re-emit all 16 shader resources even if just one of them >>>>> needs to be changed. I was trying to avoid this inefficiency. Is it >>>>> really impossible to emit just one resource descriptor and keep the >>>>> others unchanged? This is a basic D3D10/11 feature, for example: >>>>> >>>>> void ID3D11DeviceContext::VSSetShaderResources( >>>>> [in] UINT StartSlot, >>>>> [in] UINT NumViews, >>>>> [in] ID3D11ShaderResourceView *const *ppShaderResourceViews >>>>> ); >>>>> >>>>> If the constant engine is required to implement this interface >>>>> efficiently, then I'd like to work on constant IB support. >>>> >>>> You'll need to either store them in memory or re-emit them if you >>>> store them in the IB. The CE is mainly there so that it can prime the >>>> TC in parallel with the command stream processing. >>> >>> >>> Yeah indeed. The CE is just for prefetching everything into caches and >>> doesn't really help here. >>> >>> The only two options I see is either fully emitting it into the command >>> stream whenever anything changes or allocating a new buffer for the >>> resources on each new draw call, copying over the old state and then >>> setting >>> just the things that changed. Both options have their pro and cons, no >>> idea >>> what might be better. >>> >>> Fact is the resource descriptors are not allowed to change as long as the >>> shaders are running. >> >> I think flushing the TC before changing the descriptors should help. >> If not, then PS_PARTIAL_FLUSH or some other equivalent of WAIT_UNTIL >> should. > > > Hui? Why? > > The scalar engine of the shaders loads the resource descriptors from memory, > so as long as the shaders are running we can't modify the descriptors. So > the only option here is to either have multiple version of the descriptors > around at the same time or wait for the shaders to finish before modifying > the descriptors which means draining the shader pipeline quite a bit. > > Or am I missing something here?
What if I kept the current emission code, and only allocated a new buffer at the end of the emit function, copied all descriptors to it using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The buffer where the descriptors are updated using WRITE_DATA would act as a staging buffer only and shaders would always read from the fresh new copy. Does that sound good to you? Marek _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev