On 2016-08-04 16:39, Christian König wrote:
On 2016-08-04 05:10, zhoucm1 wrote:
On 2016-08-03 21:43, Christian König wrote:
Well that is a clear NAK to this whole approach.
Submitting the recovery jobs to the scheduler is reentrant, because
the scheduler is the one that originally signaled the timeout to us.
We have reset all recovery jobs, right? Couldn't we treat those jobs
the same as the others?
No, they aren't. For recovery jobs you don't want a timeout that
triggers another GPU reset while the first one is still under way.
Why not submit the recovery jobs to the hardware ring directly?
Yeah, that is also what I did at the beginning.
The main reasons are:
0. A recovery job needs to wait at least for its own page table
recovery to complete.
Well, as noted in the other thread we need to recover the GART table
with the CPU anyway.
1. Direct submission uses run_job, which is also used by the
scheduler, so it could introduce conflicts.
The scheduler should be completely stopped during the GPU reset, so
there shouldn't be any other processing.
2. If all VM clients use one SDMA engine, restoring is slow. If each
VM can use its own PTE ring, all SDMA engines are used for them.
A single SDMA engine should be able to max out the PCIe speed in one
direction, no need to offload that to both engines. If we really need
both engines we could also simply handle that in the recovery code as
well.
3. If just one entity recovers all VM page tables, their recovery
jobs have a potential dependency: a later job waits for the one in
front of it. If each VM has its own entity, there is no dependency
between them.
4. If the recovery entity is based on the kernel run queue, the
recovery jobs could be executed at the same time as PT jobs.
Well that's exactly the reason why I don't want to push those jobs
through the scheduler. The scheduler should be stopped during the GPU
reset so that nothing else happens with the hardware.
E.g. when other jobs run concurrently with the recovery jobs, you can
have all kinds of problems, like one SDMA engine doing a recovery
while the other one does a backup of the same BO, etc.
OK, I see what you mean. That means all recovery PT jobs of the PT
scheduler must be completed before directly submitting a recovery job,
which indeed simplifies many problems, especially the various kinds of
fence synchronization.
Regards,
David zhou
Regards,
Christian.
The above is why I introduced the recovery entity and recovery run queue.
Regards,
David Zhou
Regards,
Christian.
On 2016-07-28 12:13, Chunming Zhou wrote:
Every VM has its own recovery entity, which is used to recover its
page table from the shadow.
A VM doesn't need to wait for the VMs in front of it to complete.
Using all PTE rings also speeds up recovery.
Every scheduler has its own recovery entity, which is used to save
hardware jobs and resubmit them; this solves the conflicts between the
reset thread and the scheduler thread when running a job.
The series also contains some fixes made while doing this improvement.
Chunming Zhou (11):
drm/amdgpu: hw ring should be empty when gpu reset
drm/amdgpu: specify entity to amdgpu_copy_buffer
drm/amd: add recover run queue for scheduler
drm/amdgpu: fix vm init error path
drm/amdgpu: add vm recover entity
drm/amdgpu: use all pte rings to recover page table
drm/amd: add recover entity for every scheduler
drm/amd: use scheduler to recover hw jobs
drm/amd: hw job list should be exact
drm/amd: reset jobs to recover entity
drm/amdgpu: no need fence wait every time
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_benchmark.c | 3 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 35 +++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 11 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_test.c | 8 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 26 ++++--
drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 129 +++++++++++++-------------
drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 4 +-
9 files changed, 134 insertions(+), 92 deletions(-)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx