Yeah, that sounds like a plan to me.
I'm going to commit the patches I've already done in a minute; they
still seem to implement quite a bit of what we want to do here.
Regards,
Christian.
On 12.10.2017 at 10:03, Liu, Monk wrote:
V2 summary
Hi team,
*Please give your comments.*
- When a job times out (threshold set via the lockup_timeout kernel
parameter), the KMD should do the following in its TDR routine:
1. Update adev->*gpu_reset_counter* and stop the scheduler first.
2. Set its fence error status to "*ECANCELED*".
3. Find the *context* behind this job and mark that *context* as
"*guilty*" (via a new member field in the context structure: *bool
guilty*).
a) The entity structure will carry a "bool *guilty" pointer that is
pointed at its parent context's "bool guilty" member when the context
is initialized, so whether we hold the context or the entity, we always
know whether it is "guilty" (see the sketch after this list).
b) Kernel entities used for VM updates have no context behind them, so
a kernel entity's "bool *guilty" is always NULL.
c) Skipping the whole context is done for consistency: we will
fake-signal the hung job in job_run(), so all jobs in its context must
be dropped, otherwise we get either bad drawing/compute results or
further GPU hangs.
4. Do the GPU reset. This can be a set of callbacks so that bare-metal
and SR-IOV can each implement it in their preferred style.
5. After the reset, the KMD needs to know whether VRAM was lost.
Bare-metal can implement a function to judge this, while for SR-IOV I
prefer to read it from the GIM side (for the initial version we assume
VRAM is always lost, until the GIM-side change is aligned).
6. If VRAM was lost, update adev->*vram_lost_counter*.
7. Do GTT recovery and shadow buffer recovery.
8. Re-schedule all jobs in the mirror list and restart the scheduler.
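A minimal sketch of the guilty wiring from item 3 (plain C with
simplified, hypothetical struct names; not the real amdgpu types):

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the real amdgpu context/entity structures. */
struct my_ctx {
	bool guilty;              /* set by the TDR routine */
};

struct my_entity {
	bool *guilty;             /* points at the parent ctx's flag;
	                           * NULL for kernel entities (VM updates) */
};

static void my_entity_init(struct my_entity *entity, struct my_ctx *ctx)
{
	/* Kernel entities have no context behind them, so guilty stays NULL. */
	entity->guilty = ctx ? &ctx->guilty : NULL;
}

static bool my_entity_is_guilty(const struct my_entity *entity)
{
	return entity->guilty && *entity->guilty;
}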
- For the GPU scheduler function job_run():
1. Before scheduling a job to the ring, check whether
job->*vram_lost_counter* == adev->*vram_lost_counter*, and drop the job
on mismatch.
2. Before scheduling a job to the ring, check whether
job->entity->*guilty* is NULL, *and drop the job if (guilty != NULL &&
*guilty == TRUE)*.
3. If a job is dropped:
a) Set the job's sched_fence status to "*ECANCELED*".
b) Fake/force-signal the job's hw fence (no need to set the hw fence's
status).
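Putting the two checks together, the drop decision in job_run() might
look roughly like this (sketch only; my_* names are placeholders):

#include <errno.h>
#include <stdbool.h>

struct my_entity {
	bool *guilty;                   /* NULL for kernel entities */
};

struct my_job {
	unsigned int vram_lost_counter; /* snapshot taken at cs_submit time */
	struct my_entity *entity;
};

/* Returns 0 to schedule the job, or -ECANCELED if it must be dropped. */
static int my_job_run_check(struct my_job *job,
			    unsigned int adev_vram_lost_counter)
{
	/* 1. The VRAM contents this job depends on are gone. */
	if (job->vram_lost_counter != adev_vram_lost_counter)
		return -ECANCELED;

	/* 2. The whole context was marked guilty by the TDR routine. */
	if (job->entity->guilty && *job->entity->guilty)
		return -ECANCELED;

	return 0;
}

/* On a drop, the caller sets the job's sched_fence status to -ECANCELED
 * and fake/force-signals the hw fence so waiters don't block forever. */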
- For the cs_wait() IOCTL:
Once it finds the fence signaled, it should check whether there is an
error on the fence and return that fence's error status.
- For the cs_wait_fences() IOCTL:
Similar to the above approach.
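On the kernel side, dma_fence already carries the error:
dma_fence_get_status() returns 1 for signaled-without-error, 0 for
not-yet-signaled, or the negative error previously set with
dma_fence_set_error(). A sketch of the wait IOCTLs' return path (the
surrounding function name is made up):

#include <linux/dma-fence.h>

/* Called once the fence has signaled; fence was looked up from the
 * user-space handle earlier. */
static long my_cs_wait_status(struct dma_fence *fence)
{
	int status = dma_fence_get_status(fence);

	if (status < 0)
		return status;	/* e.g. -ETIME or -ECANCELED set by TDR */

	return 0;		/* 1: signaled without error */
}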
- For the cs_submit() IOCTL:
1. Check whether the current ctx has been marked "*guilty*" and return
"*ECANCELED*" if so.
2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and
return "*ECANCELED*" if ctx->*vram_lost_counter* !=
job->*vram_lost_counter* (Christian already submitted this patch).
a) Discussion: can we return "ENODEV" on a vram_lost_counter mismatch?
That way the UMD knows this context is under "device lost".
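A sketch of that submit-time gate (simplified types; the ENODEV
alternative from the discussion point is noted inline):

#include <errno.h>
#include <stdbool.h>

struct my_ctx {
	bool guilty;
	unsigned int vram_lost_counter;   /* sampled at context creation */
};

static int my_cs_submit_check(const struct my_ctx *ctx,
			      unsigned int adev_vram_lost_counter)
{
	/* 1. Guilty contexts are permanently blocked. */
	if (ctx->guilty)
		return -ECANCELED;

	/* 2. Counter mismatch: the VRAM this ctx was built on is gone.
	 * (-ENODEV instead would tell the UMD "device lost"; see the
	 * discussion point above.) */
	if (ctx->vram_lost_counter != adev_vram_lost_counter)
		return -ECANCELED;

	/* On success the job then snapshots the current counter:
	 * job->vram_lost_counter = adev->vram_lost_counter; */
	return 0;
}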
- Introduce a new IOCTL to let the UMD query the latest
adev->*vram_lost_counter* (see the sketch below).
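One natural home would be the existing AMDGPU_INFO ioctl; a sketch of
such a handler (the token name and value are assumptions here, not a
final API):

#include <linux/atomic.h>
#include <linux/types.h>
#include <linux/uaccess.h>

#define AMDGPU_INFO_VRAM_LOST_COUNTER	0x1f	/* assumed token */

/* adev->vram_lost_counter assumed to be an atomic_t on the device. */
static int my_info_vram_lost_counter(atomic_t *vram_lost_counter,
				     void __user *out)
{
	u32 count = atomic_read(vram_lost_counter);

	return copy_to_user(out, &count, sizeof(count)) ? -EFAULT : 0;
}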
- For amdgpu_ctx_query():
  - *Don't update ctx->reset_counter when handling this query, otherwise
    the query result is not consistent.*
  - Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the ctx
    is "*guilty*"; no need to check "ctx->reset_counter".
  - Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" *if the
    ctx isn't "guilty" && ctx->reset_counter != adev->reset_counter*.
  - Set out->state.reset_status to "AMDGPU_CTX_NO_RESET" if
    ctx->reset_counter == adev->reset_counter.
  - Set out->state.flags to "AMDGPU_CTX_FLAG_VRAM_LOST" if
    ctx->vram_lost_counter != adev->vram_lost_counter.
    - Discussion: can we return "ENODEV" from amdgpu_ctx_query() if
      ctx->vram_lost_counter != adev->vram_lost_counter? That way the
      UMD knows this context is under "device lost".
  - The UMD shall release this context if its status is
    AMDGPU_CTX_GUILTY_RESET or its flags contain
    "AMDGPU_CTX_FLAG_VRAM_LOST".
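Taken together, the decision logic could look like this sketch (the
AMDGPU_CTX_*_RESET values follow the existing uapi style; the VRAM_LOST
flag and the simplified structs are assumptions):

#include <stdbool.h>

enum {
	AMDGPU_CTX_NO_RESET,
	AMDGPU_CTX_GUILTY_RESET,
	AMDGPU_CTX_INNOCENT_RESET,
};
#define AMDGPU_CTX_FLAG_VRAM_LOST	0x1	/* proposed flag */

struct my_ctx {
	bool guilty;
	unsigned int reset_counter;
	unsigned int vram_lost_counter;
};

static void my_ctx_query(const struct my_ctx *ctx,
			 unsigned int adev_reset_counter,
			 unsigned int adev_vram_lost_counter,
			 unsigned int *reset_status, unsigned int *flags)
{
	/* Deliberately do NOT write adev_reset_counter back into the ctx,
	 * so repeated queries return consistent results. */
	if (ctx->guilty)
		*reset_status = AMDGPU_CTX_GUILTY_RESET;
	else if (ctx->reset_counter != adev_reset_counter)
		*reset_status = AMDGPU_CTX_INNOCENT_RESET;
	else
		*reset_status = AMDGPU_CTX_NO_RESET;

	*flags = 0;
	if (ctx->vram_lost_counter != adev_vram_lost_counter)
		*flags |= AMDGPU_CTX_FLAG_VRAM_LOST;
}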
For UMD behavior we still have something to consider:
If MESA creates a new context from an old context (share list? I'm not
familiar with the UMD; David Mao should discuss it with Nicolai), the
newly created context's vram_lost_counter and reset_counter must both
be ported over from that old context, otherwise CS_SUBMIT will not
block it, which isn't correct. A sketch follows.
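A sketch of that inheritance at context creation (simplified types;
whether the share-list plumbing can pass the old context down is
exactly the open UMD question):

#include <stdbool.h>
#include <stddef.h>

struct my_ctx {
	bool guilty;
	unsigned int reset_counter;
	unsigned int vram_lost_counter;
};

static void my_ctx_init(struct my_ctx *newctx, const struct my_ctx *old,
			unsigned int adev_reset_counter,
			unsigned int adev_vram_lost_counter)
{
	newctx->guilty = false;
	if (old) {
		/* Derived context (e.g. GL share list): inherit the
		 * counters so a lost device stays lost for the whole
		 * share group. */
		newctx->reset_counter = old->reset_counter;
		newctx->vram_lost_counter = old->vram_lost_counter;
	} else {
		newctx->reset_counter = adev_reset_counter;
		newctx->vram_lost_counter = adev_vram_lost_counter;
	}
}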
Need your feedback, thx
*From:*amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] *On
Behalf Of *Liu, Monk
*Sent:* October 11, 2017 13:34
*To:* Koenig, Christian <christian.koe...@amd.com>; Haehnle, Nicolai
<nicolai.haeh...@amd.com>; Olsak, Marek <marek.ol...@amd.com>;
Deucher, Alexander <alexander.deuc...@amd.com>
*Cc:* Ramirez, Alejandro <alejandro.rami...@amd.com>;
amd-gfx@lists.freedesktop.org; Filipas, Mario <mario.fili...@amd.com>;
Ding, Pixel <pixel.d...@amd.com>; Li, Bingley <bingley...@amd.com>;
Jiang, Jerry (SW) <jerry.ji...@amd.com>
*Subject:* TDR and VRAM lost handling in KMD:
Hi Christian & Nicolai,
We need to reach agreement on what MESA/UMD should do and what the KMD
should do. *Please give your comments with "okay" or "no" and your
ideas on the items below.*
- When a job times out (threshold set via the lockup_timeout kernel
parameter), the KMD should do the following in its TDR routine:
1. Update adev->*gpu_reset_counter* and stop the scheduler first
(*gpu_reset_counter* is used to force a VM flush after GPU reset; that
is out of this thread's scope, so no more discussion on it).
2. Set its fence error status to "*ETIME*".
3. Find the entity/ctx behind this job and mark that ctx as "*guilty*".
4. Kick this job out of the scheduler's mirror list, so it won't get
re-scheduled to the ring anymore.
5. Kick out all jobs in this "guilty" ctx's KFIFO queue and set all
their fence statuses to "*ECANCELED*".
*6.* Force-signal all fences kicked out by the above two steps,
*otherwise the UMD will block forever when waiting on those fences*.
7. Do the GPU reset. This can be a set of callbacks so that bare-metal
and SR-IOV can each implement it in their preferred style.
8. After the reset, the KMD needs to know whether VRAM was lost.
Bare-metal can implement a function to judge this, while for SR-IOV I
prefer to read it from the GIM side (for the initial version we assume
VRAM is always lost, until the GIM-side change is aligned).
9. If VRAM was not lost, continue; otherwise (sketched after this
list):
a) Update adev->*vram_lost_counter*.
b) Iterate over all living ctxs and mark every ctx as "*guilty*", since
VRAM loss actually ruins all VRAM contents.
c) Kick out all jobs in all ctxs' KFIFO queues and set all their fence
statuses to "*ECANCELED*".
10. Do GTT recovery and VRAM page table/entry recovery (optional; do we
need it?).
11. Re-schedule all jobs remaining in the mirror list to the ring and
restart the scheduler (for the VRAM lost case, no job will be
re-scheduled).
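A sketch of step 9's "otherwise" branch (contexts simplified to a flat
array; the real driver would walk its context manager):

#include <stdbool.h>

struct my_ctx {
	bool guilty;
};

struct my_dev {
	unsigned int vram_lost_counter;
	struct my_ctx *ctxs;	/* all living contexts, simplified */
	unsigned int num_ctxs;
};

static void my_handle_vram_lost(struct my_dev *adev)
{
	unsigned int i;

	adev->vram_lost_counter++;
	for (i = 0; i < adev->num_ctxs; i++) {
		/* VRAM loss ruins everyone's contents, so every living
		 * ctx becomes guilty ... */
		adev->ctxs[i].guilty = true;
		/* ... and its queued jobs get kicked with their fence
		 * status set to ECANCELED, then force-signaled (step 6). */
	}
}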
- For the cs_wait() IOCTL:
Once it finds the fence signaled, it should check with
*"dma_fence_get_status"* whether there is an error on the fence, and
return the fence's error status.
- For the cs_wait_fences() IOCTL:
Similar to the above approach.
- For the cs_submit() IOCTL:
It needs to check whether the current ctx has been marked "*guilty*"
and return "*ECANCELED*" if so.
- Introduce a new IOCTL to let the UMD query *vram_lost_counter*:
This way the UMD can also block the app from submitting. As @Nicolai
mentioned, we can cache one copy of *vram_lost_counter* when
enumerating the physical device, and deny all gl-contexts from
submitting if the queried counter is bigger than the one cached in the
physical device (looks a little overkill to me, but easy to implement).
The UMD can also return an error to the app when creating a gl-context
if the currently queried *vram_lost_counter* is bigger than the one
cached in the physical device. A sketch of both checks follows.
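A UMD-side sketch (the cached counter lives in a hypothetical
per-physical-device struct; the queried value comes from whatever
wrapper the new IOCTL ends up getting):

#include <stdbool.h>

struct my_physical_device {
	/* cached once, when the physical device is enumerated */
	unsigned int cached_vram_lost_counter;
};

/* True if a VRAM loss happened after this physical device was
 * enumerated; the UMD would then deny submits from its gl-contexts and
 * fail new gl-context creation. */
static bool my_device_lost(const struct my_physical_device *pdev,
			   unsigned int queried_vram_lost_counter)
{
	return queried_vram_lost_counter > pdev->cached_vram_lost_counter;
}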
BTW: I realized that a gl-context is a little different from the
kernel's context, because for the kernel a BO is tied not to a context
but only to an FD, while in the UMD a BO has a backing gl-context. So
blocking submission in the UMD layer is also needed, although the KMD
will still do its job as the bottom line.
- Basically, "vram_lost_counter" is exposed by the kernel to let the
UMD take control of the robustness extension feature; it will be the
UMD's call to make, and the KMD will only deny "guilty" contexts from
submitting.
Need your feedback, thx
We'd better get the TDR feature landed ASAP.
BR Monk