Re: TDR and VRAM lost handling in KMD (v2)

Nicolai Hähnle Fri, 13 Oct 2017 00:43:44 -0700

On 13.10.2017 05:57, Liu, Monk wrote:

That sounds sane, but unfortunately might not be possible with the existing 
IOCTL. Keep in mind that we need to keep backward compatibility here.
unfortunately the current scheme on amdgpu_ctx_query() won’t work withTDR feature, which is aim to support vulkan/mesa/close-ogl/radv …
It’s enumeration is too limited to MESA implement …
*Do you have good idea ? both keep the compatibility here and giveflexible ?*
*looks like we need to add a new amdgpu_ctx_query_2() INTERFACE ….*

·*A new IOCTL added for context:*

Void amdgpu_ctx_reinit(){

         Ctx→vram_lost_counter = adev→vram_lost_counter;

         Ctx→reset_counter = adev→reset_counter;

}


Mhm, is there any advantage to just creating a new context?

[ML] sorry, this function is wrong, here is my original idea:

MESA can create a new ctx based on an old one,like:

Create gl-ctx1,

Create BO-A under gl-ctx1 …

VRAM LOST
Create gl-ctx2 from gl-ctx1 (share list, I’m not familiar with UMD,David Mao an Nicolai can input)
Create BO-b UNDER gl-ctx2 …

Submit job upon gl-ctx2, but it can refer to BO-A,
With our design, kernel won’t block submit from context2 (from gl-ctx2)since its vram_lost_counter equals to latest adev copy
But gl-ctx2 is a clone from gl-ctx1, the only difference is thevram_lost/gpu_reset counter is updated to latest one
So logically we should also block submit from gl-ctx2 (mapping to kernelcontext2), and we failed do so …
That’s why I want to add a new “amdgpu_ctx_clone”, which should work like:

Int amdgpu_ctx_clone(struct context *original, struct context *new) {

     New->vram_lost_counter = original->vram_lost_counter;

     New->gpu_reset_counter = original->gpu_reset_counter;

}

From the Mesa perspective, I don't think we would need to use that aslong as we can query the device VRAM lost counter.

Personally, I think it's better for the UMD to build its policy onlower-level primitives such as the VRAM lost counter query.


Cheers,
Nicolai


BR Monk

*From:*Koenig, Christian
*Sent:* 2017年10月12日19:50

*To:* Liu, Monk <monk....@amd.com>; Haehnle, Nicolai<nicolai.haeh...@amd.com>; Michel Dänzer <mic...@daenzer.net>; Olsak,Marek <marek.ol...@amd.com>; Deucher, Alexander<alexander.deuc...@amd.com>; Zhou, David(ChunMing)<david1.z...@amd.com>; Mao, David <david....@amd.com>*Cc:* Ramirez, Alejandro <alejandro.rami...@amd.com>;amd-gfx@lists.freedesktop.org; Filipas, Mario <mario.fili...@amd.com>;Ding, Pixel <pixel.d...@amd.com>; Li, Bingley <bingley...@amd.com>;Jiang, Jerry (SW) <jerry.ji...@amd.com>

*Subject:* Re: TDR and VRAM lost handling in KMD (v2)

Am 12.10.2017 um 13:37 schrieb Liu, Monk:

    Hi team

    Very good, many policy and implement are agreed, looks we only have
    some arguments in amdgpu_ctx_query(), well I also confused with the
    current implement of it, ☹

    First, I want to know if you guys agree that we*don't update
    ctx->reset_counter in amdgpu_ctx_query() *? because I want to make
    the query result always consistent upon a given context,

That sounds like a good idea to me, but I'm not sure if it won't breakuserspace (I don't think so). Nicolai or Marek need to comment.



    Second, I want to say that for KERNEL, it shouldn't use the term
    from MESA or OGL or VULKAN, e.g. kernel shouldn't use
    AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...

    Because that way kernel will be very limited to certain UMD, so I
    suggest we totally re-name the context status, and each UMD has its
    own way to map the kernel context's result to gl-context/vk-context/etc…


Yes, completely agree.


    Kernel should only provide below **FLAG bits** on a given context:

    ·Define AMDGPU_CTX_STATE_GUILTY 0x1             //as long as TDR
    detects a job hang, KMD set the context behind this context as "guilty"

    ·Define AMDGPU_CTX_STATE_VRAMLOST         0x2      //as long as
    there is a VRAM lost event hit after this context created, we mark
    this context "VRAM_LOST", so UMD can say that all BO under this
    context may lost their content,  since kernel have no relationship
    between context and BO so this is UMD's call to judge if a BO
    considered "VRAM lost" or not.

    ·Define AMDGPU_CTX_STATE_RESET   0x3     //as long as there is a gpu
    reset occurred after context creation, this flag shall be set

That sounds sane, but unfortunately might not be possible with theexisting IOCTL. Keep in mind that we need to keep backward compatibilityhere.



    Sample code:

    Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, …..) {

             if (ctx- >vram_lost_counter != adev->vram_lost_counter)

                     out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;

    if (ctx- >reset_counter != adev→reset_counter) {

    out- >status |= AMDGPU_CTX_STATE_RESET;

    if (ctx- >guilty == TRUE)

                             out- >status |= AMDGPU_CTX_STATE_GUILTY;

    }

    return 0;

    }

    For UMD if it found "out.status == 0" means there is no gpu reset
    even occurred, the context is totally regular,

    ·*A new IOCTL added for context:*

    Void amdgpu_ctx_reinit(){

             Ctx→vram_lost_counter = adev→vram_lost_counter;

             Ctx→reset_counter = adev→reset_counter;

    }


Mhm, is there any advantage to just creating a new context?

Regards,
Christian.


    *if UMD decide *not* to release the "guilty" context but continue
    using it after UMD acknowledged GPU hang on certain job/context, I
    suggest UMD call "amdgpu_ctx_reinit()":*

    That way after you re-init() this context, you can get updated
    result from "amdgpu_ctx_query", which will probably give you
    "out.status == 0" as long as no gpu reset/vram lost hit after re-init().

    BR Monk

    -----Original Message-----
    From: Koenig, Christian
    Sent: 2017年10月12日18:13
    To: Haehnle, Nicolai <nicolai.haeh...@amd.com>
    <mailto:nicolai.haeh...@amd.com>; Michel Dänzer <mic...@daenzer.net>
    <mailto:mic...@daenzer.net>; Liu, Monk <monk....@amd.com>
    <mailto:monk....@amd.com>; Olsak, Marek <marek.ol...@amd.com>
    <mailto:marek.ol...@amd.com>; Deucher, Alexander
    <alexander.deuc...@amd.com> <mailto:alexander.deuc...@amd.com>;
    Zhou, David(ChunMing) <david1.z...@amd.com>
    <mailto:david1.z...@amd.com>; Mao, David <david....@amd.com>
    <mailto:david....@amd.com>
    Cc: Ramirez, Alejandro <alejandro.rami...@amd.com>
    <mailto:alejandro.rami...@amd.com>; amd-gfx@lists.freedesktop.org
    <mailto:amd-gfx@lists.freedesktop.org>; Filipas, Mario
    <mario.fili...@amd.com> <mailto:mario.fili...@amd.com>; Ding, Pixel
    <pixel.d...@amd.com> <mailto:pixel.d...@amd.com>; Li, Bingley
    <bingley...@amd.com> <mailto:bingley...@amd.com>; Jiang, Jerry (SW)
    <jerry.ji...@amd.com> <mailto:jerry.ji...@amd.com>
    Subject: Re: TDR and VRAM lost handling in KMD (v2)

    Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:

    > On 12.10.2017 11:35, Michel Dänzer wrote:

    >> On 12/10/17 11:23 AM, Christian König wrote:

    >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:

    >>>> On 12.10.2017 10:49, Christian König wrote:

    >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter

    >>>>>> does not imply that the context was lost.

    >>>>>>

    >>>>>> The way I understand it, we should return

    >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != 
adev->vram_lost_counter.

    >>>>>>

    >>>>>> As far as I understand it, the case of !guilty &&

    >>>>>> ctx->reset_counter != adev->reset_counter &&

    >>>>>> ctx->vram_lost_counter ==

    >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,

    >>>>>> adev->because a

    >>>>>> GPU reset occurred, but it didn't affect our context.

    >>>>> I disagree on that.

    >>>>>

    >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there

    >>>>> was a reset but we haven't been causing it.

    >>>>>

    >>>>> That the OpenGL extension is specified otherwise is unfortunate,

    >>>>> but I think we shouldn't use that for the kernel interface here.

    >>>> Two counterpoints:

    >>>>

    >>>> 1. Why should any application care that there was a reset while it

    >>>> was idle? The OpenGL behavior is what makes sense.

    >>>

    >>> The application is certainly not interest if a reset happened or

    >>> not, but I though that the driver stack might be.

    >>>

    >>>>

    >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today

    >>>> because we never return it :)

    >>>>

    >>>

    >>> Good point.

    >>>

    >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which

    >>>> is in line with the OpenGL spec: we're conservatively returning

    >>>> that a reset happened because we don't know whether the context was

    >>>> affected, and we return UNKNOWN because we also don't know whether

    >>>> the context was guilty or not.

    >>>>

    >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&

    >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a

    >>>> refinement and improvement of the current, overly conservative

    >>>> behavior.

    >>>

    >>> Ok let's reenumerate what I think the different return values should

    >>> mean:

    >>>

    >>> * AMDGPU_CTX_GUILTY_RESET

    >>>

    >>> guilty is set to true for this context.

    >>>

    >>> * AMDGPU_CTX_INNOCENT_RESET

    >>>

    >>> guilty is not set and vram_lost_counter has changed.

    >>>

    >>> * AMDGPU_CTX_UNKNOWN_RESET

    >>>

    >>> guilty is not set and vram_lost_counter has not changed, but

    >>> gpu_reset_counter has changed.

    >>

    >> I don't understand the distinction you're proposing between

    >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both

    >> cases you're describing should return either

    >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or

    >> AMDGPU_CTX_UNKNOWN_RESET if it's not.

    >

    > I think it'd make more sense if it was called

    > "AMDGPU_CTX_UNAFFECTED_RESET".

    >

    > So:

    > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and

    > we know that it's the context's fault

    > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,

    > and we know that it *wasn't* the context's fault (no context job

    > active)

    > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,

    > and we don't know who's responsible (this could be returned in the

    > unlikely case where context A's gfx job has not yet finished, but

    > context B's gfx job has already started; it could be the fault of A,

    > it could be the fault of B -- which somehow manages to hang a part of

    > the hardware that then prevents A's job from finishing -- or it could

    > be both; but it's a bit academic)

    > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context

    > wasn't affected

    >

    > This last value would currently just be discarded by Mesa (because we

    > should only disturb applications when we have to), but perhaps

    > somebody else could find it useful?

    Yes, that's what I had in mind as well.

    Cause otherwise we would return AMDGPU_CTX_NO_RESET while there
    actually was a reset and that certainly doesn't sound correct to me.

    Regards,

    Christian.

    >

    > Cheers,

    > Nicolai


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: TDR and VRAM lost handling in KMD (v2)

Reply via email to