Re: force app kill patch

Christian König Wed, 18 Apr 2018 02:04:44 -0700

Am 18.04.2018 um 11:00 schrieb Liu, Monk:

1.Drm_sched_entity_fini(): it exit right after entity->job_queueempty, [ but that time scheduler is not fast enough to deal with thisentity now ]
That should never happen.
The last job from the entity->job_queue is only removed after thescheduler is done with the entity (at least that was the originalidea, not sure if that still works as expected).
[ML] no that’s not true and we already catch the kernel NULL pointerissue with a entity->last_scheduled fence get double put , exactlylike the way I described in the scenario …
Pixel already fixed it by moving the put/get pair onentity->last_scheduled prior to spsc_queue_pop() and the race issue istherefore avoided

Yeah, already seen and reviewed that. That's a good catch, please makesure that this gets pushed to amd-staging-drm-next ASAP.


Christian.

/Monk

*From:*Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
*Sent:* 2018年4月18日16:36

*To:* Liu, Monk <monk....@amd.com>; Koenig, Christian<christian.koe...@amd.com>; Deng, Emily <emily.d...@amd.com>

*Cc:* amd-gfx@lists.freedesktop.org
*Subject:* Re: force app kill patch

    See that in “sched_entity_fini”, we only call
    dma_fence_put(entity->last_scheduled” under the condition of “If
    (entity->fini_status)”, so

    This way there is memory leak for the case of “entity->fini_stats ==0”

Good catch, we indeed should fix that.


    1.Drm_sched_entity_fini(): it exit right after entity->job_queue
    empty, [ but that time scheduler is not fast enough to deal with
    this entity now ]

That should never happen.

The last job from the entity->job_queue is only removed after thescheduler is done with the entity (at least that was the originalidea, not sure if that still works as expected).


Regards,
Christian.

Am 18.04.2018 um 09:20 schrieb Liu, Monk:

    *Correctio for the scenario *

    After we move fence_put(entity->last_sched) out of the fini_status
    check:

    A potential race issue for the scenario:

    1.Drm_sched_entity_fini(): it exit right after entity->job_queue
    empty, [ but that time scheduler is not fast enough to deal with
    this entity now ]

    2.Drm_sched_entity_cleanup() : it call
    dma_fence_put(entity->last_scheduled)  [ but this time
    entity->last_scheduled  actually points to the fence prior to the
    real last one ]

    3.Scheduler_main() now dealing with this entity: it call
    dma_fence_put(entity->last_scheduled)    [   Now this fence get
    double put !!!  ]

    4.Scheduler_main() now call dma_fence_get() on the *real* last one !

    So eventually the real last one fence triggers memory leak and
    more critical the double put fence cause NULL pointer access

    /Monk

    *From:*Liu, Monk
    *Sent:* 2018年4月18日15:11
    *To:* Koenig, Christian <christian.koe...@amd.com>
    <mailto:christian.koe...@amd.com>; Deng, Emily
    <emily.d...@amd.com> <mailto:emily.d...@amd.com>
    *Cc:* amd-gfx@lists.freedesktop.org
    <mailto:amd-gfx@lists.freedesktop.org>
    *Subject:* force app kill patch

    Hi Christian & Emily

    I think the v4 fix for “fix force app kill hang”is still not good
    enough:

    First:

    See that in “sched_entity_fini”, we only call
    dma_fence_put(entity->last_scheduled”under the condition of “If
    (entity->fini_status)”, so

    This way there is memory leak for the case of “entity->fini_stats ==0”

    Second:

    If we move dma_fence_put(entity->last_scheduled) out of the
    condition of “if (entity->fini_status)”, the memory leak issue can
    be fixed

    But there will be kernel NULL pointer access, I think the time you
    call dma_fence_put(entity->last_scheduled”) may actually executed
    **not**

    On the last scheduled fence of this entity, because it is run
    without “thread_park/unpark”pair which to make sure scheduler not
    dealing this entity

    So with certain race issue, here is the scenario:

    1.scheduler is doing the dma_fence_put() on the 1^st fence,

    2.scheduler set entity->last_scheduled to 1^st fence

    3.now sched_entity_fini() run, and it call dma_fence_put() on
    entity->last_scheduled

    4.now this 1^st fence is actually put double time and the real
    last fence won’t get put by expected

    any idea?

    /Monk




    _______________________________________________

    amd-gfx mailing list

    amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>

    https://lists.freedesktop.org/mailman/listinfo/amd-gfx



_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: force app kill patch

Reply via email to