At the very least I don't get the hangs like I used to before the patchset so this patch isn't regressing any of that behaviour.

Tom


On 13/10/17 07:06 AM, Liu, Monk wrote:
Yeah, this one looks good

You can put my reviewed-by on it

*From:*Koenig, Christian
*Sent:* 2017年10月13日18:14
*To:* Liu, Monk <monk....@amd.com>; Nicolai Hähnle <nhaeh...@gmail.com>; amd-gfx@lists.freedesktop.org
*Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Good point as well. How about the attached version?

This time we keep an extra reference in amd_sched_process_job() until we are sure that we don't need the s_fence any more.

Regards,
Christian.

Am 13.10.2017 um 11:13 schrieb Liu, Monk:

    your patch looks good,  do you think we should also do this:

      void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
      {
    -       int ret = fence_signal(&fence->scheduled);
    +       int ret;
    +
    +       fence_get(&fence->scheduled;)
    +       ret = fence_signal(&fence->scheduled);

             if (!ret)
                     FENCE_TRACE(&fence->scheduled, "signaled from irq
    context\n");
             else
                     FENCE_TRACE(&fence->scheduled, "was already
    signaled\n");
    +       fence_put(&fence->scheduled);
      }

    ------------------------------------------------------------------------

    *From:*Koenig, Christian
    *Sent:* Friday, October 13, 2017 5:00:27 PM
    *To:* Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org
    <mailto:amd-gfx@lists.freedesktop.org>
    *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2

        There is chance that free_job() called before that
        “trace_amd_sched_process_job”, correct?

    Correct, but that is harmless.

    Take a look what trace_amd_sched_process_job actually does, it just
    prints the pointer of the fence structure (but the pointer might be
    stale at this point).

    Nevertheless you are right that this is really ugly.

    How about the attached patch to fix the issue?

    Regards,
    Christian.

    Am 13.10.2017 um 10:51 schrieb Liu, Monk:

        The free_job() is called in sched_job_finish() which is queued
        on a WORK and scheduled from that “amd_sched_fence_finished()”

        So the finishing timing of free_job() is asynchronized with
        sched_process_job()

        There is chance that free_job() called before that
        “trace_amd_sched_process_job”, correct?

        And if so the s_fence referred by it maybe a wild pointer

        BR Monk

        *From:*Liu, Monk
        *Sent:* 2017年10月13日16:49
        *To:* Koenig, Christian <christian.koe...@amd.com>
        <mailto:christian.koe...@amd.com>; Nicolai Hähnle
        <nhaeh...@gmail.com> <mailto:nhaeh...@gmail.com>;
        amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
        *Subject:* RE: regression with
        d6c650c0a8f6f671e49553725e1db541376d95f2

        No that’s not true

        The free_job() is called in sched_job_finish() which is queued
        on a WORK and scheduled from that “amd_sched_fence_finished()”

        So the finishing timing of free_job() is asynchronized with
        sched_process_job()

        How can you sure free_job() must before
        “trace_amd_sched_process_job” ?

        *From:*Koenig, Christian
        *Sent:* 2017年10月13日16:43
        *To:* Liu, Monk <monk....@amd.com <mailto:monk....@amd.com>>;
        Nicolai Hähnle <nhaeh...@gmail.com <mailto:nhaeh...@gmail.com>>;
        amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
        *Subject:* Re: regression with
        d6c650c0a8f6f671e49553725e1db541376d95f2

        The free_job() callback is called only way after the job has
        finished.

        That is one change actually made by you to the code :)

        Christian.

        Am 13.10.2017 um 10:39 schrieb Liu, Monk:

            I doubt it would always work fine…

            First, we have FENCE_TRACE reference s_fence->finished after
            “fence_signal(&fence->finished)”

            Second, we have trace_amd_sched_proess_job(s_fence) after
            “amd_sched_fence_finished()”,

            If you put the finished before free_job() and by coincidence
            the job_finish() get very soon executed you’ll have odds to
            hit wild pointer on above two cases

            BR Monk

            *From:*Koenig, Christian
            *Sent:* 2017年10月13日16:17
            *To:* Liu, Monk <monk....@amd.com>
            <mailto:monk....@amd.com>; Nicolai Hähnle
            <nhaeh...@gmail.com> <mailto:nhaeh...@gmail.com>;
            amd-gfx@lists.freedesktop.org
            <mailto:amd-gfx@lists.freedesktop.org>
            *Subject:* Re: regression with
            d6c650c0a8f6f671e49553725e1db541376d95f2

            Yeah, that change is actually incorrect and should be reverted.

            What we really need to do is remove dropping
            sched_job->s_fence from amd_sched_process_job() into
            amd_sched_job_finish() directly before the call to free_job().

            Regards,
            Christian.

            Am 13.10.2017 um 09:24 schrieb Liu, Monk:

                commit d6c650c0a8f6f671e49553725e1db541376d95f2
                Author: Nicolai Hähnle <nicolai.haeh...@amd.com>
                <mailto:nicolai.haeh...@amd.com>
                @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)

                                 fence = sched->ops->run_job(sched_job);
                                 amd_sched_fence_scheduled(s_fence);
                +
                +               /* amd_sched_process_job drops the job's
                reference of the fence. */
                +               sched_job->s_fence = NULL;
                +
                                 if (fence) {
                                         s_fence->parent =
                dma_fence_get(fence);
                                         r =
                dma_fence_add_callback(fence, &s_fence->cb,

                Hi Nicolai

                with this patch, you will break
                "amdgpu_sched_hw_job_reset()"routine:

                void

                amd_sched_hw_job_reset(struct amd_gpu_scheduler

                *sched)

                {

                     struct amd_sched_job

                *s_job;

                     spin_lock(&sched->job_list_lock);

                     list_for_each_entry_reverse(s_job,

                &sched->ring_mirror_list, node) {

                         if (s_job->s_fence->parent

                &&

                          fence_remove_callback(s_job->s_fence->parent,

                                      &s_job->s_fence->cb))

                {

                             fence_put(s_job->s_fence->parent);

                             s_job->s_fence->parent

                =

                NULL;

                             atomic_dec(&sched->hw_rq_count);

                         }

                     }

                     spin_unlock(&sched->job_list_lock);

                }

                see that without sched_job->s_fence, you cannot remove
                the callback from its hw fence,

                any idea??

                BR Monk



_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to