On Fri, 2025-12-05 at 10:19 +0100, Christian König wrote:
> On 12/4/25 17:04, Alex Deucher wrote:
> > On Wed, Dec 3, 2025 at 4:24 AM Philipp Stanner <[email protected]> wrote:
> > > 
> > > +Cc Alex, Christian, Danilo
> > > 
> > > 
> > > On Mon, 2025-12-01 at 10:39 -0800, Matthew Brost wrote:
> > > > Stop open coding pending job list in drivers. Add pending job list
> > > > iterator which safely walks DRM scheduler list asserting DRM scheduler
> > > > is stopped.
> > > > 
> > > > v2:
> > > >  - Fix checkpatch (CI)
> > > > v3:
> > > >  - Drop locked version (Christian)
> > > > v4:
> > > >  - Reorder patch (Niranjana)
> > > 
> > > Same with the changelog.
> > > 
> > > > 
> > > > Signed-off-by: Matthew Brost <[email protected]>
> > > > Reviewed-by: Niranjana Vishwanathapura 
> > > > <[email protected]>
> > > > ---
> > > >  include/drm/gpu_scheduler.h | 50 +++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 50 insertions(+)
> > > > 
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index 385bf34e76fe..9d228513d06c 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -730,4 +730,54 @@ static inline bool 
> > > > drm_sched_job_is_signaled(struct drm_sched_job *job)
> > > >               dma_fence_is_signaled(&s_fence->finished);
> > > >  }
> > > > 
> > > > +/**
> > > > + * struct drm_sched_pending_job_iter - DRM scheduler pending job 
> > > > iterator state
> > > > + * @sched: DRM scheduler associated with pending job iterator
> > > > + */
> > > > +struct drm_sched_pending_job_iter {
> > > > +     struct drm_gpu_scheduler *sched;
> > > > +};
> > > > +
> > > > +/* Drivers should never call this directly */
> > > > +static inline struct drm_sched_pending_job_iter
> > > > +__drm_sched_pending_job_iter_begin(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +     struct drm_sched_pending_job_iter iter = {
> > > > +             .sched = sched,
> > > > +     };
> > > > +
> > > > +     WARN_ON(!drm_sched_is_stopped(sched));
> > > > +     return iter;
> > > > +}
> > > > +
> > > > +/* Drivers should never call this directly */
> > > > +static inline void
> > > > +__drm_sched_pending_job_iter_end(const struct 
> > > > drm_sched_pending_job_iter iter)
> > > > +{
> > > > +     WARN_ON(!drm_sched_is_stopped(iter.sched));
> > > > +}
> > > > +
> > > > +DEFINE_CLASS(drm_sched_pending_job_iter, struct 
> > > > drm_sched_pending_job_iter,
> > > > +          __drm_sched_pending_job_iter_end(_T),
> > > > +          __drm_sched_pending_job_iter_begin(__sched),
> > > > +          struct drm_gpu_scheduler *__sched);
> > > > +static inline void *
> > > > +class_drm_sched_pending_job_iter_lock_ptr(class_drm_sched_pending_job_iter_t
> > > >  *_T)
> > > > +{ return _T; }
> > > > +#define class_drm_sched_pending_job_iter_is_conditional false
> > > > +
> > > > +/**
> > > > + * drm_sched_for_each_pending_job() - Iterator for each pending job in 
> > > > scheduler
> > > > + * @__job: Current pending job being iterated over
> > > > + * @__sched: DRM scheduler to iterate over pending jobs
> > > > + * @__entity: DRM scheduler entity to filter jobs, NULL indicates no 
> > > > filter
> > > > + *
> > > > + * Iterator for each pending job in scheduler, filtering on an entity, 
> > > > and
> > > > + * enforcing scheduler is fully stopped
> > > > + */
> > > > +#define drm_sched_for_each_pending_job(__job, __sched, __entity)       
> > > >       \
> > > > +     scoped_guard(drm_sched_pending_job_iter, (__sched))               
> > > >       \
> > > > +             list_for_each_entry((__job), &(__sched)->pending_list, 
> > > > list)    \
> > > > +                     for_each_if(!(__entity) || (__job)->entity == 
> > > > (__entity))
> > > > +
> > > >  #endif
> > > 
> > > 
> > > See my comments in the first patch. The docu doesn't mention at all why
> > > this new functionality exists and when and why users would be expected
> > > to use it.
> > > 
> > > As far as I remember from XDC, both AMD and Intel overwrite a timed out
> > > jobs buffer data in the rings on GPU reset. To do so, the driver needs
> > > the timedout job (passed through timedout_job() callback) and then
> > > needs all the pending non-broken jobs.
> > > 
> > > AFAICS your patch provides a generic iterator over the entire
> > > pending_list. How is a driver then supposed to determine which are the
> > > non-broken jobs (just asking, but that needs to be documented)?
> > > 
> > > Could it make sense to use a different iterator which only returns jobs
> > > of not belonging to the same context as the timedout-one?
> > > 
> > > Those are important questions that need to be addressed before merging
> > > that.
> > > 
> > > And if this works canonically (i.e., for basically everyone), it needs
> > > to be documented in drm_sched_resubmit_jobs() that this iterator is now
> > > the canonical way of handling timeouts.
> > > 
> > > Moreover, btw, just yesterday I added an entry to the DRM todo list
> > > which addresses drm_sched_resubmit_jobs(). If we merge this, that entry
> > > would have to be removed, too.
> > > 
> > > 
> > > @AMD: Would the code Matthew provides work for you? Please give your
> > > input. This is very important common infrastructure.
> > 
> > I don't think drm_sched_resubmit_jobs() can work for us without major
> > rework.  For our kernel queues, we have a single queue on which jobs
> > for different clients are scheduled.  When we reset the queue, we lose
> > all jobs on the queue and have to re-emit the non-guilty ones.  We do
> > this at the ring level, i.e., we save the packets directly from the
> > ring and then re-emit the packets for the non-guilty contexts to the
> > freshly reset ring.  This avoids running run_job() again which would
> > issue new fences and race with memory management, etc.
> > 
> > I think the following would be workable:
> > 1. driver job_timedout() callback flags the job as bad. resets the bad
> > queue, and calls drm_sched_resubmit_jobs()
> > 2. drm_sched_resubmit_jobs() walks the pending list and calls
> > run_job() for every job
> 
> Calling run_job() multiple times was one of the worst ideas I have ever seen.
> 
> The problem here is that you need a transactional approach to the internal 
> driver state which is modified by ->run_job().
> 
> So for example if you have:
> ->run_job(A)
> ->run_job(B)
> ->run_job(C)
> 
> And after a reset you find that you need to re-submit only job B and A & C 
> are filtered then that means that your driver state needs to get back before 
> running job A.
> 
> > 2. driver run_job() callback looks to see if we already ran this job
> > and uses the original fence rather than allocating a new one
> 
> Nope, the problem is *all* drivers *must* use the original fence. Otherwise 
> you always run into trouble.
> 
> We should not promote a driver interface which makes it extremely easy to 
> shoot down the whole system.
> 
> > 3. driver run_job() callback checks to see if the job is guilty or
> > from the same context and if so, sets an error on the fences and
> > submits only the fence packet to the queue so that any follow up jobs
> > will properly synchronize if they need to wait on the fence from the
> > bad job.
> > 4. driver run_job() callback will submit the full packet stream for
> > non-guilty contexts
> > 
> > I guess we could use the iterator and implement that logic in the
> > driver directly rather than using drm_sched_resubmit_jobs().
> 
> Yeah, exactly that's the way to go.

Sorry, I guess my message was confusing.

I don't mean anyone to use drm_sched_resubmit_jobs() at all. That
function is deprecated, and since there are still users, we can't
easily modify it.
I just mentioned that function because its docstring should inform
about what users should do *instead* of calling that function.
Currently, it's just marked as deprecated without providing users with
an alternative.

Anyways.
So can you take a look at Matthew's iterator and see if that will work
for you?

If we'd end up with an at least mostly-generic solution for resubmits
that's in use in both Xe and amdgpu, that would be a huge leap forward.


P.

> 
> Christian.
> 
> > 
> > Alex
> 

Reply via email to