On 15/01/2026 12:06, Danilo Krummrich wrote:
On Thu Jan 15, 2026 at 9:56 AM CET, Tvrtko Ursulin wrote:
On 14/01/2026 17:48, Danilo Krummrich wrote:
On Fri Dec 19, 2025 at 2:53 PM CET, Tvrtko Ursulin wrote:
+/**
+ * struct drm_sched_entity_stats - execution stats for an entity.
+ * @kref: reference count for the object.
+ * @lock: lock guarding the @runtime updates.
+ * @runtime: time entity spent on the GPU.
+ *
+ * Because jobs and entities have decoupled lifetimes, ie. we cannot access the

The beginning of this sentence seems slightly broken.

Suggest me an alternative because I don't see it?

Nevermind, I misread, the sentence seems fine grammatically. However,...

+ * entity once the job is completed and we know how much time it took on the

...this seems wrong. It should say something like "once the job has been taken
from the entity queue". There is no guarantee that the entity the job originated
from lives until the job is completed.

Right, the second part of the sentence is indeed confused. I'll improve it.

+ * GPU, we need to track these stats in a separate object which is then
+ * reference counted by both entities and jobs.
+ */
+struct drm_sched_entity_stats {
+       struct kref     kref;
+       spinlock_t      lock;
+       ktime_t         runtime;

We can avoid the lock entirely by using an atomic64_t instead. ktime_t is just a
typedef for s64.
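
E.g. a minimal sketch of the lockless variant, assuming @runtime stays the
only tracked value (illustration only, not a concrete proposal):

/* Sketch only: lockless variant if @runtime remains the only field. */
struct drm_sched_entity_stats {
	struct kref	kref;
	atomic64_t	runtime;	/* nanoseconds, i.e. the s64 behind ktime_t */
};

static inline void
drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
{
	struct drm_sched_entity_stats *stats = job->entity_stats;
	struct drm_sched_fence *s_fence = job->s_fence;
	ktime_t start, end;

	start = dma_fence_timestamp(&s_fence->scheduled);
	end = dma_fence_timestamp(&s_fence->finished);

	/* No lock needed, the delta is folded in with a single atomic add. */
	atomic64_add(ktime_to_ns(ktime_sub(end, start)), &stats->runtime);
}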

Later in the series the lock is needed (more members get added), so I wanted
to avoid the churn of converting the atomic64_t back to ktime_t in the fair
policy patch.

Fair enough. Are those subsequently added fields in some relationship with the
timestamp, i.e. do those fields need to be updated all together atomically?

Yes. First virtual runtime, then also average job duration.
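
Roughly along these lines later in the series (the extra field names below
are made up here purely for illustration); the point is that everything has
to be updated under one lock so readers see a consistent snapshot:

struct drm_sched_entity_stats {
	struct kref	kref;
	spinlock_t	lock;
	ktime_t		runtime;
	ktime_t		vruntime;		/* hypothetical name */
	ktime_t		avg_job_duration;	/* hypothetical name */
};

/* Hypothetical helper, just to show the combined update under the lock. */
static void
drm_sched_entity_stats_account(struct drm_sched_entity_stats *stats,
			       ktime_t duration)
{
	spin_lock(&stats->lock);
	stats->runtime = ktime_add(stats->runtime, duration);
	stats->vruntime = ktime_add(stats->vruntime, duration);
	stats->avg_job_duration =
		ktime_divns(ktime_add(stats->avg_job_duration, duration), 2);
	spin_unlock(&stats->lock);
}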

+};

<snip>

+/**
+ * drm_sched_entity_stats_job_add_gpu_time - Account job execution time to entity
+ * @job: Scheduler job to account.
+ *
+ * Accounts the execution time of @job to its respective entity stats object.
+ */
+static inline void
+drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job)
+{
+       struct drm_sched_entity_stats *stats = job->entity_stats;
+       struct drm_sched_fence *s_fence = job->s_fence;
+       ktime_t start, end;
+
+       start = dma_fence_timestamp(&s_fence->scheduled);
+       end = dma_fence_timestamp(&s_fence->finished);
+
+       spin_lock(&stats->lock);
+       stats->runtime = ktime_add(stats->runtime, ktime_sub(end, start));
+       spin_unlock(&stats->lock);
+}

This shouldn't be an inline function in the header, please move to
sched_entity.c.

It is not super pretty for a static inline, but it was a pragmatic choice
because the whole entity stats object doesn't really belong in
sched_entity.c. Jobs and entities only have an association relationship
with struct drm_sched_entity_stats. The only caller of this is even in
sched_main.c, while other updates are done in and from sched_rq.c.

But you put drm_sched_entity_stats_release() and drm_sched_entity_stats_alloc()
into sched_entity.c as well, I don't see how that is different.

Indeed I have. Must have had a different reason back when I wrote it. I will move it.

Besides, the struct is called struct drm_sched_entity_stats, i.e. stats of an
entity. The documentation says "execution stats for an entity", so it clearly
belongs to entities, no?

So if the pragmatic approach is not acceptable I would rather create a
new file along the lines of sched_entity_stats.h|c. Unless that would
turn out to have some other design wart of leaking knowledge of some
other part of the scheduler (ie. wouldn't be fully standalone).
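
Something along these lines, i.e. a small standalone header plus matching
implementation file (names and layout below are just a rough outline, not
an actual proposal):

/* sched_entity_stats.h - rough outline only */
#ifndef DRM_SCHED_ENTITY_STATS_H
#define DRM_SCHED_ENTITY_STATS_H

#include <linux/kref.h>
#include <linux/ktime.h>
#include <linux/spinlock.h>

struct drm_sched_job;

struct drm_sched_entity_stats {
	struct kref	kref;
	spinlock_t	lock;
	ktime_t		runtime;
};

struct drm_sched_entity_stats *drm_sched_entity_stats_alloc(void);
struct drm_sched_entity_stats *
drm_sched_entity_stats_get(struct drm_sched_entity_stats *stats);
void drm_sched_entity_stats_put(struct drm_sched_entity_stats *stats);
void drm_sched_entity_stats_job_add_gpu_time(struct drm_sched_job *job);

#endif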

Given the above, please just move this function into sched_entity.c.

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f825ad9e2260..4c10c7ba6704 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -660,6 +660,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
 
 	job->sched = sched;
 	job->s_priority = entity->priority;
+	job->entity_stats = drm_sched_entity_stats_get(entity->stats);
 
 	drm_sched_fence_init(job->s_fence, job->entity);
 }
@@ -849,6 +850,7 @@ void drm_sched_job_cleanup(struct drm_sched_job *job)
 		 * been called.
 		 */
 		dma_fence_put(&job->s_fence->finished);
+		drm_sched_entity_stats_put(job->entity_stats);
 	} else {
 		/* The job was aborted before it has been committed to be run;
 		 * notably, drm_sched_job_arm() has not been called.
@@ -1000,8 +1002,10 @@ static void drm_sched_free_job_work(struct work_struct *w)
 		container_of(w, struct drm_gpu_scheduler, work_free_job);
 	struct drm_sched_job *job;
 
-	while ((job = drm_sched_get_finished_job(sched)))
+	while ((job = drm_sched_get_finished_job(sched))) {
+		drm_sched_entity_stats_job_add_gpu_time(job);

Is it really always OK to update this value in the free job work? What if a new
job gets scheduled concurrently? Doesn't this hurt accuracy, since the entity
value has not been updated yet?

What exactly do you mean by entity value?

If a new job gets scheduled concurrently then it is either just about to
run or still running, both of which are not relevant for this finished
job, and once finished it will also end up here to have its duration
accounted against the stats.

So, what I mean is that the timeframe between a running job's fence being
signaled due to completion and this same job being freed in the free job
work by the driver can be pretty big.

In the meantime the scheduler might have to make multiple decisions on which
entity is next to be scheduled. And by calling
drm_sched_entity_stats_job_add_gpu_time() in drm_sched_job_cleanup() rather than
when its finished fence is signaled we give up on accuracy in terms of
fairness, while fairness is the whole purpose of this scheduling approach.
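
For illustration only (not a concrete patch), the accounting could happen
right where the finished fence gets signaled, so the stats are up to date
before the next scheduling decision. A sketch, assuming the entity_stats
reference taken in drm_sched_job_arm() is still held at that point and the
helper name below is made up:

/* Sketch only: signal the finished fence and account immediately. */
static void drm_sched_job_finish_and_account(struct drm_sched_job *s_job,
					     int result)
{
	struct drm_sched_fence *s_fence = s_job->s_fence;

	/* Signalling records the finished fence timestamp... */
	drm_sched_fence_finished(s_fence, result);

	/* ...so the scheduled->finished delta can be accounted right away. */
	drm_sched_entity_stats_job_add_gpu_time(s_job);
}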

Right, so yes, the entity runtime lags the actual situation by the delay between scheduling and running the free worker.

TBH the problem now is that I wrote this so long ago that I don't even remember what the reason was for moving this from the job done callback to the finished worker. Digging through my branches, it happened during April '25. I will try to remind myself while I am making other tweaks.

But in principle, I am not too concerned with this. In practice this delay isn't really measurable, and for actual fairness a much, much bigger issue is the general lack of preemption in many drivers, coupled with submitting more than one job per entity at a time. In that sense the algorithm is much fairer than FIFO or RR, but does not aim for or promise complete fairness. The main thing is that it appears better than, or the same as, both FIFO and RR across all workloads, kind of like a best of both worlds with extra qualities on top. And it simplifies the code base at the same time.

Regards,

Tvrtko
