On Mon, May 26, 2025 at 02:54:44PM +0200, Pierre-Eric Pelloux-Prayer wrote:
drivers/gpu/drm/xe/xe_sched_job.c                |  3 ++-
diff --git a/drivers/gpu/drm/xe/xe_sched_job.c 
b/drivers/gpu/drm/xe/xe_sched_job.c
index f0a6ce610948..5921293b25db 100644
--- a/drivers/gpu/drm/xe/xe_sched_job.c
+++ b/drivers/gpu/drm/xe/xe_sched_job.c
@@ -113,7 +113,8 @@ struct xe_sched_job *xe_sched_job_create(struct 
xe_exec_queue *q,
        kref_init(&job->refcount);
        xe_exec_queue_get(job->q);

-       err = drm_sched_job_init(&job->drm, q->entity, 1, NULL);
+       err = drm_sched_job_init(&job->drm, q->entity, 1, NULL,
+                                q->xef->drm->client_id);

you can't do this here. xef is only !NULL if it's a job from userspace.
For in-kernel jobs, xef is NULL and this explodes. Right now this
completely breaks xe since one of the very first things we do is
to submit a job to save the default context. Example:
https://intel-gfx-ci.01.org/tree/intel-xe/xe-3151-56d2b14961751a677ff1f7ff8b93a6c814ce2be3/bat-bmg-1/igt@xe_module_l...@load.html

        <4> [] RIP: 0010:xe_sched_job_create+0xbd/0x390 [xe]
        <4> [] Code: c1 43 18 85 c0 0f 84 6f 02 00 00 8d 50 01 09 c2 0f 88 3e 02 00 
00 48 8b 03 48 8b b3 d8 00 00 00 31 c9 4c 89 ef ba 01 00 00 00 <48> 8b 40 08 4c 8b 40 
60 e8 86 64 7c ff 41 89 c4 85 c0 0f 85 9b 01
        <4> [] RSP: 0018:ffffc900031972d8 EFLAGS: 00010246
        <4> [] RAX: 0000000000000000 RBX: ffff88815fc40d00 RCX: 0000000000000000
        <4> [] RDX: 0000000000000001 RSI: ffff88812e6552a8 RDI: ffff88815f939c40
        <4> [] RBP: ffffc90003197318 R08: 0000000000000000 R09: 0000000000000000
        <4> [] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003197428
        <4> [] R13: ffff88815f939c40 R14: ffff88811f054000 R15: ffff88815fc40d00
        <4> [] FS:  00007681f2948940(0000) GS:ffff8888daf14000(0000) 
knlGS:0000000000000000
        <4> [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        <4> [] CR2: 0000000000000008 CR3: 0000000118315004 CR4: 0000000000f72ef0
        <4> [] PKRU: 55555554
        <4> [] Call Trace:
        <4> []  <TASK>
        <4> []  __xe_bb_create_job+0xa2/0x240 [xe]
        <4> []  ? find_held_lock+0x31/0x90
        <4> []  ? xa_find_after+0x12c/0x250
        <4> []  xe_bb_create_job+0x6e/0x380 [xe]
        <4> []  ? xa_find_after+0x136/0x250
        <4> []  ? __drm_dev_dbg+0x7d/0xb0
        <4> []  xe_gt_record_default_lrcs+0x542/0xb00 [xe]

Can we use 0 for in-kernel client since drm_file starts them from 1?
Like this:

| diff --git a/drivers/gpu/drm/xe/xe_sched_job.c 
b/drivers/gpu/drm/xe/xe_sched_job.c
| index 5921293b25db3..d21bf8f269640 100644
| --- a/drivers/gpu/drm/xe/xe_sched_job.c
| +++ b/drivers/gpu/drm/xe/xe_sched_job.c
| @@ -114,7 +114,7 @@ struct xe_sched_job *xe_sched_job_create(struct 
xe_exec_queue *q,
|         xe_exec_queue_get(job->q);
| | err = drm_sched_job_init(&job->drm, q->entity, 1, NULL,
| -                                q->xef->drm->client_id);
| +                                q->xef ? q->xef->drm->client_id : 0);
|         if (err)
|                 goto err_free;

I tested with the above diff and it at least loads...

Also, I see this in intel-xe mailing list, but I'm not sure why we
didn't have any CI results... I will check that.

Lucas De Marchi

Reply via email to