On Fri, Jul 04, 2025 at 02:27:02PM +0200, Christian König wrote: > On 04.07.25 14:00, Jani Nikula wrote: > > On Fri, 04 Jul 2025, Aakash Deep Sarkar <aakash.deep.sar...@intel.com> > > wrote: > >> dma_fence_wait_timeout returns a long type but the driver is > >> only using the lower 32 bits of the retval and discarding the > >> upper 32 bits. > >> > >> This is particularly problematic if there are already signalled > >> or stub fences on some of the hw planes. In this case the > >> dma_fence_wait_timeout function will immediately return with > >> timeout value MAX_SCHEDULE_TIMEOUT (0x7fffffffffffffff) since > >> the fence is already signalled. > > That is not correct. If the fence is signaled dma_fence_wait_timeout() > returns the remaining timeout or at least 1 if the input timeout was 0. >
>From my look this looks correct. > Could be that i915 has a separately implemented fence_ops->wait callback > which incorrectly returns MAX_SCHEDULE_TIMEOUT, but i strongly doubt that > because that would break tons of stuff. > > > If the driver only uses the lower > >> 32 bits of this return value then it'll interpret it as an error > >> code (0xFFFFFFFF or (-1)) and skip the wait on the remaining fences. > >> > >> This issue was first observed with the Android compositor where > >> the GPU composited layer was not properly waited on when there > >> were stub fences in other overlay planes resulting in significant > >> visual artifacts. > > > > Thanks for the patch, good catch! > > If I'm not completely mistaken this patch is not necessary. > > dma_fence_wait_timeout() does *not* return MAX_SCHEDULE_TIMEOUT when the > fence is signaled, but rather the remaining timeout which is the input value > at maximum. > > So when the input timeout fits into 32bits the returned timeout also fits > into 32bits as well. > Sure, but the input here is MAX_SCHEDULE_TIMEOUT, which is not 32 bits. The interface accepts a 32-bit timeout input, so the return value can be greater than 32 bits. > And as far as I can see i915 is using the config option > DRM_I915_FENCE_TIMEOUT here. So as long as you don't set this to very very > large value this should work as expected. > > >> Test: No graphical artifacts with shadertoy > > > > We've never used this commit trailer before, please let's not start now. > > > > The subject should plainly state the "what", and the commit message > > should answer the "why". You do have that here, but I think the subject > > is still a bit too much nuts and bolts. > > > > For example, > > > > drm/i915/display: Fix dma_fence_wait_timeout() return value handling > > > > would state the "what" in *much* more helpful terms for anyone looking > > at git logs. > > > > Presumably this has been an error for some time. You should do a little > > bit of git blame on the offending lines. It'll find commit d59cf7bb73f3 > > ("drm/i915/display: Use dma_fence interfaces instead of i915_sw_fence"). > > > > Based on that, we should add: > > > > Fixes: d59cf7bb73f3 ("drm/i915/display: Use dma_fence interfaces instead of > > i915_sw_fence") > > Cc: Ville Syrjälä <ville.syrj...@linux.intel.com> > > Cc: Maarten Lankhorst <maarten.lankho...@linux.intel.com> > > Cc: Jouni Högander <jouni.hogan...@intel.com> > > Cc: <sta...@vger.kernel.org> # v6.8+ > > > > Then it occurs to me this looks like a common mistake to make. A little > > bit of git grep on dma_fence_wait_timeout() quickly finds multiple > > similar mistakes in drm, at least amdgpu, etnaviv, msm, and tegra. Cc > > some maintainers FYI. This class of bugs could cause issues elsewhere. > > I can only guess but I think all those use cases use a fixed small timeout as > well. IIRC amdgpu uses a timeout in the millisecond range. > Ah, no. A quick grep shows multiple cases in AMDGPU where long + MAX_SCHEDULE_TIMEOUT is used: - amdgpu_vm_cpu_update - amdgpu_vm_wait_idle - amdgpu_bo_kmap - amdgpu_hmm_invalidate_gfx - amdgpu_gem_wait_idle_ioctl I'm pretty confused by the pushback here—it's among the most confusing I've ever seen. Again, this patch is almost certainly correct. I've made this mistake at least twice myself. > > Let's also Cc Julia and Dan in case they have ideas to improve static > > analysis to catch this class of bugs. Or maybe one exists already, but > > we're just not running them! > > Yeah, agree. A script which checks if the input timeout could be >32bit and > the return value is assigned to something <=32bits is probably a really good > idea. > Yes, I agree. -Wstrict-overflow or -Wconversion in GCC would catch this, but enabling them causes the core kernel includes to quickly explode. A static checker would be a good solution here, or we could fix the Linux kernel so it compiles cleanly with these options. Matt > Regards, > Christian. > > > > > Finally, for the actual change, > > > > Reviewed-by: Jani Nikula <jani.nik...@intel.com> > > > > > >> Signed-off-by: Aakash Deep Sarkar <aakash.deep.sar...@intel.com> > >> --- > >> drivers/gpu/drm/i915/display/intel_display.c | 3 ++- > >> 1 file changed, 2 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/i915/display/intel_display.c > >> b/drivers/gpu/drm/i915/display/intel_display.c > >> index 456fc4b04cda..7035c1fc9033 100644 > >> --- a/drivers/gpu/drm/i915/display/intel_display.c > >> +++ b/drivers/gpu/drm/i915/display/intel_display.c > >> @@ -7092,7 +7092,8 @@ static void intel_atomic_commit_fence_wait(struct > >> intel_atomic_state *intel_stat > >> struct drm_i915_private *i915 = to_i915(intel_state->base.dev); > >> struct drm_plane *plane; > >> struct drm_plane_state *new_plane_state; > >> - int ret, i; > >> + long ret; > >> + int i; > >> > >> for_each_new_plane_in_state(&intel_state->base, plane, new_plane_state, > >> i) { > >> if (new_plane_state->fence) { > > >