On 3/23/2025 8:09 PM, Chris Bainbridge wrote:
There is a reference couting / lifecycle issue with drm_connector when used
with a USB-C dock. The problem has been previously reproduced on both Intel and
AMD GPUs.

On both Intel and AMD, the symptoms are:

   - multiple connectors being listed in sysfs `sys/class/drm/cardX/` (because
     the old connectors are not removed when the dock is unplugged)
   - no display on the external monitors.
   - "Payload for VCPI 1 not in topology, not sending remove" error if drm.debug
     is enabled

On AMD, this issue is the root cause of a number of errors when re-plugging in
a dock:

   - *ERROR* Failed to get ACT after 3000ms
   - kernel NULL pointer dereference calling setcrtc
   - UBSAN: shift-out-of-bounds in drivers/gpu/drm/display/drm_dp_mst_topology.c
   - use-after-free in dc_stream_release
   - refcount_t: underflow; use-after-free.
   - slab-use-after-free in event_property_validate
   - WARNING display/dc/dcn21/dcn21_link_encoder.c:215 
dcn21_link_encoder_acquire_phy
   - Part 1 of payload creation for DP-2 failed, skipping part 2
   - probably most bug reports relating to suspend/resume and a dock

This bug has been reproduced on both Ubuntu/Gnome and Debian/XFCE. The symptoms
are intermittent and vary (as above), but the consistent initial symptom is
multiple connectors being listed in sysfs.

To reproduce, annotate drm_dp_delayed_destroy_port with something like:

--- a/drivers/gpu/drm/display/drm_dp_mst_topology.c
+++ b/drivers/gpu/drm/display/drm_dp_mst_topology.c
@@ -5014,6 +5014,9 @@ drm_dp_delayed_destroy_port(struct drm_dp_mst_port *port)
if (port->connector) {
                 drm_connector_unregister(port->connector);
+               printk("drm_dp_delayed_destroy_port %s refcount=%d\n",
+                               port->connector->name,
+                               kref_read(&port->connector->base.refcount));
                 drm_connector_put(port->connector);
         }
Boot laptop with dock connected, activate external monitors, suspend, unplug
the dock, and resume. This problem is intermittent, so these steps may need to
be repeated. But when the problem is hit, the drm_dp_mst_port will be
destroyed, but the drm_connector will still be alive. (This can also be
reproduced with just plugging and unplugging without suspend/resume, but, on my
laptop, it happens almost every time with suspend/resume).

The cause of this problem appears to be:

   - calling setcrtc to enable a CRTC results in the drm_connector refcount
     being incremented:
   - drm_atomic_get_connector_state appears to add connectors into
     drm_atomic_state->connectors, and increments the refcount

   - on disabling the external monitors, a call to drm_mode_setcrtc results in
     the drm_connector being destroyed via call chain:

     amdgpu_drm_ioctl
       drm_ioctl
         drm_ioctl_kernel
           drm_mode_setcrtc (via func)
             drm_atomic_helper_set_config (via crtc->funcs->set_config)
               drm_atomic_state_put
                 __drm_atomic_state_free (via kref_put)
                   drm_atomic_state_clear
                     drm_atomic_state_default_clear
                       drm_connector_put
                         drm_mode_object_put
                           drm_connector_free (via ->free_cb put destroyer)
                             dm_dp_mst_connector_destroy

   - so the drm_connector is not destroyed until/if userspace calls setcrtc to
     clear the CRTC (set.num_connectors=0). If this does not happen for whatever
     reason (userspace process is terminated, frozen due to suspend, etc.) then
     the drm_connector object will still be alive even though the corresponding
     drm_dp_mst_port is dead.

   - in normal usage, drm_connector_cleanup releases the connector ID:

     ida_free(&dev->mode_config.connector_ida, connector->index);

   - when dock is replugged, a connector ID is allocated:

     connector->connector_type_id = ida_alloc_min(connector_ida, 1, GFP_KERNEL);

   - if setcrtc has not been called to free the old ID, then ida_alloc_min
     allocates a new connector ID instead of reusing the old one. This explains
     the "multiple connectors being listed in sysfs" problem.

   - the other problems occur after this, due to the multiple half-dead
     connector objects.

   - UBSAN: shift-out-of-bounds in 
drivers/gpu/drm/display/drm_dp_mst_topology.c:4568
     occurs because vcpi==0 in this payload, so BIT op is a left-shift by -1.

   - slab-use-after-free in event_property_validate: looks like it happens
     because hdcp_update_display, hdcp_remove_display copy references to
     amdgpu_dm_connector (which contains a nested drm_connector) in to the
     delayed_work struct hdcp_workqueue without incrementing the reference count
     (see pair of lines "hdcp_w->aconnector[conn_index] = aconnector;").
     If the connector is freed, &aconnector[conn_index] will become a dangling
     pointer. Actually, I can reproduce this easily by just booting to
     gdm then plugging and unplugging the dock a few times, so it's
     possible this is an independent issue that also needs fixing.

   - use-after-free in dc_stream_release - there appears to be a few
     points where a dc_stream_state pointer is copied without refcounting
     ("pipe_ctx->stream = stream;") but I don't know if this is the problem. It
     could also just be that earlier failures have left something in a bad 
state.

I'm unsure of the best approach to fix the root cause. One way is to try
and release the references by disabling the CRTC. I tried calling
drm_mode_crtc from drm_dp_delayed_destroy_port. This was a bit hacky,
but did seem to work, the reference count got reduced to 0, and the
drm_connector was destroyed. Another option would be to call the
drm_connector destructor from drm_dp_delayed_destroy_port (protected by
some mutex so that it doesn't get called twice when the actual refcount
goes to 0) - that might work to free up the connector ID, but I suspect
there could be other issues with having the drm_connector object still
alive and potentially holding references to other objects, even though
the dock has been physically disconnected.

Thanks for the thorough analysis and all your thoughts here. It sounds like fixing this is going to resolve quite a large number of issues.

Working up your call path, drm_dp_destroy_port() is what queues the delayed destroy work and specifically has a comment about holding the mode config mutex from EDID retrieval being the reason that it can't be destroyed immediately.

Perhaps could this be solved by reading EDID immediately when the device
shows up and caching it for all callers? Then this could be an immediate destroy.

Reply via email to