I was pointed to this by a friend of mine (Mayank Rungta), whom I helped
crack a radeon driver Oops on suspend (bug 820746):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/820746/

For some reason, he wanted me to take a look at this issue as well, which
has been inactive for some time :)

Below is a detailed RCA and some action items for the person who
reproduced this. It's evident that he had 3 connectors (displays)
attached to the radeon driver at boot time and might also have been
suspending with them (LVDS laptop display + VGA connector + HDMI
connector). I need to know how it was reproduced, and whether the laptop
lid was closed or the machine suspended with all 3 connectors attached,
so that I can ask my friend with similar hardware and radeon driver (the
same person who reproduced 820746) to reproduce this.

Read ahead for the full story:
=====================

Again using the objdump disassembly of the radeon driver (radeon.ko.out)
attached to bug 820746 (same 2.6.38 kernel), I managed to pin down the
place that's causing the Oops.

Reverse engineering the Oops to the assembly and mapping the assembly
back to C code, the panic was triggered by this loop on radeon driver
suspend in radeon_suspend_kms:

radeon_suspend_kms():

	/* turn off display hw */
	list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
		drm_helper_connector_dpms(connector, DRM_MODE_DPMS_OFF);
	}

In the above list_head iteration over connector_list for the radeon
drm_device on SUSPEND, dev->mode_config.connector_list.next is NULL.

In other words, the DRM device connector list is _corrupted_. It's
almost certain that a device connector was detached or destroyed while
suspend was trying to switch off the display on all your connectors.

dev->mode_config.connector_list.next is NULL.
  The faulting instruction at EIP was triggered by a NULL list pointer
  loaded into register EBX.
  After the list_entry adjustment, the EBX value is 0xfffffea8,
  which is nothing but: NULL pointer - 0x158,
  which is nothing but: ~0U - 0x157.

  The Oops EIP is at:
  radeon_suspend_kms+0x78

which the objdump disassembly maps to address 19888:
  19888: 8b 83 58 01 00 00 mov 0x158(%ebx),%eax
  bingo:
  That's the list_entry macro trying to iterate over "struct drm_connector",
i.e. the drm connector list. The drm connector list head field is at offset
0x158, which has to be subtracted from the list_head pointer to arrive at the
drm_connector.

So at panic time, the radeon driver Oopsed while trying to suspend the
display on each of the connected devices, because the connector list was
corrupted.

Also, the Oops hexdump exactly matches the objdump disassembly hexdump at
the time of the panic:
81 eb 58 01 00 00 <8b> 83 58 01 00 00 0f
<8b> (in angle brackets) is the first byte of the faulting instruction,
the "mov".

This matches the list_head walk for the drm connector from the objdump
disassembly of the radeon_suspend_kms function:
   19882:   81 eb 58 01 00 00       sub    $0x158,%ebx
   19888:   8b 83 58 01 00 00       mov    0x158(%ebx),%eax ->PANIC EIP is here.


Now that we know the C code and the reason for the Oops (the NULL pointer
field), we have to trace backwards in the code and see how the drm
connector list can get corrupted, i.e. how it can end up with a NULL or
otherwise corrupted element.

I cross-checked that there is only one place where the connectors can be
destroyed, which is drm_mode_config_cleanup, and that is called only on
radeon unload. This kind of corruption can typically happen if the code
uses list_for_each_entry instead of list_for_each_entry_safe while
detaching each of the entries in the list
-> in this case the radeon device drm connectors.

But the connector destroy/detach paths for radeon,
radeon_connector_destroy and radeon_dp_connector_destroy, invoke
drm_connector_cleanup, which correctly removes the connector from the
list with list_del before freeing it. So the cause isn't obvious there
either, since the code does seem to be safe w.r.t. removing or detaching
each of the displays/connectors attached to the device (radeon in this
case).

So I am not sure whether we are hitting a race condition, with suspend
trying to switch off the display on each of your connectors while the
radeon driver is getting unloaded in parallel. (A race is possible since
the switch-off code in suspend doesn't take dev->mode_config.mutex for
the walk.)

I found from your boot-time dmesg that you had 3 displays attached (laptop
LVDS + VGA + HDMI).
So please let us know the reproduction scenario: whether you tried to
suspend with all 3 connections, or pulled (disabled) one of the connectors
and then suspended/hibernated your laptop, or were shutting down (not sure).

The panic is caused by a corruption in the drm/radeon driver's management
of the connectors attached to the display. So knowing the scenario would
help us narrow down the culprit, or at least help us re-create the
problem and go further.

Just to let you know, I am not a video driver expert by any means, but I
do have a good ability to debug and hence volunteered to take a stab at
this issue at my friend's request. So even if the true cause behind the
connector corruption in the display driver isn't found, don't take it to
heart :)

[   11.572380] [drm] Radeon Display Connectors
[   11.572384] [drm] Connector 0:
[   11.572387] [drm]   LVDS
[   11.572390] [drm]   DDC: 0x7f68 0x7f68 0x7f6c 0x7f6c 0x7f70 0x7f70 0x7f74 0x7f74
[   11.572393] [drm]   Encoders:
[   11.572395] [drm]     LCD1: INTERNAL_UNIPHY2
[   11.572397] [drm] Connector 1:
[   11.572399] [drm]   VGA
[   11.572402] [drm]   DDC: 0x7e40 0x7e40 0x7e44 0x7e44 0x7e48 0x7e48 0x7e4c 0x7e4c
[   11.572405] [drm]   Encoders:
[   11.572407] [drm]     CRT1: INTERNAL_KLDSCP_DAC1
[   11.572409] [drm] Connector 2:
[   11.572411] [drm]   HDMI-A
[   11.572412] [drm]   HPD1
[   11.572415] [drm]   DDC: 0x7e50 0x7e50 0x7e54 0x7e54 0x7e58 0x7e58 0x7e5c 0x7e5c
[   11.572418] [drm]   Encoders:
[   11.572420] [drm]     DFP1: INTERNAL_UNIPHY

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/756555

Title:
  BUG: unable to handle kernel NULL pointer dereference at   (null)
  radeon_suspend_kms+0x78/0x1e0 [radeon] from
  radeon_switcheroo_set_state+0x4b/0xa0

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/756555/+subscriptions
