Hi Alan, Changbin, Xenia,
I've twice encountered a crash on system reboot in usb_disable_device
that looks to be a bos descriptor use-after-free.
The machine in question is running a 4.5-rc5 kernel with proprietary and
out-of-tree Stratus drivers that facilitate device-removal testing --
these drivers give us the ability to automate/simulate device removal of
PCI devices at any time. None of these affect any of the USB drivers or
stack directly, they mainly consist of platform-aware features, PCI
hotplug calls and a few fixups in storage/network land to better survive
the testing.
Additional debugging features were enabled: slub_debug=FZPU and the
following build options:
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
I was able to scrape a dump from the crash, its backtrace looks like:
[ 4095.109442] usb 6-1: device descriptor read/8, error -22
[ 4095.240435] usb 6-1: device descriptor read/8, error -22
PID: 18692 TASK: ffff881034229640 CPU: 9 COMMAND: "kworker/9:0"
#0 [ffff880a9907f9c8] machine_kexec at ffffffff8105c97b
#1 [ffff880a9907fa28] __crash_kexec at ffffffff811097f2
#2 [ffff880a9907faf8] crash_kexec at ffffffff811098df
#3 [ffff880a9907fb10] oops_end at ffffffff8101ba89
#4 [ffff880a9907fb38] die at ffffffff8101bfab
#5 [ffff880a9907fb68] do_general_protection at ffffffff81019289
#6 [ffff880a9907fb90] general_protection at ffffffff81686948
[exception RIP: usb_disable_ltm+0x4c]
RIP: ffffffff8149210c RSP: ffff880a9907fc48 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88103756dcd0 RCX: 0000000100300014
RDX: 6b6b6b6b6b6b6b6b RSI: ffffea0040c3ac00 RDI: ffff88103756dcd0
RBP: ffff880a9907fc80 R8: ffff881030eb1110 R9: 0000000000000000
R10: ffff881030eb2df0 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88103617c290 R14: 0000000000000001 R15: 0000000000000012
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880a9907fc48] usb_disable_device at ffffffff8149f0fa
#8 [ffff880a9907fc88] usb_disconnect at ffffffff81494c74
#9 [ffff880a9907fcd8] hub_port_connect at ffffffff81496790
#10 [ffff880a9907fd58] hub_event at ffffffff8149767c
#11 [ffff880a9907fe18] process_one_work at ffffffff8109862f
#12 [ffff880a9907fe60] worker_thread at ffffffff81098f15
#13 [ffff880a9907fec8] kthread at ffffffff8109ea88
#14 [ffff880a9907ff50] ret_from_fork at ffffffff81684a0f
The crash occurred when dereferencing RDX, which is set to the
slub_debug poison pattern. The code in question:
crash> dis -l usb_disable_ltm+0x4c
drivers/usb/core/hub.c: 2976
<usb_disable_ltm+0x4c>: testb $0x2,0x3(%rdx)
drivers/usb/core/hub.c :: usb_disable_ltm :
2973 struct usb_hcd *hcd = bus_to_hcd(udev->bus);
2974
2975 /* Check if the roothub and device supports LTM. */
2976 if (!usb_device_supports_ltm(hcd->self.root_hub) ||
2977 !usb_device_supports_ltm(udev))
include/linux/usb.h :: usb_device_supports_ltm :
static inline bool usb_device_supports_ltm(struct usb_device *udev)
{
if (udev->speed != USB_SPEED_SUPER || !udev->bos ||
!udev->bos->ss_cap)
return false;
return udev->bos->ss_cap->bmAttributes & USB_LTM_SUPPORT;
}
Running through a few of the data structure offsets, it appears that
we're in usb_device_supports_ltm and referencing off udev->bos->ss_cap:
crash> struct -o usb_ss_cap_descriptor | grep bmAttributes
[0x3] __u8 bmAttributes;
>From the disassembly of usb_device_supports_ltm, RDI looks untouched, so
it's probably a struct usb_device. It's kobj name is "6-1", which
matches the usb device that had just reported an error in the kernel
log.
crash> struct usb_device ffff88103756dcd0 | grep name
name = 0xffff88102aab8540 "6-1",
Now to determine which invocation of usb_device_supports_ltm is bad.
Chasing hcd->self.root_hub ...
crash> struct usb_device ffff88103756dcd0 | grep bus
bus = 0xffff88103617c290,
crash> struct -o usb_hcd | grep bus
[0x0] struct usb_bus self;
crash> struct usb_hcd 0xffff88103617c290 | grep root_hub
root_hub = 0xffff881034ad0000,
crash> struct usb_device 0xffff881034ad0000 | grep bos
bos = 0xffff881012d37a30,
crash> struct usb_host_bos 0xffff881012d37a30 | grep ss_cap
ss_cap = 0xffff88101af89d95,
crash> struct usb_ss_cap_descriptor 0xffff88101af89d95 | grep bmAttributes
bmAttributes = 0x2,
That looks reasonable.
How about udev ...
crash> struct usb_device 0xffff88103756dcd0 | grep bos
bos = 0xffff8810343a61b0,
crash> struct usb_host_bos 0xffff8810343a61b0 | grep ss_cap
ss_cap = 0x6b6b6b6b6b6b6b6b,
Not so good. Crash utility confirms that the bos pointer is not
currently allocated:
crash> kmem 0xffff8810343a61b0
CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS
SSIZE
ffff88103f40f800 kmalloc-64 64 80423 87781 2141
16k
SLAB MEMORY NODE TOTAL ALLOCATED FREE
ffffea0040d0e900 ffff8810343a4000 0 41 11 30
FREE / [ALLOCATED]
ffff8810343a61b0
PAGE PHYSICAL MAPPING INDEX CNT FLAGS
ffffea0040d0e980 10343a6000 dead000000000400 0 0 2fffff80000000
This is odd since usb_release_bos_descriptor not only frees bos, but
also sets to NULL.
Before crash, two errors were logged:
usb 6-1: device descriptor read/8, error -22
which can only happen in hub_port_init. One of hub_port_init's callers
is usb_reset_and_verify_device, which interestingly enough can reload
udev->bos like so:
usb_reset_and_verify_device
bos = udev->bos [A]
for (i = 0; i < SET_CONFIG_TRIES; ++i)
ret = hub_port_init
"usb 6-1: device descriptor read/8, error -22" [B]
if (ret < 0)
goto re_enumerate;
re_enumerate:
usb_release_bos_descriptor(udev)
kfree(dev->bos) [C]
dev->bos = NULL
udev->bos = bos [D]
re_enumerate_no_bos:
hub_port_logical_disconnect(parent_hub, port1)
return -ENODEV
A - cache the old udev->bos
B - fail to set a new udev->bos
C - release udev->bos, set udev->bos to NULL
D - set udev->bos to the cached value from A
I'm following a hunch here, but I suspect that reboot might have
stumbled into an issue inside hub_port_init and then
usb_reset_and_verify_device accidentally restored udev->bos to a now
freed pointer value.
There's a bit of git history around usb_reset_and_verify_device and the
bos pointer in particular:
e3376d6c87ee usbcore: compare and release one bos descriptor in
usb_reset_and_verify_device()
Author: Xenia Ragiadakou <[email protected]>
Suggested-by: Alan Stern <[email protected]>
Date: Thu Aug 29 21:45:48 2013 +0300
(Fixed a memory leak when the bos descriptor changes. Releases the
old udev->bos and uses the new one.)
d8f00cd685f5c usb: hub: do not clear BOS field during reset device
Author: Du, Changbin <[email protected]>
Date: Mon Jan 18 21:02:42 2016 +0800
(In usb_reset_and_verify_device, removed the clearing of udev->bos
to NULL (as introduced in e3376d6c87ee) and modified the function's
done: path to check to see if the bos descriptor has changed, and if
so, release the old udev->bos and use the new one.)
I'm not a USB expert, so I don't know if Changbin's patch can be
extended to include the function's error paths as well, or if
usb_reset_and_verify_device can be fixed simply by jumping to
re_enumerate_no_bos in the case(s) that hub_port_init doesn't allocate a
new one.
Hopefully the analysis is correct and can help towards a patch. Let me
know if any additional information from the crash or additional tracing
would be helpful.
Thanks,
-- Joe
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html