This is not an issue caused by ModemManager or any of the host drivers. Nut I don't know where else to ask. And this documents the issue for my next Google search :-)
I'm using a Sierra Wireless EM7565 connected to the USB3 port of a Linksys WRT1900ACv1 "Mamba" (Marvell Armada 370/XP SoC), which has an Etron EJ168 PCIe connected xhci controller. The system runs a recent OpenWrt build (v5.15 kernel) with ModemManager. This works fine after the initial power-on or after physically reconnecting the modem each time the host is rebooted. But it fails consistently if I reboot the host without disconnecting the modem. The xhci driver will then log these messages while MM is probing the modem: [ 78.014486] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.021419] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.028341] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.035270] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.043035] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.049973] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.056886] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.063808] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.070723] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.077633] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.084552] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.092256] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.099154] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 78.106082] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.007691] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.014629] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.021554] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.028474] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.035391] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.042306] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.049212] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.056669] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.063590] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.070504] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX [ 81.077411] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX and probing eventually fails. Likewise with any attempt to use mbimcli etc. The /dev/cdc-wdm0 device appears dead. So I assume those warnings are real errors, masked by the XHCI_TRUST_TX_LENGTH quirk applied to this xhci device a decade ago. My main concern is figuring out an acceptable host based workaround. Physically reconnecting the modem after each host reboot is not an option. I've tried usbreset, but the EM7565 will only reset its USB device controller Which causes a successful USB disconnect/reconnect and new host driver probing etc. But the modem firmware is not rebooted by that, and the error state is not cleared. I believe the root cause is revealed by logging into the modem with "adb shell" and running dmesg (the host system was rebooted 289 seconds after the last power cycling here): [ 23.224873] mbimd: IP config response sent [ 60.640200] sierra_startup_monitor [ 289.304104] msm-dwc3 8a00000.ssusb: Avail curr from USB = 0 [ 289.304165] diag: USB channel diag disconnected [ 289.305154] dwc3 8a00000.dwc3: request c67e8b00 was not queued to ep0out [ 289.306464] android_work: android_work: sent uevent USB_STATE=DISCONNECTED [ 289.362231] dwc3 8a00000.dwc3: failed to stop controller [ 289.362592] read descriptors [ 289.362609] read strings [ 289.362643] Match for Function Name: diag, Int #0 [ 289.362651] name:diag, interface id:0, id:0 [ 289.362690] Match for Function Name: Function FS Gadget, Int #1 [ 289.362699] name:Function FS Gadget, interface id:1, id:1 [ 289.362712] Match for Function Name: nmea, Int #2 [ 289.362719] name:nmea, interface id:2, id:2 [ 289.362739] Match for Function Name: modem, Int #3 [ 289.362746] name:modem, interface id:3, id:3 [ 289.362772] Match for Function Name: mbim, Int #12 [ 289.362779] name:mbim, interface id:12, id:4 [ 289.362786] Match for Function Name: mbim, Int #13 [ 289.362793] name:mbim, interface id:13, id:5 [ 289.362815] msm-dwc3 8a00000.ssusb: ep [gsi-epin,29] already configured as msm endpoint [ 289.362826] msm-dwc3 8a00000.ssusb: ep [gsi-epout,30] already configured as msm endpoint [ 289.414838] android_work: android_work: did not send uevent (0 0 00000000) [ 289.715338] android_work: android_work: did not send uevent (0 0 00000000) [ 289.924428] android_work: android_work: sent uevent USB_STATE=CONNECTED [ 289.990573] android_usb gadget: super-speed config #1: 86000c8.android_usb [ 289.990633] diag: USB channel diag connected [ 289.990795] msm-dwc3 8a00000.ssusb: Avail curr from USB = 900 [ 290.050233] android_work: android_work: sent uevent USB_STATE=CONFIGURED [ 298.260790] ipa_usb_notify_cb: ipa_usb_notify_cb: Set net_ready_trigger [ 308.165691] ------------[ cut here ]------------ [ 308.165857] WARNING: CPU: 0 PID: 730 at /work/mdm9x50/apps_proc/oe-core/build/tmp-glibc/work-shared/mdm9650-perf/kernel-source/include/linux/kref.h:47 glink_get_ch_ctx+0x48/0x6c() [ 308.165879] Modules linked in: alx(O) embms_kernel(O) [ 308.165930] CPU: 0 PID: 730 Comm: kworker/u2:10 Tainted: G W O 3.18.31 #1 [ 308.165966] Workqueue: glink_pkt_wq glink_pkt_queue_rx_intent_worker [ 308.166035] [<c001483c>] (unwind_backtrace) from [<c0011dc4>] (show_stack+0x10/0x14) [ 308.166092] [<c0011dc4>] (show_stack) from [<c001f1e8>] (warn_slowpath_common+0x68/0x88) [ 308.166144] [<c001f1e8>] (warn_slowpath_common) from [<c001f298>] (warn_slowpath_null+0x18/0x20) [ 308.166191] [<c001f298>] (warn_slowpath_null) from [<c0291aac>] (glink_get_ch_ctx+0x48/0x6c) [ 308.166240] [<c0291aac>] (glink_get_ch_ctx) from [<c0292734>] (glink_queue_rx_intent+0x18/0x93c) [ 308.166290] [<c0292734>] (glink_queue_rx_intent) from [<c02baad0>] (glink_pkt_queue_rx_intent_worker+0x64/0x120) [ 308.166340] [<c02baad0>] (glink_pkt_queue_rx_intent_worker) from [<c0032108>] (process_one_work+0x244/0x3fc) [ 308.166389] [<c0032108>] (process_one_work) from [<c0032f04>] (worker_thread+0x2f0/0x440) [ 308.166439] [<c0032f04>] (worker_thread) from [<c00366dc>] (kthread+0xc4/0xd8) [ 308.166489] [<c00366dc>] (kthread) from [<c000e6e0>] (ret_from_fork+0x14/0x34) [ 308.166515] ---[ end trace e8ab27ce022bebf1 ]--- [ 308.166546] <CORE> (null):(null):DATA40_CNTL[8:0] glink_queue_rx_intent: Channel is not fully opened [ 308.166609] <GLINK_PKT> err: glink_pkt_queue_rx_intent_worker queue_rx_intent failed [ 308.166657] <CORE> (null):(null):DATA40_CNTL[8:0] glink_queue_rx_intent: Channel is not fully opened [ 308.166693] <GLINK_PKT> err: glink_pkt_queue_rx_intent_worker queue_rx_intent failed [ 308.182553] ------------[ cut here ]------------ [ 308.182568] kernel BUG at /work/mdm9x50/apps_proc/oe-core/build/tmp-glibc/work-shared/mdm9650-perf/kernel-source/mm/slub.c:3403! [ 308.182577] Internal error: Oops - BUG: 0 [#1] PREEMPT ARM [ 308.182584] Modules linked in: alx(O) embms_kernel(O) [ 308.182602] CPU: 0 PID: 1007 Comm: mbimd Tainted: G W O 3.18.31 #1 [ 308.182611] task: c151b480 ti: c4bfa000 task.ti: c4bfa000 [ 308.182623] PC is at kfree+0xd8/0x1e4 [ 308.182633] LR is at free_pipe_info+0x60/0x6c [ 308.182643] pc : [<c00e0c54>] lr : [<c00ee46c>] psr: 40070013 [ 308.182643] sp : c4bfbf20 ip : c0e9f80c fp : 7f615358 [ 308.182653] r10: c49dcc08 r9 : c6ce1d90 r8 : c5401ee0 [ 308.182661] r7 : c00ee46c r6 : c577a000 r5 : 00000010 r4 : cfd76d28 [ 308.182669] r3 : 00000000 r2 : 000c4d28 r1 : 00000000 r0 : cfcb2000 [ 308.182678] Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user [ 308.182687] Control: 10c53c7d Table: 8501c059 DAC: 00000015 [ 308.182695] Process mbimd (pid: 1007, stack limit = 0xc4bfa230) [ 308.182703] Stack: (0xc4bfbf20 to 0xc4bfc000) [ 308.182716] bf20: c4c97700 00000010 00000018 00000008 c5401ee0 c6ce1d90 c49dcc08 c00ee46c [ 308.182729] bf40: c4c97700 c49dcc00 c5402cb8 c00ee590 c49dcc00 c5402cb8 00000000 c00e83e4 [ 308.182742] bf60: 00000000 00000000 c151b90c c151b480 00000000 c0eefed0 c000e7c4 c4bfa000 [ 308.182755] bf80: 00000000 c0035218 c49dcc00 c000e7c4 c4bfa000 c4bfbfb0 00000004 c0011794 [ 308.182768] bfa0: 7f653a58 00000000 7f653918 c000e688 00000001 b6ecb060 00000001 b6ecce8c [ 308.182781] bfc0: 7f653a58 00000000 7f653918 00000004 be8a7b54 7f615390 7f653918 7f615358 [ 308.182793] bfe0: 00000002 be8a7ae8 b6a5a860 b6d3a454 60060010 0000004d 00000000 00000000 [ 308.182816] [<c00e0c54>] (kfree) from [<c00ee46c>] (free_pipe_info+0x60/0x6c) [ 308.182833] [<c00ee46c>] (free_pipe_info) from [<c00ee590>] (pipe_release+0xa0/0xb0) [ 308.182849] [<c00ee590>] (pipe_release) from [<c00e83e4>] (__fput+0xd4/0x1d8) [ 308.182866] [<c00e83e4>] (__fput) from [<c0035218>] (task_work_run+0x90/0xa4) [ 308.182885] [<c0035218>] (task_work_run) from [<c0011794>] (do_work_pending+0xb4/0xc8) [ 308.182902] [<c0011794>] (do_work_pending) from [<c000e688>] (work_pending+0xc/0x20) [ 308.182915] Code: 1a00000b e5943000 e3130903 1a000000 (e7f001f2) [ 308.182924] ---[ end trace e8ab27ce022bebf2 ]--- So it looks like the mbimd process in the modem manages to crash the modem kernel. Nice... But what can be done about that? From my point of view, fixing the issue in the modem firmware is out of the question. The modem is running SWI9X50C_01.14.13.00, which the latest available. We need a woraround for the host system. But what? Since I have adb access, I can obviously just reboot the modem. Which works. But what about the "normal" modem configuration? Figure out a way to cut USB port power? I guess this is possible on that platform, although I don't think it's implemented. It's not a very generic solution, though. But maybe this isn't a very generic problem? Maybe it's just this exact combo of host controller, device and continous power to the port during reboot (which I'm very much in favour of in general)? Have anyone else noticed similar problems with the EM7565? Or any other Qualcomm MDM9x50 device? I assume this bug isn't really Sierra Wireless specific, but rather something they got from Qualcomm. You'll obviously not see this problem if your host system cuts power to the USB port on reboots. Which unfortunately often is the case. I've tried to provoke the issue by e.g unbinding and rebinding the xhci_pci driver, or by resetting the USB device. But none of these actions cause the error to show up. There are some occasional snags, but nothing casuing MM probe failures. The modem firmware logs the suspicious "dwc3 8a00000.dwc3: failed to stop controller" on unbind, but it doesn't crash. And it works as usual on rebind. Bjørn