Sorry, I replied to Marek only but meant to reply to all. El Fri, Jun 09, 2023 at 03:20:33AM +0200, Marek Vasut deia:
> > No. Well, in some tests yes and some no, but I got the error in all cases. > > This is doubtful. It is mandatory to run 'usb start' or 'usb reset' before > you would get any meaningful result out of 'usb info'. Without either, you > would get 'USB is stopped.' message. Could it be there are some extra > scripts in your environment that manipulate the USB ? > I saw usb_boootdev_hunt() calls usb_init() in drivers/usb/host/usb_bootdev.c But maybe I don't get that called and it's really something silly in my setup as you say later... Maybe it doesn't get called unless it finds nothing else useful to boot. > > Can you test with stock U-Boot ? > I don't know. I'll see if I have time ... I'd rather read the code to understand what's the condition for finding bootdevices... > Can you test with another USB stick, i.e. is the issue specific to this USB > stick ? > I could test this, yes. > Is the issue specific to this partition layout of this USB stick, i.e. if > you clone (dd if=... of=...) the content of the USB stick to another USB > stick, does the error still occur. > I'll try to partition and flash a new USB. > [...] > > > Model: Radxa ROCK Pi 4B > > Net: eth0: ethernet@fe300000 > > Hit any key to stop autoboot: 2 1 0 > > rockchip_pcie pcie@f8000000: PCIe link training gen1 timeout! > > Bus usb@fe380000: USB EHCI 1.00 > > Bus usb@fe3c0000: USB EHCI 1.00 > > Bus usb@fe800000: Register 2000140 NbrPorts 2 > > Starting the controller > > USB XHCI 1.10 > > Bus usb@fe900000: Register 2000140 NbrPorts 2 > > Starting the controller > > USB XHCI 1.10 > > scanning bus usb@fe380000 for devices... 1 USB Device(s) found > > scanning bus usb@fe3c0000 for devices... 2 USB Device(s) found > > scanning bus usb@fe800000 for devices... 1 USB Device(s) found > > scanning bus usb@fe900000 for devices... 1 USB Device(s) found > > rockchip_pcie pcie@f8000000: failed to find ep-gpios property > > ethernet@fe300000 Waiting for PHY auto negotiation to complete......... > > TIMEOUT ! > > Could not initialize PHY ethernet@fe300000 > > rockchip_pcie pcie@f8000000: failed to find ep-gpios property > > ethernet@fe300000 Waiting for PHY auto negotiation to complete......... > > TIMEOUT ! > > Could not initialize PHY ethernet@fe300000 > > Is this some $preboot script which initializes your hardware ? > Mmm... yes, I used to have it... I thought not in this test, but I'd better recheck Anyway, one should be allowed to stop the boot, call usb start and usb tree and don't get a reset, shouldn't one? > => printenv preboot > I'll send this later when I repeat the test. I'd like to find a minimal test case or something... > > => usb tree > > USB device tree: > > 1 Hub (480 Mb/s, 0mA) > > u-boot EHCI Host Controller > > 1 Hub (480 Mb/s, 0mA) > > | u-boot EHCI Host Controller > > | > > uclass_id=64 > > |+-2 Mass Storage (480 Mb/s, 200mA) > > TDK LoR TF10 07032B6B1D0ACB96 > > uclass_id=22 > > uclass_id=25 > > "Synchronous Abort" handler, esr 0x96000010, far 0x948 > > elr: 00000000002157d4 lr : 00000000002157d4 (reloc) > > elr: 00000000f3f4f7d4 lr : 00000000f3f4f7d4 > > Take the u-boot (unstripped elf) which matches this binary, and run > aarch64-...objdump -lSD on it, then search for the $lr value, see > doc/develop/crash_dumps.rst for details. That should tell you where exactly > the crash occurred. Where did it occur ? > I didn't do it exactly so, but from u-boot.map I gathered that it was in cmd/usb.c and the fact that my patch fixed it implies the problem is the functions usb_show_tree_graph() or usb_show_info() get called recursively with null as a first parameter. Now I don't have that u-boot.map anymore and would have to repeat the experiment, to find out exactly as you say, so I won't do it right now. But thanks, understood. The reason usb_show_tree_graph() gets called with a null usb_device * is that the code in cmd/usb.c for usb info and usb tree assumes everything a UCLASS_MASS_STORAGE device can have as children are devices with some usb_device in their dev_get_parent_priv(). It carves out exceptions to this general rule for UCLASS_USB_EMUL and UCLASS_BLK, but not for UCLASS_BOOTDEV. When it finds a child that is UCLASS_BOOTDEV it happily recurses on it passing its parent_priv as usb_device, but the bootdev code did not put any usb_device there, it's null. So the first access causes a null pointer dereference. I would have to wrap my mind around more code to start understanding if it's better to give that UCLASS_BOOTDEV some usb_device as parent priv data, or it is better to give USB devices that can be enumerated for listing (usb tree or usb info) some RECURSIBLE flag that indicates their priv parent data is reliably a usb_device. So checking that the alledged usb_device at least isn't null as in my patch is possibly a partial solution. I'm sure if it's null we shouldn't call, but if it points to something other than a usb_device we shouldn't either, and it doesn't address why it is null (well, because it's not really a USB internal node, not even a proper leaf, so it shouldn't be recursed anyway). In usb_show_info() it is similar, usb_display_desc() gets called with a null udev because that's what came in. Recursion is avoided for UCLASS_USB_EMUL or UCLASS_BLK but not for UCLASS_BOOTDEV. A different solution could be to expand the exception to UCLASS_BOOTDEV, but it still seems a wrong strategy to expect to know everything you can find so you can list all the exceptions, instead of checking that you have something expected that you can recurse on. Not completely sure, just smells so to me. At least checking for null is more general. Maybe the solution is to fix common/usb_storage.c when it calls bootdev_setup_sibling_blk(), to ensure there's some useful usb_device there as parent priv. Or something needs fixing in drivers/usb/host/usb_bootdev.c ??? But I'm not sure. We really shouldn't call recursively for usb info or tree anyway. For now I was trying to understand when that UCLASS_BOOTDEV is added and why, and whether it should be removed at some time. This should hint me at what is the minimum scenario to reproduce the issue. > > The NAK is really only to prevent this from accidentally going on. > The NAK it's OK. I like people to want to understand stuff. Don't worry. > Please see above, maybe it could be narrowed down ? I'll see if I can test better and send more useful reports. Not sure when. I'm not sure I'll have the time to learn all I need. I just hoped someone else had it in mind...