>Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while >handling sub-device add > >On 05/10/20 11:42 +0200, Gaëtan Rivet wrote: >> Hi, >> >> On 02/10/20 17:01 -0700, Long Li wrote: >> > From: Long Li <lon...@microsoft.com> >> > >> > When adding a sub-device, it's possible that the sub-device is >> > configured successfully but later fails to start. This error should not be >masked. >> >> Some of those errors are meant to be masked: -EIO, when the device is >> marked as removed at the ethdev level (see eth_err() in rte_ethdev.c:819). >> >> > The driver needs to check the error status to prevent endless loop >> > of trying to start the sub-device. >> >> If the ethdev layer error is due to the device being removed, and >> failsafe loops on trying to sync the eth device to its own state, then >> an RMV event should have been emitted but wasn't or it was missed by >> failsafe. >> >> If the ethdev layer error is *not* due to the device being removed, >> the error should be != -EIO, and sdev->remove should not be set, so >> fs_err() should not mask it and it should be seen by the app. >> >> Can you provide the following details: >> >> * What is the return code of rte_eth_dev_start() that is masked in your >> start loop? >> >> * Is the device marked as removed in failsafe? >> >> * Is the device marked as removed in ethdev? >> >> * Was there an RMV event generated for the device? Whether yes or no, >> is it correct? >> >> Thanks, >> > >Hello Li, > >I've found the previous mail thread [1] where you described how you got this >error. In your description, you say that you try unplug then quick replug, >before any event is processed?
Hi Gaëtan, Sorry for getting back late. I had trouble with my email. I think the issue is that: when the failsafe driver tries to start the sub device, it may fail. The failsafe driver needs to deal with the failure. Here is another log that I captured: net_failsafe: Starting sub_device 0 net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory net_failsafe: Starting sub_device 0 net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation failure: Cannot allocate memory net_mlx4: cannot initialize common RSS resources (queue 0): unable to create Rx queue resources: Cannot allocate memory net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate memory It's called from fs_dev_start(). The MLX4 can't start, it's in a state being removed in ethdev. But this error is being masked by fs_err(): static inline int fs_err(struct sub_device *sdev, int err) { /* A device removal shouldn't be reported as an error. */ if (sdev->remove == 1 || err == -EIO) return rte_errno = 0; return err; } So fs_dev_start() always return a success, returned to failsafe_eth_dev_state_sync(), which is repeated called from the alarm fs_hotplug_alarm(). This goes to an endless loop. I believe failsafe_eth_rmv_event_callback() has been called prior to this. It sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() always returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re-alarming itself. Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev->remove is set, this will always return 0, regardless of err. It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb. Thanks, Long > >If that's the case, it seems a clear race condition, and an issue of missing >the >removal event of the device. I would not say yet that the bug is in failsafe, >but >it could be in ethdev. > >Can you please check whether the device removal event was properly >generated in rte_ethdev? Failsafe (and any other hotplug support layer >actually) will depend on it so it should be first checked to work. > >Thanks, > >[1]: >https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails. >dpdk.org%2Farchives%2Fdev%2F2020- >September%2F182977.html&data=02%7C01%7Clongli%40microsoft.com >%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011d >b47%7C1%7C0%7C637378572568404734&sdata=DAUVxBJQtJx8mUSWlPI >DtenhpMmdiMPpIczmxOdrCqE%3D&reserved=0 > >-- >Gaëtan