>Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error code while >handling sub-device add > >On 09/10/20 20:30 +0000, Long Li wrote: >> >Subject: Re: [dpdk-dev] [PATCH] net/failsafe: check correct error >> >code while handling sub-device add >> > >> >On 05/10/20 11:42 +0200, Gaëtan Rivet wrote: >> >> Hi, >> >> >> >> On 02/10/20 17:01 -0700, Long Li wrote: >> >> > From: Long Li <lon...@microsoft.com> >> >> > >> >> > When adding a sub-device, it's possible that the sub-device is >> >> > configured successfully but later fails to start. This error >> >> > should not be >> >masked. >> >> >> >> Some of those errors are meant to be masked: -EIO, when the device >> >> is marked as removed at the ethdev level (see eth_err() in >rte_ethdev.c:819). >> >> >> >> > The driver needs to check the error status to prevent endless >> >> > loop of trying to start the sub-device. >> >> >> >> If the ethdev layer error is due to the device being removed, and >> >> failsafe loops on trying to sync the eth device to its own state, >> >> then an RMV event should have been emitted but wasn't or it was >> >> missed by failsafe. >> >> >> >> If the ethdev layer error is *not* due to the device being removed, >> >> the error should be != -EIO, and sdev->remove should not be set, so >> >> fs_err() should not mask it and it should be seen by the app. >> >> >> >> Can you provide the following details: >> >> >> >> * What is the return code of rte_eth_dev_start() that is masked in your >> >> start loop? >> >> >> >> * Is the device marked as removed in failsafe? >> >> >> >> * Is the device marked as removed in ethdev? >> >> >> >> * Was there an RMV event generated for the device? Whether yes or >no, >> >> is it correct? >> >> >> >> Thanks, >> >> >> > >> >Hello Li, >> > >> >I've found the previous mail thread [1] where you described how you >> >got this error. In your description, you say that you try unplug then >> >quick replug, before any event is processed? >> >> Hi Gaëtan, >> >> Sorry for getting back late. I had trouble with my email. >> >> I think the issue is that: when the failsafe driver tries to start the sub >> device, >it may fail. The failsafe driver needs to deal with the failure. Here is >another >log that I captured: >> >> net_failsafe: Starting sub_device 0 >> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation >> failure: Cannot allocate memory >> net_mlx4: cannot initialize common RSS resources (queue 0): unable to >> create Rx queue resources: Cannot allocate memory >> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate >> memory >> net_failsafe: Starting sub_device 0 >> net_mlx4: error while attaching Rx queue 0x11da428c0: CQ creation >> failure: Cannot allocate memory >> net_mlx4: cannot initialize common RSS resources (queue 0): unable to >> create Rx queue resources: Cannot allocate memory >> net_mlx4: 0x1bf2280: cannot initialize RSS resources: Cannot allocate >> memory >> >> It's called from fs_dev_start(). The MLX4 can't start, it's in a state being >removed in ethdev. But this error is being masked by fs_err(): >> >> static inline int >> fs_err(struct sub_device *sdev, int err) { >> /* A device removal shouldn't be reported as an error. */ >> if (sdev->remove == 1 || err == -EIO) >> return rte_errno = 0; >> return err; >> } >> >> So fs_dev_start() always return a success, returned to >failsafe_eth_dev_state_sync(), which is repeated called from the alarm >fs_hotplug_alarm(). This goes to an endless loop. >> >> I believe failsafe_eth_rmv_event_callback() has been called prior to this. It >sets sdev->remove = 1, but it didn't help with exiting the loop. fs_err() >always >returns 0, and fs_dev_start() is called from fs_hotplug_alarm(), that keeps re- >alarming itself. > >Ok, the issue here is that even with sdev->remove == 1, the device is not >actually removed, but it should. > >I think we discussed it with Stephen off-list not too long ago actually. >In his case the actual bug was in vdev_netvsc, but there was still this issue >in >failsafe. It is happening there: > > void > failsafe_dev_remove(struct rte_eth_dev *dev) > { > struct sub_device *sdev; > uint8_t i; > > FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) > /* ^^^^^^^^^^^^ HERE is the bug */ > if (sdev->remove && fs_rxtx_clean(sdev)) { > if (fs_lock(dev, 1) != 0) > return; > fs_dev_stats_save(sdev); > fs_dev_remove(sdev); > fs_unlock(dev, 1); > } > } > >A device with the remove flag will be cleaned up only if it is already in an >active >state. This is not correct, it should be removed in all cases -- even if it >stayed >stuck in PROBED state. > >Here is a fix I propose for this issue: >https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails. >dpdk.org%2Farchives%2Fdev%2F2020- >October%2F185921.html&data=04%7C01%7Clongli%40microsoft.com%7C >0cfccf19cf3d42ee549008d86eba3d6e%7C72f988bf86f141af91ab2d7cd011db47 >%7C1%7C0%7C637381093902558861%7CUnknown%7CTWFpbGZsb3d8eyJWIjo >iMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C300 >0&sdata=j13foFEbNV7DqWxl2ahLloEZSbCO96lmPaqSfjU5KsQ%3D&r >eserved=0 > >Could you please test it and verify it solves your issue?
Thank you! I have been testing this patch for 24 hours and it worked well. > >> >> Why do we check (sdev->remove == 1 || err == -EIO) in fs_err()? If sdev- >>remove is set, this will always return 0, regardless of err. >> > >if sdev->remove == 1, there should not even be a call to sub-device start ops >actually. The device is flagged for removal, next thing to happen is to remove >it, even if the device reappeared on the bus in-between. The failsafe needs to >consume the removed flag (close the device then remove it), then it will be >ready to detect the new device and will probe it back from the beginning. > >> It's hard to reproduce with gdb. Hopefully I can get a good trace in gdb. >> >> Thanks, >> >> Long >> >> > >> >If that's the case, it seems a clear race condition, and an issue of >> >missing the removal event of the device. I would not say yet that the >> >bug is in failsafe, but it could be in ethdev. >> > >> >Can you please check whether the device removal event was properly >> >generated in rte_ethdev? Failsafe (and any other hotplug support >> >layer >> >actually) will depend on it so it should be first checked to work. >> > >> >Thanks, >> > >> >[1]: >> >>https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmails >. >> >dpdk.org%2Farchives%2Fdev%2F2020- >> >>September%2F182977.html&data=02%7C01%7Clongli%40microsoft.co >m >> >>%7C3e044201096c4f48508108d86c6f319d%7C72f988bf86f141af91ab2d7cd011 >d >> >>b47%7C1%7C0%7C637378572568404734&sdata=DAUVxBJQtJx8mUSWlP >I >> >DtenhpMmdiMPpIczmxOdrCqE%3D&reserved=0 >> > >> >-- >> >Gaëtan > >Regards, >-- >Gaëtan