On Tue, 15 Feb 2022 22:16:28 +0530 Vipul Ashri <vipul.as...@oracle.com> wrote:
> On 2/14/2022 10:24 PM, Stephen Hemminger wrote:
> > On Mon, 14 Feb 2022 13:09:19 +0000
> > Vipul Ashri <vipul.as...@oracle.com> wrote:
> >
> >> PORT 0 supports 16 rx queues and 16 tx queues (driver_name = net_failsafe,
> >> driver_type = 16)
> >>
> >> PORT 0 is polling for link-change, interrupts disabled
> >>
> >> [DPDK] tap_flow_create(): Kernel refused TC filter rule creation (17):
> >> File exists
> > Looks like secondary process support doesn't work with the flow rules logic.
> > Maybe after that you are into error paths that may not recover correctly??
> Thanks! Stephen for looking at my analysis,
>
> yes some hotplug synchronization issue between eal_intr_thread and primary
> thread, but we are able to recover with this patch.
>
> Reason is this fail-safe flow is inside our custom added boot-time polling to
> update DPDK stats and calling ifindex ioctl to get interface data. Ideally we
> should not start polling so early. but moreover calling ifindex ioctl is
> generic functionality and should not break failsafe. We added this patch and
> gracefully prevented the so many multiple crashes.
>
> Setup details :
> Azure testbed with Accelerated Networking(SRIOV) enabled, failsafe using tap +
> mellanox driver.

I don't work for Azure anymore, so I can't really test this.

A short explanation of why this patch is stalled:

It seems this patch is trying to avoid a crash after an earlier problem has
already occurred. That is OK to do, but the original problem is still there,
and testing the patch is impossible without a modified application.

For the normal user, this just adds more always-true checks in the
configuration path. That is OK, but it does add clutter.

Since failsafe should be deprecated, fixing this seems less relevant as well.