Hi Gaetan

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.ri...@6wind.com]
> Sent: Wednesday, December 13, 2017 11:56 PM
> To: Matan Azrad <ma...@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarg...@6wind.com>; Thomas Monjalon
> <tho...@monjalon.net>; dev@dpdk.org; sta...@dpdk.org
> Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device handling
>
> Hi again Matan,
>
> On Wed, Dec 13, 2017 at 05:09:16PM +0100, Gaëtan Rivet wrote:
> > On Wed, Dec 13, 2017 at 03:48:46PM +0000, Matan Azrad wrote:
> > > Hi Gaetan
> > > Thanks for the review.
> > > Some comments..
> > >
> > > > -----Original Message-----
> > > > From: Gaëtan Rivet [mailto:gaetan.ri...@6wind.com]
> > > > Sent: Wednesday, December 13, 2017 5:17 PM
> > > > To: Matan Azrad <ma...@mellanox.com>
> > > > Cc: Adrien Mazarguil <adrien.mazarg...@6wind.com>; Thomas Monjalon
> > > > <tho...@monjalon.net>; dev@dpdk.org; sta...@dpdk.org
> > > > Subject: Re: [PATCH v2 4/4] net/failsafe: fix removed device
> > > > handling
> > > >
> > > > Hi Matan,
> > > >
> > > > On Wed, Dec 13, 2017 at 02:29:30PM +0000, Matan Azrad wrote:
> > > > > There is time between the physical removal of the device until
> > > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > > applications still don't know about the removal and may call
> > > > > sub-device control operation which should return an error.
> > > > >
> > > > > In previous code this error is reported to the application
> > > > > contrary to fail-safe principle that the app should not be aware of
> > > > > device removal.
> > > > >
> > > > > Add an removal check in each relevant control command error flow
> > > > > and prevent an error report to application when the sub-device is
> > > > > removed.
> > > > >
> > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > > > > Cc: sta...@dpdk.org
> > > > >
> > > >
> > > > This patch is not a fix.
> > > > It relies on an eth_dev API evolution. Without this evolution,
> > > > this patch is meaningless and would break compilation if backported in
> > > > stable branch.
> > > >
> > >
> > > It is a fix because the bug is finally solved by this patch.
> > > I agree that it cannot be backported itself, but maybe all the series
> > > should be backported.
> > > Other idea:
> > > Add new patch which documents the bug and backport it.
> > > Remove it in this patch and remove cc stable from it.
> > > What do you think?
> > >
> >
> > I think you could write a crude version that would not rely on the
> > ethdev evolution (checking sdev->remove only), which would be
> > incomplete but still better than nothing.
> > And why not in this patch document the issue.
> > Without any dependency outside failsafe, this could be backported.
> >
> > Then complete the fix with the API evolution if the new devops is
> > accepted.
> >
> > > > Please remove those tags.
> > > >
> > > > > Signed-off-by: Matan Azrad <ma...@mellanox.com>
> > > > > ---
> > > > >  drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
> > > > >  drivers/net/failsafe/failsafe_ops.c     | 34 ++++++++++++++++++++++-----------
> > > > >  drivers/net/failsafe/failsafe_private.h | 10 ++++++++++
> > > > >  3 files changed, 44 insertions(+), 18 deletions(-)
> > > >
> > > > < ... >
> > > >
> > > > > +/*
> > > > > + * Check if sub device was removed.
> > > > > + */
> > > > > +static inline int
> > > > > +fs_is_removed(struct sub_device *sdev) {
> > > > > +	if (sdev->remove == 1 || rte_eth_dev_is_removed(PORT_ID(sdev)) != 0)
> > > > > +		return 1;
> > > > > +	return 0;
> > > > > +}
> > > >
> > > > Have you considered adding this check within the subdev iterator itself?
> > > > I think it would prevent you from having to add it to each return
> > > > value checks.
> > > >
> > > > It is still MT-unsafe anyway.
> > > >
> > >
> > > This fix doesn't come to solve the MT issue, It comes to solve the error
> > > report to application because of removal.
> > > Adding the check in subdev iterator doesn't make sense for this issue.
> > >
> > > Matan.
> >
> > If you add this check in the iterator itself, you would skip removed
> > devices before attempting operating upon them, right?
> >
> > Then it should probably help with your issue, unless you tested it and
> > verified that it didnt?
> >
> > Something like this:
> >
> > ---8<---
> >
> > diff --git a/drivers/net/failsafe/failsafe_private.h
> > b/drivers/net/failsafe/failsafe_private.h
> > index d81cc3ca6..62ddc0689 100644
> > --- a/drivers/net/failsafe/failsafe_private.h
> > +++ b/drivers/net/failsafe/failsafe_private.h
> > @@ -316,8 +316,12 @@ fs_find_next(struct rte_eth_dev *dev,
> >  	subs = PRIV(dev)->subs;
> >  	tail = PRIV(dev)->subs_tail;
> >  	while (sid < tail) {
> > +		if (min_state > DEV_PROBED &&
> > +		    fs_is_removed(&sub[sid]))
> > +			goto next;
> >  		if (subs[sid].state >= min_state)
> >  			break;
> > +next:
> >  		sid++;
> >  	}
> >  	*sid_out = sid;
> >
> > --->8---
> >
> > Only issue being that it is completely racy, but as this MT-unsafe
> > property is inescapable we might as well ignore it and go for KISS.
> >
> > If that's enough, I would prefer instead of having this additional
> > check added to all rte_eth operations.
> >
>
> Ok, actually you were right here to do it this way. The "is_removed"
> check needs to happen after the operation attempt to effectively mitigate
> the possible race. Checking before attempting the call will be much less
> effective.
>
> That being said, would it be cleaner to have eth_dev ops return -ENODEV
> directly, and check against it within fail-safe?
>
I think that, according to the "is_removed" semantics, we must return a
Boolean value (any value different from '0' means that the device is
removed), like other functions in the C library (for example, isspace()).

Thanks.

> --
> Gaëtan Rivet
> 6WIND
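
To make the Boolean semantics concrete, here is a minimal, self-contained
sketch. It is only an illustration under stated assumptions: every demo_*
name is a hypothetical stand-in and not the actual fail-safe or ethdev code.
The predicate returns zero for "present" and non-zero for "removed", and it
is consulted only after an operation has already failed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a fail-safe sub-device (not the real struct). */
struct demo_sub_device {
	int remove;     /* set once the RMV event has been handled */
	int hw_removed; /* stand-in for what the PMD would detect */
	int op_ret;     /* value the fake control operation returns */
};

/* Hypothetical stand-in for an ethdev-level removal query. */
static int
demo_eth_dev_is_removed(const struct demo_sub_device *sdev)
{
	return sdev->hw_removed;
}

/*
 * Predicate in the style of isspace(): 0 means "still present",
 * any non-zero value means "removed".
 */
static int
demo_is_removed(const struct demo_sub_device *sdev)
{
	assert(sdev != NULL);
	return sdev->remove == 1 || demo_eth_dev_is_removed(sdev) != 0;
}

/* Hypothetical control operation on the sub-device. */
static int
demo_ctrl_op(const struct demo_sub_device *sdev)
{
	return sdev->op_ret;
}

/*
 * Attempt the operation first, then check for removal only on failure.
 * An error caused by a removed device is hidden from the caller;
 * any other error is passed through unchanged.
 */
static int
demo_call_and_filter(const struct demo_sub_device *sdev)
{
	int ret = demo_ctrl_op(sdev);

	if (ret != 0 && demo_is_removed(sdev))
		return 0;
	return ret;
}
```

With this shape the caller never has to interpret a specific errno value to
learn about removal. It just tests the predicate for non-zero, the same way
one tests the result of isspace().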