Hi Ophir, Adrien, On Fri, Mar 02, 2018 at 08:12:58PM +0100, Adrien Mazarguil wrote: > On Wed, Feb 28, 2018 at 04:47:30PM +0000, Ophir Munk wrote: > > The following scenario causes a crash in function mlx4_get_ifname > > 1. On testpmd startup mlx4 device is probed and started > > 2. mlx4 sriov is disabled. As a result an RMV event is sent to > > testpmd which closes the device and nullify the priv struct > > members. > > 3. Running 'show port info all' in testpmd results in segmentation > > fault because of accessing NULL pointer priv->ctx > > > > The fix is to return with an error from mlx4_get_ifname() if priv->ctx > > member is NULL. > > So priv->ctx is NULL after mlx4_dev_close() was called. In my opinion the > fact testpmd is allowed to call rte_eth_dev_info_get() on a closed port is > fishy, there are quite a few other ethdev callbacks that may end up > crashing in such a scenario. > > Since commit 284c908cc588 ("app/testpmd: request device removal interrupt"), > testpmd automatically calls rte_eth_dev_close() and rte_eal_dev_detach() > after receiving a removal event on a port. > > rte_eth_dev_close() documentation says: > > "Close a stopped Ethernet device. The device cannot be restarted! > The function frees all resources except for needed by the > closed state. To free these resources, call rte_eth_dev_detach()." > > Unfortunately testpmd calls rte_*eal*_dev_detach() not > rte_*eth*_dev_detach(), the former doesn't release ethdev ports while the > latter does. I think it's a mistake, there's no point in keeping a closed > device around after it's been physically unplugged. > > In short, rmv_event_callback() should call detach_port() instead of > rte_eal_dev_detach(). >
This is correct, the issue is actually testpmd making a mistake when calling directly rte_eal_dev_detach(). The fix should thus be simply to change this call. > > Fixes: 61cbdd419478 ("net/mlx4: separate device control functions") > > Cc: sta...@dpdk.org > > > > Signed-off-by: Ophir Munk <ophi...@mellanox.com> > > The above suggests the problem is actually in testpmd and was introduced in > v17.05 by commit 284c908cc588. The proposed patch is a workaround that > doesn't address the underlying issue, thus NACK unless proven otherwise :) > > > --- > > drivers/net/mlx4/mlx4_ethdev.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/drivers/net/mlx4/mlx4_ethdev.c b/drivers/net/mlx4/mlx4_ethdev.c > > index 3bc6927..cca5223 100644 > > --- a/drivers/net/mlx4/mlx4_ethdev.c > > +++ b/drivers/net/mlx4/mlx4_ethdev.c > > @@ -67,6 +67,9 @@ mlx4_get_ifname(const struct priv *priv, char > > (*ifname)[IF_NAMESIZE]) > > char match[IF_NAMESIZE] = ""; > > > > { > > + if (priv->ctx == NULL) > > + return -ENOENT; > > + > > MKSTR(path, "%s/device/net", priv->ctx->device->ibdev_path); > > > > dir = opendir(path); > > -- > > 2.7.4 > > > > -- > Adrien Mazarguil > 6WIND -- Gaëtan Rivet 6WIND