> -----Original Message-----
> From: Ye, MingjinX <mingjinx...@intel.com>
> Sent: Thursday, November 2, 2023 10:39 AM
> To: Zhang, Qi Z <qi.z.zh...@intel.com>; dev@dpdk.org
> Cc: Yang, Qiming <qiming.y...@intel.com>; Zhou, YidingX
> <yidingx.z...@intel.com>; sta...@dpdk.org
> Subject: RE: [PATCH v3] net/ice: fix crash on closing representor ports
>
>
>
> > -----Original Message-----
> > From: Zhang, Qi Z <qi.z.zh...@intel.com>
> > Sent: Wednesday, November 1, 2023 6:49 PM
> > To: Ye, MingjinX <mingjinx...@intel.com>; dev@dpdk.org
> > Cc: Yang, Qiming <qiming.y...@intel.com>; Zhou, YidingX
> > <yidingx.z...@intel.com>; sta...@dpdk.org
> > Subject: RE: [PATCH v3] net/ice: fix crash on closing representor
> > ports
> >
> >
> >
> > > -----Original Message-----
> > > From: Ye, MingjinX <mingjinx...@intel.com>
> > > Sent: Wednesday, November 1, 2023 6:14 PM
> > > To: dev@dpdk.org
> > > Cc: Yang, Qiming <qiming.y...@intel.com>; Zhou, YidingX
> > > <yidingx.z...@intel.com>; Ye, MingjinX <mingjinx...@intel.com>;
> > > sta...@dpdk.org; Zhang, Qi Z <qi.z.zh...@intel.com>
> > > Subject: [PATCH v3] net/ice: fix crash on closing representor ports
> > >
> > > The data resource in struct rte_eth_dev is cleared and points to
> > > NULL when the DCF port is closed.
> > >
> > > If the DCF representor port is closed after the DCF port is closed,
> > > a segmentation fault occurs because the representor port accesses
> > > the data resource released by the DCF port.
> > >
> > > This patch checks if the resource is present before accessing.
> > >
> > > Fixes: 5674465a32c8 ("net/ice: add DCF VLAN handling")
> > > Fixes: da9cdcd1f372 ("net/ice: fix crash on representor port
> > > closing")
> > > Cc: sta...@dpdk.org
> > >
> > > Signed-off-by: Mingjin Ye <mingjinx...@intel.com>
> > > ---
> > > v3: New solution.
> > > ---
> > > drivers/net/ice/ice_dcf_vf_representor.c | 8 +++++---
> > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/ice/ice_dcf_vf_representor.c
> > > b/drivers/net/ice/ice_dcf_vf_representor.c
> > > index b9fcfc80ad..8c45e28f02 100644
> > > --- a/drivers/net/ice/ice_dcf_vf_representor.c
> > > +++ b/drivers/net/ice/ice_dcf_vf_representor.c
> > > @@ -111,14 +111,16 @@ ice_dcf_vf_repr_link_update(__rte_unused
> > struct
> > > rte_eth_dev *ethdev, static __rte_always_inline struct ice_dcf_hw *
> > > ice_dcf_vf_repr_hw(struct ice_dcf_vf_repr *repr) {
> > > - struct ice_dcf_adapter *dcf_adapter =
> > > - repr->dcf_eth_dev->data->dev_private;
> > > + struct rte_eth_dev_data *dcf_data = repr->dcf_eth_dev->data;
> >
> > Seems this expose another issue, if dcf port already be closed, the
> > dcf_eth_dev instance could already be reused by another driver.
> Your analysis is spot on.
>
> The reason for the issue:
> Affected by 36c46e738120c381cf663c96692454c5aa75e203 commit.
> da9cdcd1f37220e87db23993d6352637d71df25b commit does not fix the
> issue.
>
> v2 patch:
> Notify every proxy port of DCF port status.
>
> v3 patch:
> Based on da9cdcd1f37220e87db23993d636352637d71df25b, made
> optimization to fix the issue.
>
> > So we can't assume dcf_eth_dev->data is NULL, I think you can refine
> > based on v2's method, but don't update dcf_valid flag in representor
> > port's dev_stop.
> Implementation difficulties:
> 1. When the DCF is created, all associated proxy ports are created and each
> proxy port (struct rte_eth_dev) pointer is recorded in an array.
> 2. When shutting down the DCF, all all proxy ports are notified at
> ice_dcf_vf_repr_stop_all. Finally the array is deleted.
>
> Based on the original scenario, notifying all agent ports of DCF state
> failure is
> the simplest and most effective way. This is very similar to the v2 patch
> scenario.
Yes, that's why I suggest to refine based on v2 after notice the issue on v3,
seems we need to maintain a per repr flag for DCF status, but need to update it
carefully.
>
> Check each DCF port flag?
> If the DCF port has failed and the resource has been released. the DCF port
> status, naturally, is not available. This is very similar to the v3 patch
> scenario.
Check dev->data is NULL does not work, a better way is to check eth_dev->state
and device name to make sure it is still used by DCF.
But I think v2's method is more reliable.
>
> Other ideas
> 1. Re-implement DCF port and proxy port communication, can't directly use
> (struct rte_eth_dev) pointer to access the device resources on the other end.
>
> 2. Designed to return an error when removing the proxy port directly. The
> proxy port device is managed by the control DCF (the proxy port is created by
> the DCF port and the removal work should be done by it).
>
> The above 2 options will bring a huge amount of work. It seems beyond the
> scope of this discussion.
>