01/02/2023 08:52, Andrew Rybchenko: > On 1/18/23 18:44, Rongwei Liu wrote: > > +* **Added process state in ethdev to improve live migration.** > > + > > + Hot upgrade of an application may be accelerated by configuring > > + the new application in standby state while the old one is still active. > > + Such double ethdev configuration of the same device is possible > > + with the added process state API. > > + > > Will we have the same API for other device classes? Any ideas > how to avoid duplication? (I'm afraid we don't have generic > place to put such functionality).
It could be needed to have such feature in other classes, yes. That's a device feature, so it must be in each class, with different flags/options per class. > > + /* Check if all devices support process role. */ > > + RTE_ETH_FOREACH_DEV(port_id) { > > + dev = &rte_eth_devices[port_id]; > > + if (*dev->dev_ops->process_set_role != NULL && > > + *dev->dev_ops->dev_infos_get != NULL && > > + (*dev->dev_ops->dev_infos_get)(dev, &dev_info) == 0 && > > + (dev_info.dev_capa & RTE_ETH_DEV_CAPA_PROCESS_ROLE) != > > 0) > > Why do we need both non-NULL callback and dev_capa flag? > Isn't just non-NULL callback sufficient? We could have a callback returning ENOTSUP. Anyway we need the capability for application query. > > + continue; > > + rte_errno = ENOTSUP; > > + return -rte_errno; > > + } > > + /* Call the driver callbacks. */ > > + RTE_ETH_FOREACH_DEV(port_id) { > > + dev = &rte_eth_devices[port_id]; > > + if ((*dev->dev_ops->process_set_role)(dev, standby, flags) < 0) > > Please, add error logging on failure. good idea > > > + goto failure; > > IMHO it would be useful to have info logs on success. When setting the role, we could have debug log, yes. > > + ret++; > > ret variable name is very confusing, please, be > more specific in variable naming. > > > + } > > + return ret; > > + > > +failure: > > + /* Rollback all changed devices in case one failed. */ > > + if (ret) { > > Compare vs 0 > > > + RTE_ETH_FOREACH_DEV(port_id) { > > + dev = &rte_eth_devices[port_id]; > > + (*dev->dev_ops->process_set_role)(dev, !standby, flags); > > IMHO, it would be useful to have error logs on rollback > failure and info logs on success. > > > + if (--ret == 0) > > + break; > > + } > > + } > > + rte_errno = EPERM; > > Why is it always EPERM? Isn't it better to forward > failure code from driver? > > > + return -rte_errno; > > +} Thanks for the good detailed review Andrew. > > --- a/lib/ethdev/rte_ethdev.h > > +++ b/lib/ethdev/rte_ethdev.h > > @@ -1606,6 +1606,8 @@ struct rte_eth_conf { > > #define RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP RTE_BIT64(3) > > /** Device supports keeping shared flow objects across restart. */ > > #define RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP RTE_BIT64(4) > > +/** Device supports process role changing. @see rte_eth_process_set_active > > */ > > There is no rte_eth_process_set_active() > > > +#define RTE_ETH_DEV_CAPA_PROCESS_ROLE RTE_BIT64(5) > > /**@}*/ > > > > /* > > @@ -2204,6 +2206,60 @@ int rte_eth_dev_owner_delete(const uint64_t > > owner_id); > > int rte_eth_dev_owner_get(const uint16_t port_id, > > struct rte_eth_dev_owner *owner); > > > > +/** > > + * @warning > > + * @b EXPERIMENTAL: this API may change without prior notice > > + * > > + * Set the role of the process to active or standby, > > + * affecting network traffic handling. > > + * > > + * If one device does not support this operation or fails, > > + * the whole operation is failed and rolled back. > > + * > > + * It is forbidden to have multiple processes with the same role > > + * unless only one of them is configured to handle the traffic. > > Why is it not allowed to have many processes in standby? > > > + * > > + * The application is active by default. > > + * The configuration from the active process is effective immediately > > + * while the configuration from the standby process is queued by hardware. > > I'm afraid error handling could be tricky in this case. > I think it makes sense to highlight that configuration > propagation failures will be returned on attempt to set > the process active. > > > + * When configuring the device from a standby process, > > + * it has no effect except for below situations: > > + * - traffic not handled by the active process configuration > > + * - no active process > > + * > > + * When a process is changed from a standby to an active role, > > + * all preceding configurations that are queued by hardware > > + * should become effective immediately. > > + * Before role transition, all the traffic handling configurations > > + * set by the active process should be flushed first. > > + * > > + * In summary, the operations are expected to happen in this order > > + * in "old" and "new" applications: > > + * device: already configured by the old application > > + * new: start as active > > + * new: probe the same device > > + * new: set as standby > > + * new: configure the device > > + * device: has configurations from old and new applications > > + * old: clear its device configuration > > + * device: has only 1 configuration from new application > > + * new: set as active > > + * device: downtime for connecting all to the new application > > + * old: shutdown > > + * > > + * @param standby > > + * Role active if false, standby if true. > > Typically API with boolean parameters is bad. May be in this > particular case it is better to have two functions: > rte_eth_process_set_active() and rte_eth_process_set_standby(). Why? It could be an enum as well. > > + * @param flags > > + * Role specific flags. > > Please, define namespace for flags. > > > + * @return > > + * Positive value on success, -rte_errno value on error: > > + * - (> 0) Number of switched devices. > > Isn't is success if there is no devices to switch? It should be >= 0 I guess.