On 2/28/25 12:31, Cosmin Ratiu wrote: > On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: >> On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: >>>>> One more thing - note I'm not an xfrm expert by far but it >>>>> seems to me here you have >>>>> to also callĀ xdo_dev_state_free() with the old active slave >>>>> dev otherwise that will >>>>> never get called with the original real_dev after the switch to >>>>> a new >>>>> active slave (or more accurately it might if the GC runs >>>>> between the switching >>>>> but it is a race), care must be taken wrt sequence of events >>>>> because the XFRM >>>> >>>> Can we just call xs->xso.real_dev->xfrmdev_ops- >>>>> xdo_dev_state_free(xs) >>>> no matter xs->xso.real_dev == real_dev or not? I'm afraid calling >>>> xdo_dev_state_free() every where may make us lot more easily. >>>> >>> >>> You'd have to check all drivers that implement the callback to >>> answer that and even then >>> I'd stick to the canonical way of how it's done in xfrm and make >>> the bond just passthrough. >>> Any other games become dangerous and new code will have to be >>> carefully reviewed every >>> time, calling another device's free_sa when it wasn't added before >>> doesn't sound good. >>> >>>>> GC may be running in parallel which probably means that in >>>>> bond_ipsec_free_sa() >>>>> you'll have to take the mutex before calling >>>>> xdo_dev_state_free() and check >>>>> if the entry is still linked in the bond's ipsec list before >>>>> calling the free_sa >>>>> callback, if it isn't then del_sa_all got to it before the GC >>>>> and there's nothing >>>>> to do if it also called the dev's free_sa callback. The check >>>>> for real_dev doesn't >>>>> seem enough to protect against this race. >>>> >>>> I agree that we need to take the mutex before calling >>>> xdo_dev_state_free() >>>> in bond_ipsec_free_sa(). Do you think if this is enough? I'm a >>>> bit lot here. >>>> >>>> Thanks >>>> Hangbin >>> >>> Well, the race is between the xfrm GC and del_sa_all, in bond's >>> free_sa if you >>> walk the list under the mutex before calling real_dev's free >>> callback and >>> don't find the current element that's being freed in free_sa then >>> it was >>> cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk >>> that >>> list and clean the entries. I think it should be fine as long as >>> free_sa >>> was called once with the proper device. >> >> OK, so the free will be called either in del_sa_all() or free_sa(). >> Something like this? >> > [...] > > Unfortunately, after applying these changes and reasoning about them > for a bit, I don't think this will work. There are still races left. > For example: > 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and > before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all > is called in parallel, doesn't call delete on xs (because it's dead), > then calls free (incorrect without delete first), then removes the list > entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, > and calls delete (incorrect, out of order with free). Finally, > bond_ipsec_free_sa is called, which fortunately doesn't do anything > silly in the new proposed form because xs is no longer in the list. > > 2. A more sinister form of the above race can happen when > bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and > immediately after __xfrm_state_delete marks xs as DEAD and calls > bond_ipsec_del_sa() which happily calls delete on real_dev again. > > In order to fix these races (and others like it), I think > bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- >> lock for each xs being processed. This would prevent xfrm from > concurrently initiating add/delete operations on the managed states. > > Cosmin.
Duh, right you are. The state is protected by x->lock and cannot be trusted outside of it. If you take x->lock inside the list walk with the mutex held you can deadlock. Cheers, Nik