On Tue, Dec 8, 2020 at 2:38 PM Jakub Kicinski <k...@kernel.org> wrote:
>
> On Sat, 5 Dec 2020 18:43:54 -0500 Jarod Wilson wrote:
> > I'm seeing a system get stuck unable to bring a downed interface back up
> > when it's got an updelay value set, behavior which ceased when logging
> > spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
> > system over another network connection, and it seems that the act of
> > spewing logs at all there increases rtnl lock contention, because
> > instrumented code showed bond_mii_monitor() never able to succeed in it's
> > attempts to call rtnl_trylock() to actually commit link state changes,
> > leaving the downed link stuck in BOND_LINK_DOWN. The system in question
> > appears to be fine with the log spew being moved to
> > bond_commit_link_state(), which is called after the successful
> > rtnl_trylock().
>
> But it's not called under rtnl_lock AFAICT. So something else is also
> spewing messages?
>
> While bond_commit_link_state() _is_ called under the lock. So you're
> increasing the retry rate, by putting the slow operation under the
> lock, is that right?

Partially, yes. I probably should have tagged this with RFC instead of
PATCH, tbh. My theory was that the log spew, being sent out over *other*
network interfaces when the system is monitored via remote syslog or
ssh, was potentially causing some rtnl_lock() calls, so not spewing until
after actually being able to grab the lock would lessen the problem with
acquiring the lock, but I ... don't know offhand how to verify that
theory.

> Also isn't bond_commit_link_state() called from many more places?
> So we're adding new prints, effectively?

Ah. Crap. Yes. bond_set_slave_link_state() is called from quite a few
places, and that in turn calls bond_commit_link_state().

> > I'm actually wondering if perhaps we ultimately need/want
> > some bond-specific lock here to prevent racing with bond_close() instead
> > of using rtnl, but this shift of the output appears to work. I believe
> > this started happening when de77ecd4ef02 ("bonding: improve link-status
> > update in mii-monitoring") went in, but I'm not 100% on that.
> >
> > The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
> > separate from the fix for the actual hang, but it eliminates a constant
> > "invalid new link 3 on slave" message seen related to this issue, and it's
> > not actually an invalid state here, so we shouldn't be reporting it as an
> > error.
>
> Let's make it a separate patch, then.

Sounds like Jay is confident that bit is valid, and I shouldn't be
ending up in that state, unless something else is going wrong.

-- 
Jarod Wilson
ja...@redhat.com
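
To make the failure mode described above concrete, here is a rough userspace model of the inspect-then-trylock pattern. It is only a sketch and not kernel code: sim_lock, sim_trylock(), monitor_pass() and passes_until_up() are hypothetical names invented for illustration, and the link states are simplified. It assumes the monitor proposes a state change without the lock, commits it only if the trylock succeeds, and otherwise retries on the next pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the RTNL mutex: held == true means some
 * other task currently owns it. */
struct sim_lock { bool held; };

static bool sim_trylock(struct sim_lock *l)
{
    if (l->held)
        return false;          /* contended: caller must retry later */
    l->held = true;
    return true;
}

static void sim_unlock(struct sim_lock *l) { l->held = false; }

/* Simplified link states; the real driver has more. */
enum link_state { LINK_DOWN, LINK_UP };

struct sim_slave {
    enum link_state link;      /* committed state */
    enum link_state proposed;  /* state proposed by the inspect step */
    int commits;               /* number of successful commits */
};

/* One monitor pass: inspect without the lock, commit only under it.
 * Returns false when the commit had to be deferred. */
static bool monitor_pass(struct sim_lock *rtnl, struct sim_slave *s,
                         bool carrier_ok)
{
    if (!carrier_ok || s->link == LINK_UP)
        return true;           /* nothing to commit */
    s->proposed = LINK_UP;

    if (!sim_trylock(rtnl))
        return false;          /* lock busy: state stays LINK_DOWN */

    s->link = s->proposed;     /* commit happens only under the lock */
    s->commits++;
    sim_unlock(rtnl);
    return true;
}

/* contended[i] says whether another holder owns the lock during pass i.
 * Returns the pass on which the link came up, or -1 if it never did. */
int passes_until_up(const bool *contended, int n)
{
    struct sim_lock rtnl = { false };
    struct sim_slave s = { LINK_DOWN, LINK_DOWN, 0 };

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        rtnl.held = contended[i];
        (void)monitor_pass(&rtnl, &s, true);
        if (s.link == LINK_UP)
            return i + 1;
    }
    return -1;                 /* every trylock failed: link stuck down */
}
```

In this model, if the lock is contended on every pass the link never leaves LINK_DOWN, which matches the stuck interface reported above. It says nothing about *why* the real lock would be contended, which is the part of the theory that remains unverified.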