On Wed, Mar 04, 2026 at 09:38:50AM -0800, Brian Bunker wrote:
> Ben and Martin,
> 
> Sorry for the delayed response. I sent this to you both initially instead of
> to the list because I thought you might have some insight into the issue.
> I have an 'alua' checker that I will post where I found this issue. I
> didn't actually
> find it in the 'tur' checker but I could see it is prone to the same
> issue. I have a
> test I created to show the issue called test_tur.c. I will send this
> too if it helps.

If you have a test that can trigger the issue you are seeing, that would
be really helpful it seeing what's going on here.

-Ben

> The problem when I didn't use this patch was that my checker the entire
> checker thread would stall. It didn't just correct itself. Until other
> paths failed
> the checker thread was deadlocked somehow.
> 
> In the 'tur' checker without this change, the 'running != 0' will lead
> to PATH_PENDING
> and MSG_TUR_RUNNING without looking at what those actual values are.
> It might actually
> be finished, but running hasn't changed to 0 yet, and ct->thread is not
> 0 on returning. I am not sure what extra conditions were true when my checker
> thread deadlocks, but with this change the deadlock never happens.
> 
> Thanks,
> Brian
> 
> On Fri, Feb 27, 2026 at 12:36 AM Martin Wilck <[email protected]> wrote:
> >
> > On Thu, 2026-02-26 at 13:48 -0500, Benjamin Marzinski wrote:
> > > On Wed, Feb 25, 2026 at 11:00:44PM +0100, Martin Wilck wrote:
> > > >
> > > > I'm still undecided if it's worth changing this code in order to
> > > > reinstate some paths 1s earlier in certain sitautions.
> > > >
> > > > Thoughts?
> > >
> > > I don't see any problems with this. But I also didn't see any real
> > > problems with Brian's version (although this version does remove most
> > > of
> > > my theoretic issues. Like I said before. If we hang in the condlog,
> > > we've got bigger problems than a stray checker thread). If we do
> > > this,
> > > we should probably document the uatomic_add_return() call,
> >
> > Of course, this was not an official patch submission, just an idea how
> > we might improve this code. It's not exactly the same thing; Brian's
> > patch meant simply ignoring ct->running, while mine would make the ct-
> > >running check reliable, albeit with changed semantics.
> >
> > >  or just use
> > > cmm_smp_mb() directly. But I still not sure what we get out of this,
> > > other than a very slight chance to finish the checker a second
> > > sooner.
> >
> > Yes. But it's also a fact that the simple atomic_read() test in the
> > current code isn't reliable. It can well return true when ct->running
> > has already been set to false. I guess the implementation tends to be
> > as light-weight as possible, avoiding a memory barrier (or taking a
> > lock with the side effect of a memory barrier) in the "fast path". That
> > comes with the price that we may miss an already finished checker
> > thread.
> >
> > > don't think that really makes a difference, but I don't really see
> > > much
> > > risk in the patch either, so I guess I'm fine either way.
> >
> > If the risk was truly zero, I would agree. But experience tells that
> > it's very easy to get these things wrong for corner cases, and I tend
> > to want to avoid even a minimal risk just for "very slight chance to
> > finish the checker a second sooner", as you've expressed it very
> > nicely.
> >
> > @Brian, feel free to try to convince us otherwise. It might be helpful
> > if we understood your use case better.
> >
> > Martin
> 
> 
> 
> -- 
> Brian Bunker
> PURE Storage, Inc.
> [email protected]


Reply via email to