On 06/08, Peter Zijlstra wrote:
>
> On Mon, Jun 08, 2015 at 11:14:17AM +0200, Peter Zijlstra wrote:
> > > Finally. Suppose that timer->function() returns HRTIMER_RESTART
> > > and hrtimer_active() is called right after __run_hrtimer() sets
> > > cpu_base->running = NULL. I can't understand why hrtimer_active()
> > > can't miss ENQUEUED in this case. We have wmb() in between, yes,
> > > but then hrtimer_active() should do something like
> > >
> > >   active = cpu_base->running == timer;
> > >   if (!active) {
> > >           rmb();
> > >           active = state != HRTIMER_STATE_INACTIVE;
> > >   }
> > >
> > > No?
> >
> > Hmm, good point. Let me think about that. It would be nice to be able to
> > avoid more memory barriers.
>
> So your scenario is:
>
>                               [R] seq
>                                 RMB
> [S] ->state = ACTIVE
>   WMB
> [S] ->running = NULL
>                               [R] ->running (== NULL)
>                               [R] ->state (== INACTIVE; fail to observe
>                                            the ->state store due to
>                                            lack of order)
>                                 RMB
>                               [R] seq (== seq)
> [S] seq++
>
> Conversely, if we re-order the (first) seq++ store such that it comes
> first:
>
> [S] seq++
>
>                               [R] seq
>                                 RMB
>                               [R] ->running (== NULL)
> [S] ->running = timer;
>   WMB
> [S] ->state = INACTIVE
>                               [R] ->state (== INACTIVE)
>                                 RMB
>                               [R] seq (== seq)
>
> And we have another false negative.
>
> And in this case we need the read order the other way around, we'd need:
>
>       active = timer->state != HRTIMER_STATE_INACTIVE;
>       if (!active) {
>               smp_rmb();
>               active = cpu_base->running == timer;
>       }
>
> Now I think we can fix this by either doing:
>
>       WMB
>       seq++
>       WMB
>
> On both sides of __run_hrtimer(), or do
>
> bool hrtimer_active(const struct hrtimer *timer)
> {
>       struct hrtimer_cpu_base *cpu_base;
>       unsigned int seq;
>
>       do {
>               cpu_base = READ_ONCE(timer->base->cpu_base);
>               seq = raw_read_seqcount(&cpu_base->seq);
>
>               if (timer->state != HRTIMER_STATE_INACTIVE)
>                       return true;
>
>               smp_rmb();
>
>               if (cpu_base->running == timer)
>                       return true;
>
>               smp_rmb();
>
>               if (timer->state != HRTIMER_STATE_INACTIVE)
>                       return true;
>
>       } while (read_seqcount_retry(&cpu_base->seq, seq) ||
>                cpu_base != READ_ONCE(timer->base->cpu_base));
>
>       return false;
> }

You know, I simply can't convince myself I understand why this code
is correct... or not.

But contrary to what I said before, I agree that we need to recheck
timer->base. This probably needs more discussion; to me it is far from
obvious why we can trust this cpu_base != READ_ONCE() check. Yes,
we have a lot of barriers, but they do not pair with each other. Let's
ignore this for now.

> And since __run_hrtimer() is the more performance critical code, I think
> it would be best to reduce the amount of memory barriers there.

Yes, but wmb() is cheap on x86... Perhaps we can make this code
"obviously correct"?


How about the following... We add cpu_base->seq as before but
limit its "write" scope so that we can use the regular read/retry.

So,

        hrtimer_active(timer)
        {

                do {
                        base = READ_ONCE(timer->base->cpu_base);
                        seq = read_seqcount_begin(&base->seq);

                        if (timer->state & ENQUEUED ||
                            base->running == timer)
                                return true;

                } while (read_seqcount_retry(&base->seq, seq) ||
                         base != READ_ONCE(timer->base->cpu_base));

                return false;
        }
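
To make the read side concrete, here is a minimal userspace sketch of
the loop above using C11 atomics. It is only a model under stated
assumptions: every model_* name is hypothetical, timer->base->cpu_base
is flattened into a single pointer, and sequentially consistent atomics
stand in for the kernel's barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ENQUEUED 1u	/* stands in for HRTIMER_STATE_ENQUEUED */

struct model_timer;

/* Hypothetical stand-in for hrtimer_cpu_base; not the kernel layout. */
struct model_cpu_base {
	atomic_uint seq;			/* odd while a write section is open */
	_Atomic(struct model_timer *) running;	/* timer whose callback runs */
};

/* Hypothetical stand-in for hrtimer, with ->base->cpu_base flattened. */
struct model_timer {
	atomic_uint state;
	_Atomic(struct model_cpu_base *) base;
};

static bool model_hrtimer_active(struct model_timer *timer)
{
	struct model_cpu_base *base;
	unsigned int seq;

	do {
		base = atomic_load(&timer->base);

		/* Like read_seqcount_begin(): wait until no writer is active. */
		while ((seq = atomic_load(&base->seq)) & 1u)
			;

		if ((atomic_load(&timer->state) & MODEL_ENQUEUED) ||
		    atomic_load(&base->running) == timer)
			return true;

		/* Retry if a write section ran or the timer changed base. */
	} while (atomic_load(&base->seq) != seq ||
		 base != atomic_load(&timer->base));

	return false;
}

/* Single-threaded usage: the three stable states a reader can see. */
static void model_usage(void)
{
	struct model_cpu_base base;
	struct model_timer timer;

	atomic_init(&base.seq, 0);
	atomic_init(&base.running, NULL);
	atomic_init(&timer.state, 0);
	atomic_init(&timer.base, &base);
	assert(!model_hrtimer_active(&timer));	/* idle */

	atomic_store(&timer.state, MODEL_ENQUEUED);
	assert(model_hrtimer_active(&timer));	/* queued */

	atomic_store(&timer.state, 0);
	atomic_store(&base.running, &timer);
	assert(model_hrtimer_active(&timer));	/* callback running */
}
```

model_usage() only exercises the stable states; the interesting cases
are the transitions between them, which is what the write sections
have to cover.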

And we need to avoid the races with 2 transitions in __run_hrtimer().

The first race is trivial, we change __run_hrtimer() to do

        write_seqcount_begin(&cpu_base->seq);
        cpu_base->running = timer;
        __remove_hrtimer(timer);        // clears ENQUEUED
        write_seqcount_end(&cpu_base->seq);

and hrtimer_active() obviously can't race with this section.

Then we change enqueue_hrtimer()


        +       bool need_lock = base->cpu_base->running == timer;
        +       if (need_lock)
        +               write_seqcount_begin(&base->cpu_base->seq);
        +
                timer->state |= HRTIMER_STATE_ENQUEUED;
        +
        +       if (need_lock)
        +               write_seqcount_end(&base->cpu_base->seq);


Now. If the timer is re-queued by the time __run_hrtimer() clears
->running we have the following sequence:

        write_seqcount_begin(&cpu_base->seq);
        timer->state |= HRTIMER_STATE_ENQUEUED;
        write_seqcount_end(&cpu_base->seq);

        cpu_base->running = NULL;

and I think this should work equally well, because in this case we do
not care if hrtimer_active() misses "running = NULL".

Yes, we only have this 2nd write_seqcount_begin/end if the timer
re-arms itself, but otherwise we do not race. If another thread calls
hrtimer_start() in between, we can pretend that hrtimer_active() hit
the "inactive" state.
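
The re-arm sequence above can be replayed in a small single-threaded
model. This is only a sketch under stated assumptions: the model_*
names are hypothetical, plain fields replace the kernel types, and the
replay pins one interleaving using the read order from
hrtimer_active() above. It says nothing about memory ordering on real
hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ENQUEUED 1u	/* stands in for HRTIMER_STATE_ENQUEUED */

/* Hypothetical stand-ins; plain fields suffice for a one-thread replay. */
struct model_timer { unsigned int state; };
struct model_cpu_base { unsigned int seq; struct model_timer *running; };

/* The re-arm path: ENQUEUED is set inside a write section. */
static void model_rearm(struct model_cpu_base *base, struct model_timer *timer)
{
	assert(!(base->seq & 1u));	/* no section may already be open */
	base->seq++;			/* write_seqcount_begin() */
	timer->state |= MODEL_ENQUEUED;
	base->seq++;			/* write_seqcount_end() */
}

/*
 * Replay one fixed interleaving: the reader loads ->state, then the writer
 * re-arms and clears ->running, then the reader loads ->running.
 * Returns what the reader would report for the timer.
 */
static bool model_replay_rearm(bool use_seq)
{
	struct model_timer timer = { .state = 0 };
	struct model_cpu_base base = { .seq = 0, .running = &timer };
	bool writer_done = false;
	unsigned int seq;

	do {
		seq = base.seq;				/* read_seqcount_begin() */

		if (timer.state & MODEL_ENQUEUED)
			return true;

		if (!writer_done) {			/* writer runs right here */
			model_rearm(&base, &timer);
			base.running = NULL;		/* outside the section */
			writer_done = true;
		}

		if (base.running == &timer)
			return true;
	} while (use_seq && base.seq != seq);		/* read_seqcount_retry() */

	return false;
}
```

In this model, model_replay_rearm(false) reports the timer as inactive
although it never was, while model_replay_rearm(true) sees the changed
seq, retries once and finds ENQUEUED.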

What do you think?


And note that we can rewrite these 2 "write" critical sections in
__run_hrtimer() and enqueue_hrtimer() as

        cpu_base->running = timer;

        write_seqcount_begin(&cpu_base->seq);
        write_seqcount_end(&cpu_base->seq);

        __remove_hrtimer(timer);

and

        timer->state |= HRTIMER_STATE_ENQUEUED;

        write_seqcount_begin(&cpu_base->seq);
        write_seqcount_end(&cpu_base->seq);

        cpu_base->running = NULL;

So we can probably use write_seqcount_barrier(), except I am not sure
about the 2nd wmb...
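
As a rough illustration of the barrier form, here is a userspace sketch
with C11 atomics. Everything in it is an assumption: the model_* names
are hypothetical, and release/acquire fences only approximate wmb/rmb.
The second fence is exactly the part questioned above; the sketch does
not settle whether the kernel helper provides it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ENQUEUED 1u	/* stands in for HRTIMER_STATE_ENQUEUED */

/* Hypothetical stand-ins for hrtimer and hrtimer_cpu_base. */
struct model_timer { atomic_uint state; };
struct model_cpu_base {
	atomic_uint seq;
	_Atomic(struct model_timer *) running;
};

/* An empty begin/end pair collapsed into one step: seq moves by 2. */
static void model_seq_barrier(struct model_cpu_base *base)
{
	atomic_thread_fence(memory_order_release);	/* approximates the 1st wmb */
	atomic_fetch_add_explicit(&base->seq, 2, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* approximates the 2nd wmb */
}

/* The second rewritten sequence: enqueue, bump seq, clear ->running. */
static void model_rearm_then_finish(struct model_cpu_base *base,
				    struct model_timer *timer)
{
	atomic_fetch_or_explicit(&timer->state, MODEL_ENQUEUED,
				 memory_order_relaxed);
	model_seq_barrier(base);
	atomic_store_explicit(&base->running, NULL, memory_order_relaxed);
}

/* Reader-side retry check against a previously sampled seq. */
static bool model_seq_changed(struct model_cpu_base *base, unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);	/* approximates rmb */
	return atomic_load_explicit(&base->seq, memory_order_relaxed) != seq;
}

/* Single-threaded usage: a reader that sampled seq earlier must retry. */
static void model_usage(void)
{
	struct model_timer timer;
	struct model_cpu_base base;
	unsigned int seq;

	atomic_init(&timer.state, 0);
	atomic_init(&base.seq, 0);
	atomic_init(&base.running, &timer);

	seq = atomic_load(&base.seq);		/* reader samples seq */
	model_rearm_then_finish(&base, &timer);

	assert((atomic_load(&base.seq) & 1u) == 0);	/* seq is never odd */
	assert(model_seq_changed(&base, seq));		/* reader must retry */
}
```

Since seq never becomes odd in this form, a reader has nothing to wait
for at the start of the loop and relies only on the retry check.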

Oleg.
