On Mon, 18 Dec 2023 23:07:12 -0500
Steven Rostedt <rost...@goodmis.org> wrote:

> From: "Steven Rostedt (Google)" <rost...@goodmis.org>
> 
> To synchronize the timestamps with the ring buffer reservation, there are
> two timestamps that are saved in the buffer meta data.
> 
> 1. before_stamp
> 2. write_stamp
> 
> When the two are equal, the write_stamp is considered valid, as in, it may
> be used to calculate the delta of the next event as the write_stamp is the
> timestamp of the previous reserved event on the buffer.
> 
> This is done by the following:
> 
>  /*A*/        w = current position on the ring buffer
>       before = before_stamp
>       after = write_stamp
>       ts = read current timestamp
> 
>       if (before != after) {
>               write_stamp is not valid, force adding an absolute
>               timestamp.

(additional comment)
This happens if this caller interrupts another reservation process's B to E.
(thus the original one only updates "before_stamp", but "write_stamp" is old.)

>       }
> 
>  /*B*/        before_stamp = ts
> 
>  /*C*/        write = local_add_return(event length, position on ring buffer)
> 
>       if (w == write - event length) {
>               /* Nothing interrupted between A and C */
>  /*E*/                write_stamp = ts;
>               delta = ts - after
>               /*
>                * If nothing interrupted again,
>                * before_stamp == write_stamp and write_stamp
>                * can be used to calculate the delta for
>                * events that come in after this one.
>                */

(additional comment)
If something interrupts between C and E, the write_stamp goes backward
because interrupted one updates the before_stamp and write_stamp with
new timestamp. But in this case, write_stamp(=ts) != before_stamp(=new ts).
Thus anyway the next entry will use absolute time stamp forcibly.

>       } else {
> 
>               /*
>                * The slow path!
>                * Was interrupted between A and C.
>                */
> 
> This is the place that there's a bug. We currently have:
> 
>               after = write_stamp
>               ts = read current timestamp
> 
>  /*F*/                if (write == current position on the ring buffer &&
>                   after < ts && cmpxchg(write_stamp, after, ts)) {
> 
>                       delta = ts - after;
> 
>               } else {
>                       delta = 0;
>               }
> 
> The assumption is that if the current position on the ring buffer hasn't
> moved between C and F, then it also was not interrupted, and that the last
> event written has a timestamp that matches the write_stamp. That is the
> write_stamp is valid.
> 
> But this may not be the case:
> 
> If a task context event was interrupted by softirq between B and C.
> 
> And the softirq wrote an event that got interrupted by a hard irq between
> C and E.
> 
> and the hard irq wrote an event (does not need to be interrupted)
> 
> We have:
> 
>  /*B*/ before_stamp = ts of normal context
> 
>    ---> interrupted by softirq
> 
>       /*B*/ before_stamp = ts of softirq context
> 
>         ---> interrupted by hardirq
> 
>               /*B*/ before_stamp = ts of hard irq context
>               /*E*/ write_stamp = ts of hard irq context
> 
>               /* matches and write_stamp valid */
>         <----
> 
>       /*E*/ write_stamp = ts of softirq context
> 
>       /* No longer matches before_stamp, write_stamp is not valid! */
> 
>    <---
> 
>  w != write - length, go to slow path
> 
> // Right now the order of events in the ring buffer is:
> //
> // |-- softirq event --|-- hard irq event --|-- normal context event --|
> //
> 
>  after = write_stamp (this is the ts of softirq)
>  ts = read current timestamp
> 
>  if (write == current position on the ring buffer [true] &&
>      after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
> 
>       delta = ts - after  [Wrong!]
> 
> The delta is to be between the hard irq event and the normal context
> event, but the above logic made the delta between the softirq event and
> the normal context event, where the hard irq event is between the two. This
> will shift all the remaining event timestamps on the sub-buffer
> incorrectly.
> 
> The write_stamp is only valid if it matches the before_stamp. The cmpxchg
> does nothing to help this.

That's right. Even if someone interrupts between A and C, we can write
the ts to write_stamp. In this case, write_stamp(old) != before_stamp(new)
so the next entry will forcibly record the absolute timestamp.
In this case, what we need to do here is using the absolute timestamp instead
of delta.

> 
> Instead, the following logic can be done to fix this:
> 
>       before = before_stamp
>       ts = read current timestamp
>       before_stamp = ts
> 
>       after = write_stamp
> 
>       if (write == current position on the ring buffer &&
>           after == before && after < ts) {
> 
>               delta = ts - after
> 
>       } else {
>               delta = 0;
>       }

Ah, this is a good idea. Instead of using the old timestamp, use
delta = 0, thus it will reuse the interrupted timestamp if someone
interrupts between A and C.

Yeah the above works, but my question is, do we really need this
really slow path? I mean;

>       if (w == write - event length) {
>               /* Nothing interrupted between A and C */
>  /*E*/                write_stamp = ts;
>               delta = ts - after

        } else {
                /*
                  Something interrupted between A and C, which should record
                  a new entry before this reserved entry with newer timestamp.
                  we reuse it.
                 */
                ts = after = write_stamp;
                delta = 0;
        }

Isn't this enough?

Thank you,

> 
> The above will only use the write_stamp if it still matches before_stamp
> and was tested to not have changed since C.
> 
> As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
> 
> This mean the 32-bit rb_time_t workaround can finally be removed. But
> that's for a later time.
> 
> Link: 
> https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3...@gandalf.local.home/
> 
> Cc: sta...@vger.kernel.org
> Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
> Signed-off-by: Steven Rostedt (Google) <rost...@goodmis.org>
> ---
>  kernel/trace/ring_buffer.c | 79 ++++++++++++--------------------------
>  1 file changed, 24 insertions(+), 55 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 5a114e752f11..e7055f52afe0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -700,48 +700,6 @@ rb_time_read_cmpxchg(local_t *l, unsigned long expect, 
> unsigned long set)
>       return local_try_cmpxchg(l, &expect, set);
>  }
>  
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> -     unsigned long cnt, top, bottom, msb;
> -     unsigned long cnt2, top2, bottom2, msb2;
> -     u64 val;
> -
> -     /* Any interruptions in this function should cause a failure */
> -     cnt = local_read(&t->cnt);
> -
> -     /* The cmpxchg always fails if it interrupted an update */
> -      if (!__rb_time_read(t, &val, &cnt2))
> -              return false;
> -
> -      if (val != expect)
> -              return false;
> -
> -      if ((cnt & 3) != cnt2)
> -              return false;
> -
> -      cnt2 = cnt + 1;
> -
> -      rb_time_split(val, &top, &bottom, &msb);
> -      msb = rb_time_val_cnt(msb, cnt);
> -      top = rb_time_val_cnt(top, cnt);
> -      bottom = rb_time_val_cnt(bottom, cnt);
> -
> -      rb_time_split(set, &top2, &bottom2, &msb2);
> -      msb2 = rb_time_val_cnt(msb2, cnt);
> -      top2 = rb_time_val_cnt(top2, cnt2);
> -      bottom2 = rb_time_val_cnt(bottom2, cnt2);
> -
> -     if (!rb_time_read_cmpxchg(&t->cnt, cnt, cnt2))
> -             return false;
> -     if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
> -             return false;
> -     if (!rb_time_read_cmpxchg(&t->top, top, top2))
> -             return false;
> -     if (!rb_time_read_cmpxchg(&t->bottom, bottom, bottom2))
> -             return false;
> -     return true;
> -}
> -
>  #else /* 64 bits */
>  
>  /* local64_t always succeeds */
> @@ -755,11 +713,6 @@ static void rb_time_set(rb_time_t *t, u64 val)
>  {
>       local64_set(&t->time, val);
>  }
> -
> -static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
> -{
> -     return local64_try_cmpxchg(&t->time, &expect, set);
> -}
>  #endif
>  
>  /*
> @@ -3610,20 +3563,36 @@ __rb_reserve_next(struct ring_buffer_per_cpu 
> *cpu_buffer,
>       } else {
>               u64 ts;
>               /* SLOW PATH - Interrupted between A and C */
> -             a_ok = rb_time_read(&cpu_buffer->write_stamp, &info->after);
> -             /* Was interrupted before here, write_stamp must be valid */
> +
> +             /* Save the old before_stamp */
> +             a_ok = rb_time_read(&cpu_buffer->before_stamp, &info->before);
>               RB_WARN_ON(cpu_buffer, !a_ok);
> +
> +             /*
> +              * Read a new timestamp and update the before_stamp to make
> +              * the next event after this one force using an absolute
> +              * timestamp. This is in case an interrupt were to come in
> +              * between E and F.
> +              */
>               ts = rb_time_stamp(cpu_buffer->buffer);
> +             rb_time_set(&cpu_buffer->before_stamp, ts);
> +
> +             barrier();
> + /*E*/               a_ok = rb_time_read(&cpu_buffer->write_stamp, 
> &info->after);
> +             /* Was interrupted before here, write_stamp must be valid */
> +             RB_WARN_ON(cpu_buffer, !a_ok);
>               barrier();
> - /*E*/               if (write == (local_read(&tail_page->write) & 
> RB_WRITE_MASK) &&
> -                 info->after < ts &&
> -                 rb_time_cmpxchg(&cpu_buffer->write_stamp,
> -                                 info->after, ts)) {
> -                     /* Nothing came after this event between C and E */
> + /*F*/               if (write == (local_read(&tail_page->write) & 
> RB_WRITE_MASK) &&
> +                 info->after == info->before && info->after < ts) {
> +                     /*
> +                      * Nothing came after this event between C and F, it is
> +                      * safe to use info->after for the delta as it
> +                      * matched info->before and is still valid.
> +                      */
>                       info->delta = ts - info->after;
>               } else {
>                       /*
> -                      * Interrupted between C and E:
> +                      * Interrupted between C and F:
>                        * Lost the previous events time stamp. Just set the
>                        * delta to zero, and this will be the same time as
>                        * the event this event interrupted. And the events that
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) <mhira...@kernel.org>

Reply via email to