When tracing the behavior of multiple fio jobs running in parallel our performance team observed that some scsi_dispatch_cmd_done events appeared to occur earlier, often several microseconds earlier, than their associated scsi_dispatch_cmd_start event in the trace records. Other interrupt events were also observed to have earlier timestamps than the events that caused them.
This incorrect chronological ordering of trace records occurs because ALL nested writes have the same time stamp as the first writer, that is the first writer in the stack that was preempted. In workloads where interrupts occur frequently, the first writer can stay preempted across multiple interrupts and nested trace events can record time stamps that are many microseconds earlier than their actual value resulting in the wrong ordering. For example, a scsi_dispatch_cmd_start on CPU A occurs, say at time t1, and the corresponding scsi_dispatch_cmd_done occurs on another CPU B, say at time t2. However, CPU B is in the middle of a nested write with the first writer having recorded its event at time t0, where t0 < t1 < t2. Eventually, we get out of all the nested writes and the first writer commits its trace event with a time of t0 and the the nested scsi_dispatch_cmd_done gets assigned the same timestamp t0. In the trace report, the scsi_dispatch_cmd_done thus appears to have occurred before the scsi_dispatch_cmd_start. In some cases, on larger systems with multiple fio jobs running and interrupts being sent to a single CPU, we have noticed more than 400 trace events all with the same timestamp. The reason all nested writes have the same time stamp is because all nested trace events are assigned a time delta of 0. A correct time delta cannot be computed for them because the first interrupted write has not been committed yet and so the commit timestamp in the CPU buffer does not record the time stamp of the just prior event. One way of fixing this is to keep a timestamp of the last event we reserved space for in the ring buffer, so that we can compute the correct time delta for each event. We also need a short critical section where we cannot be interrupted, in which we: 1. Read the current time 2. Compute the time delta for this event from the last time stamp 3. Update the value of the last time stamp 4. Reserve space in the ring buffer for the event, Although the ring buffer design is carefully coded to avoid disabling interrupts, in this case there does not appear to be a practical way to solve this problem without disabling interrupts for a short time. To accommodate those architectures where disabling interrupts is expensive, a kernel tunable (/sys/kernel/debug/tracing/nested_precise_ts) needs to be set to 1 to enable this behavior and get accurate timestamps. Signed-off-by: Suresh Warrier <warr...@linux.vnet.ibm.com> --- kernel/trace/ring_buffer.c | 71 +++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 67 insertions(+), 4 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index c9b3005..0a2d862 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -482,6 +482,7 @@ struct ring_buffer_per_cpu { unsigned long read_bytes; u64 write_stamp; u64 read_stamp; + u64 last_stamp; /* ring buffer pages to update, > 0 to add, < 0 to remove */ int nr_pages_to_update; struct list_head new_pages; /* new pages to add */ @@ -2010,7 +2011,8 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer, { /* Only a commit updates the timestamp */ if (unlikely(!rb_event_is_commit(cpu_buffer, event))) - delta = 0; + if (!rb_precise_nested_write_ts()) + delta = 0; /* * If we need to add a timestamp, then we @@ -2534,6 +2536,60 @@ void rb_disable_precise_nested_write_ts(void) static_key_slow_dec(&__precise_nested_write_ts); } +/** + * get_write_timestamp: + * Must be called before we read the current time stamp + * It returns a pointer to the location of the last + * time stamp to be used in the delta calculation. + * If precise nested write timestamps are enabled, it first + * disables interrupts on the current processor so that + * we can reserve space on the buffer and save the event's + * timestamp without being preempted. + * + * put_write_timestamp: + * Must be called after we reserve space in the ring buffer. + * If precise nested write timestamps are enabled, it saves the + * timestamp in the specified timestamp location (passed in + * as *pstamp) so that the nested writers always have a valid + * timestamp to compute the timestamp deltas for their events. + * This must be done before we re-enable interrupts. + * + */ +static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer, + unsigned long *flags) +{ + if (rb_precise_nested_write_ts()) { + /* + * Ensure that we are not preempted until after we update + * the write timestamp. + */ + local_irq_save(*flags); + return &cpu_buffer->last_stamp; + } else { + return &cpu_buffer->write_stamp; + } +} + +static void put_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer, + u64 *pstamp, + u64 ts, + unsigned long *flags) +{ + /* + * Rather than make the rb_precise_nested_write_ts() + * call (this would be a NOP for the default case) to check + * if we need to restore interrupts, we need to explicitly + * test if pstamp points to the last_stamp field below to + * handle the case where the nested_precise_ts sysfs tunable + * was updated between the get_write_timestamp() and + * put_write_timestamp() routines. + */ + if (&cpu_buffer->last_stamp == pstamp) { + *pstamp = ts; + local_irq_restore(*flags); + } +} + static struct ring_buffer_event * rb_reserve_next_event(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer, @@ -2544,6 +2600,8 @@ rb_reserve_next_event(struct ring_buffer *buffer, int nr_loops = 0; int add_timestamp; u64 diff; + u64 *pstamp; + unsigned long flags = 0; /* GCC */ rb_start_commit(cpu_buffer); @@ -2579,14 +2637,16 @@ rb_reserve_next_event(struct ring_buffer *buffer, if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)) goto out_fail; + pstamp = get_write_timestamp(cpu_buffer, &flags); + ts = rb_time_stamp(cpu_buffer->buffer); - diff = ts - cpu_buffer->write_stamp; + diff = ts - *pstamp; /* make sure this diff is calculated here */ barrier(); /* Did the write stamp get updated already? */ - if (likely(ts >= cpu_buffer->write_stamp)) { + if (likely(ts >= *pstamp)) { delta = diff; if (unlikely(test_time_stamp(delta))) { int local_clock_stable = 1; @@ -2597,7 +2657,7 @@ rb_reserve_next_event(struct ring_buffer *buffer, KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s", (unsigned long long)delta, (unsigned long long)ts, - (unsigned long long)cpu_buffer->write_stamp, + (unsigned long long)*pstamp, local_clock_stable ? "" : "If you just came from a suspend/resume,\n" "please switch to the trace global clock:\n" @@ -2608,6 +2668,9 @@ rb_reserve_next_event(struct ring_buffer *buffer, event = __rb_reserve_next(cpu_buffer, length, ts, delta, add_timestamp); + + put_write_timestamp(cpu_buffer, pstamp, ts, &flags); + if (unlikely(PTR_ERR(event) == -EAGAIN)) goto again; -- 1.8.3.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/