On 11/28/23 16:27, Steven Rostedt wrote:
> On Tue, 28 Nov 2023 16:05:00 +0100
> Petr Pavlu <petr.pa...@suse.com> wrote:
> 
>> On 11/27/23 18:41, Steven Rostedt wrote:
>>> On Mon, 27 Nov 2023 16:12:47 +0100
>>> Petr Pavlu <petr.pa...@suse.com> wrote:
>>>   
>>>> The following warning appears when using buffered events:
>>>> [  203.556451] WARNING: CPU: 53 PID: 10220 at 
>>>> kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420  
>>>
>>> Hmm, I don't have a waring on line 3912, do you have extra code (debugging)
>>> in your version?  
>>
>> The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
>> code and RB_WARN_ON():
> 
> Interesting.
> 
>>   3895  void ring_buffer_discard_commit(struct trace_buffer *buffer,
>>   3896                                  struct ring_buffer_event *event)
>>   3897  {
>> [...]
>>   3908           * This must only be called if the event has not been
>>   3909           * committed yet. Thus we can assume that preemption
>>   3910           * is still disabled.
>>   3911           */
>>   3912          RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
>>   3913  
>>   3914          rb_decrement_entry(cpu_buffer, event);
>>   3915          if (rb_try_to_discard(cpu_buffer, event))
>>   3916                  goto out;
>>
>> https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912
>>
>>>> [...]
>>>> [  203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G         
>>>>    E      6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
> 
> I was looking at the above 6.7.0-rc2-default, which made me somewhat confused.
> 
> 
>>>> [  203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS 
>>>> GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>>>> [  203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>>>> [  203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 
>>>> 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 
>>>> <0f> 0b e9 77 
> 
>>>> * Function trace_buffered_event_enable() initializes memory of each
>>>>   allocated trace_buffered_event to zero but no memory barrier is
>>>>   present to ensure this operation is completed before publishing its
>>>>   pointer for use in trace_event_buffer_lock_reserve().  
>>>
>>> Yes, a wmb() may be helpful, but as the reader will see either the
>>> allocated page or NULL, and can handle both cases. I decided not to add it
>>> because it really doesn't matter if we do.  
>>
>> My concern is about the earlier call "memset(event, 0, sizeof(*event));"
>> in trace_buffered_event_enable() which writes the obtained event page
>> prior to publishing it via "per_cpu(trace_buffered_event, cpu) = event;".
>> I think another CPU running trace_event_buffer_lock_reserve() could see
>> already the valid event pointer, write to the event page and only then
>> the memory could see the memset() write coming from
>> trace_buffered_event_enable(). The event data would then become
>> corrupted.
> 
> OK, I'll buy that. Feel free to send an patch that adds that wmb with a
> comment explaining the above.

Sure, I can do that. A nice thing about my proposed patch is that it
requires only one smp_wmb() in this case, while the current code will
require adding this barrier in the allocation loop prior to writing
trace_buffered_event.

>>>   
>>>> * Calling function trace_buffered_event_enable() normally requires
>>>>   pairing it with trace_buffered_event_disable(). However, the function
>>>>   has no error return code and in case of a failure decrements back
>>>>   trace_buffered_event_ref. This results in underflow of the counter
>>>>   when trace_buffered_event_disable() gets called later.  
>>>
>>> Not an issue.
>>>
>>> Even on failure the ref count will be greater than zero, where 
>>> trace_buffered_event_disable() handles it properly, and the freeing can
>>> handle no pages being allocated.
>>>
>>> That is:
>>>
>>>     free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>>>
>>> would be the same as: free_page((unsigned long)NULL);
>>>
>>> Which is perfectly fine to do.  
>>
>> The situation that I have on my mind is the following:
>> * The counter trace_buffered_event_ref is at 0.
>> * The soft mode gets enabled for some event and
>>   trace_buffered_event_enable() is called. The function increments
>>   trace_buffered_event_ref to 1 and starts allocating event pages.
>> * The allocation fails for some page and trace_buffered_event_disable()
>>   is called for cleanup.
>> * Function trace_buffered_event_disable() decrements
>>   trace_buffered_event_ref back to 0, recognizes that it was the last
>>   use of buffered events and frees all allocated pages.
>> * The control goes back to trace_buffered_event_enable() which returns.
>>   The caller of trace_buffered_event_enable() has no information that
>>   the function actually failed.
>> * Some time later, the soft mode is disabled for the same event.
>>   Function trace_buffered_event_disable() is called. It warns on
>>   "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.
>>
>> The end result is only an ugly warning but I think it is still good to
>> fix.
> 
> Ah, I missed the call to trace_buffered_event_disable() in the failure
> path. Actually, since the trace_buffered_event is only an optimization, I
> think the best thing to do is to just return without the clean up,
> expecting that the other caller will do the clean up. We could even not
> even stop the loop!
> 
> That is, just have:
> 
> void trace_buffered_event_enable(void)
> {
>       struct ring_buffer_event *event;
>       struct page *page;
>       int cpu;
> 
>       WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
>       if (trace_buffered_event_ref++)
>               return;
> 
>       for_each_tracing_cpu(cpu) {
>               page = alloc_pages_node(cpu_to_node(cpu),
>                                       GFP_KERNEL | __GFP_NORETRY, 0);
> 
>               /* This is just an optimization and can handle failures */
>               if (!page) {
>                       pr_err("Failed to create event buffer\n");
>                       break;
>               }
> 
>               event = page_address(page);
>               memset(event, 0, sizeof(*event));
> 
>               per_cpu(trace_buffered_event, cpu) = event;
> 
>               preempt_disable();
>               if (cpu == smp_processor_id() &&
>                   __this_cpu_read(trace_buffered_event) !=
>                   per_cpu(trace_buffered_event, cpu))
>                       WARN_ON_ONCE(1);
>               preempt_enable();
>       }
> }

The current behavior is that if the allocation in one call to
trace_buffered_event_enable() fails then another call to the same
function results in a new attempt to allocate the event pages.

The code above changes it. As long trace_buffered_event_ref doesn't go
back to 0 and then 1 again, no new allocation attempt is made. It means
that trace_buffered_event can be left in this "half-initialized" state
for a while.

This is possibly ok, it is an unlikely corner case anyway. My patch
nonetheless preserves the current behavior of retrying the allocation.

> 
>>>
>>> Now, the reason you found the crash was from the first issue you mentioned.
>>> That is, we didn't disable the current CPU and if we migrated, things would
>>> get really messed up. The only fix here is to make sure all CPUs have their
>>> trace_buffered_event_cnt get incremented.  
>>
>> The potential race that I see is the following:
>> * Function trace_buffered_event_disable() is called on CPU 0. It
>>   increments trace_buffered_event_cnt on each CPU and waits via
>>   synchronize_rcu() for each user of trace_buffered_event to complete.
> 
>> * After synchronize_rcu() is already finished, the situation is that all
>>   counters trace_buffered_event_cnt are at 1 and all pointers
>>   trace_buffered_event are still valid.
> 
>> * At this point, on a different CPU 1, the execution reaches
>>   trace_event_buffer_lock_reserve(). The function calls
>>   preempt_disable_notrace() and enters an RCU read-side critical section
>>   but that is not important because all RCU stuff is already performed
>>   and completed in trace_buffered_event_disable(). The function proceeds
>>   and reads a still valid pointer from trace_buffered_event[CPU1] into
>>   the local variable "entry". However, it doesn't yet read
>>   trace_buffered_event_cnt[CPU1] which happens later.
> 
>> * Function trace_buffered_event_disable() continues. It frees
>>   trace_buffered_event[CPU1] and decrements
>>   trace_buffered_event_cnt[CPU1] back to 0.
> 
>> * Function trace_event_buffer_lock_reserve() continues. It reads and
>>   increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
>>   believe that it can use the "entry" that it already obtained but the
>>   pointer is now invalid and any access results in a use-after-free.
> 
> OK, so you are saying we need another synchronize_rcu() after we set the
> trace_buffered_event to NULL?
> 
> 
> void trace_buffered_event_disable(void)
> {
>       int cpu;
> 
>       WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
> 
>       if (WARN_ON_ONCE(!trace_buffered_event_ref))
>               return;
> 
>       if (--trace_buffered_event_ref)
>               return;
> 
>       preempt_disable();
>       /* For each CPU, set the buffer as used. */
>       smp_call_function_many(tracing_buffer_mask,
>                              disable_trace_buffered_event, NULL, 1);
>       preempt_enable();
> 
>       /* Wait for all current users to finish */
>       synchronize_rcu();
> 
>       for_each_tracing_cpu(cpu) {
>               free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>               per_cpu(trace_buffered_event, cpu) = NULL;
>       }
>       /*
>        * Make sure trace_buffered_event is NULL before clearing
>        * trace_buffered_event_cnt.
>        */
> -     smp_wmb();
> +     synchronize_rcu();
> 
>       preempt_disable();
>       /* Do the work on each cpu */
>       smp_call_function_many(tracing_buffer_mask,
>                              enable_trace_buffered_event, NULL, 1);
>       preempt_enable();
> }
> 
> I agree with that.

Yes, I believe this should address this potential race condition.

An alternative would be instead to update
trace_event_buffer_lock_reserve() as follows:

        if (this_cpu_inc_return(trace_buffered_event_cnt) == 1) {
                smp_rmb();
                if ((entry = __this_cpu_read(trace_buffered_event))) {
                        [...]

That saves the synchronize_rcu() call but additionally modifies
trace_buffered_event_cnt even if trace_buffered_event is currently NULL.

Another alternative is the approach taken by my patch which avoids more
RCU work and unnecessary memory modifications.

I'd be interested if you could have a look again at what I'm proposing
in my patch. I think it simplifies the code while addressing these
problems as well. However, if you have reservations about that approach
then it is ok, I can fix the found problems individually as discussed.

Thanks,
Petr

Reply via email to