Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-12 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> The current implementation treats emulated memory devices, such as
> CXL 1.1 type 3 memory, as normal DRAM when they are emulated as normal
> memory (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> the demotion tables are recalculated.
>
> * Abstract common functions into `find_alloc_memory_type()`

We should move kmem_put_memory_types() (renamed to
mt_put_memory_types()?) too.  This can be put in a separate patch.
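
Something like the following in mm/memory-tiers.c, with the list passed in
so the caller keeps its own locking (untested sketch, not the final code):

void mt_put_memory_types(struct list_head *memory_types)
{
	struct memory_dev_type *mtype, *mtn;

	/* Caller serializes access to @memory_types (e.g. kmem_memory_type_lock) */
	list_for_each_entry_safe(mtype, mtn, memory_types, list) {
		list_del(&mtype->list);
		put_memory_type(mtype);
	}
}
EXPORT_SYMBOL_GPL(mt_put_memory_types);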

> Since different memory devices require finding or allocating a memory type,
> these common steps are abstracted into a single function,
> `find_alloc_memory_type()`, enhancing code scalability and conciseness.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Change adist calculation code to use another new lock, mt_perf_lock.
> In the current implementation, iterating through CPUless nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `mt_perf_lock` to protect
> `default_dram_perf`. This approach not only avoids the deadlock but also
> avoids holding one large lock across both paths.
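
A minimal sketch of that locking split (purely illustrative; the real
default_dram_perf data and the function signatures in mm/memory-tiers.c are
not reproduced here):

static DEFINE_MUTEX(mt_perf_lock);	/* protects default_dram_perf only */

/* Illustrative helper, not the patch's actual function */
static int record_default_dram_perf(int nid)
{
	mutex_lock(&mt_perf_lock);	/* never nested inside memory_tier_lock */
	/* ... record node @nid's performance as the DRAM baseline ... */
	mutex_unlock(&mt_perf_lock);
	return 0;
}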
>
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> ---
>  drivers/acpi/numa/hmat.c | 11 ++
>  drivers/dax/kmem.c   | 13 +--
>  include/linux/acpi.h |  6 
>  include/linux/memory-tiers.h |  8 +
>  mm/memory-tiers.c| 70 +---
>  5 files changed, 92 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index d6b85f0f6082..28812ec2c793 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -38,6 +38,8 @@ static LIST_HEAD(targets);
>  static LIST_HEAD(initiators);
>  static LIST_HEAD(localities);
>  
> +static LIST_HEAD(hmat_memory_types);
> +

HMAT isn't a device driver for memory devices.  So I don't think we
should manage memory types in HMAT.  Instead, if the memory_type of a
node isn't set by a driver, we should manage it in memory-tier.c as a
fallback.
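
I.e. roughly the following in memory-tier.c (sketch; the names and the exact
hook point are a guess):

static LIST_HEAD(default_memory_types);

/* Fallback when no driver (kmem, etc.) has set the node's memory type */
static struct memory_dev_type *mt_default_memory_type(int adist)
{
	/* Runs under memory_tier_lock, so no extra lock is needed here */
	return find_alloc_memory_type(adist, &default_memory_types);
}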

>  static DEFINE_MUTEX(target_lock);
>  
>  /*
> @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid,
>  }
>  EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL);
>  
> +struct memory_dev_type *hmat_find_alloc_memory_type(int adist)
> +{
> + return find_alloc_memory_type(adist, &hmat_memory_types);
> +}
> +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type);
> +
>  static __init void alloc_memory_initiator(unsigned int cpu_pxm)
>  {
>   struct memory_initiator *initiator;
> @@ -1038,6 +1046,9 @@ static __init int hmat_init(void)
>   if (!hmat_set_default_dram_perf())
>   register_mt_adistance_algorithm(&hmat_adist_nb);
>  
> + /* Post-create CPUless memory tiers after getting HMAT info */
> + memory_tier_late_init();
> +

This should be called in memory-tier.c via

late_initcall(memory_tier_late_init);

Then, we don't need hmat to call it.

>   return 0;
>  out_put:
>   hmat_free_structures();
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 42ee360cf4e3..aee17ab59f4f 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
>  
>  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
>  {
> - bool found = false;
>   struct memory_dev_type *mtype;
>  
>   mutex_lock(&kmem_memory_type_lock);
> - list_for_each_entry(mtype, &kmem_memory_types, list) {
> - if (mtype->adistance == adist) {
> - found = true;
> - break;
> - }
> - }
> - if (!found) {
> - mtype = alloc_memory_type(adist);
> - if (!IS_ERR(mtype))
> - list_add(&mtype->list, &kmem_memory_types);
> - }
> + mtype = find_alloc_memory_type(adist, &kmem_memory_types);
>   mutex_unlock(&kmem_memory_type_lock);
>  
>   return mtype;
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index b7165e52b3c6..3f927ff01f02 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -434,12 +434,18 @@ int thermal_acpi_critical_trip_temp(struct acpi_device 
> *adev, int *ret_temp);
>  
>  #ifdef CONFIG_ACPI_HMAT
>  int acpi_get_genport_coordinates(u32 uid, struct access_coordinate *coord);
> +struct memory_dev_type *hmat_fi

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-12 Thread Luis Machado
On 3/11/24 17:05, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
>> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
>>> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
 On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
>> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:

  Summary 

 In my (non-vhost experience) opinion the way to go would be either
 replacing the cond_resched with a hard schedule or setting the
 need_resched flag within vhost if a data transfer was successfully
 initiated. It will be necessary to check if this causes problems with
 other workloads/benchmarks.
>>>
>>> Yes but conceptually I am still in the dark on whether the fact that
>>> periodically invoking cond_resched is no longer sufficient to be nice to
>>> others is a bug, or intentional.  So you feel it is intentional?
>>
>> I would assume that cond_resched is still a valid concept.
>> But, in this particular scenario we have the following problem:
>>
>> So far (with CFS) we had:
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> 3. CFS gives priority to woken up task and schedules it
>> 4. kworker runs
>>
>> Now (with EEVDF) we have:
>> 0. In some cases, kworker has accumulated negative lag 
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> -3a. EEVDF does not schedule kworker if it has negative lag
>> -4a. vhost continues running, kworker on same CPU starves
>> --
>> -3b. EEVDF schedules kworker if it has positive or no lag
>> -4b. kworker runs
>>
>> In the 3a/4a case, the kworker is given no chance to set the
>> necessary flag. The flag can only be set by another CPU now.
>> The schedule of the kworker was not caused by cond_resched, but
>> rather by the wakeup path of the scheduler.
>>
>> cond_resched works successfully once the load balancer (I suppose) 
>> decides to migrate the vhost off to another CPU. In that case, the
>> load balancer on another CPU sets that flag and we are good.
>> That then eventually allows the scheduler to pick kworker, but very
>> late.
> 
> Are we going anywhere with this btw?
> 
>

I think Tobias had a couple other threads related to this, with other potential 
fixes:

https://lore.kernel.org/lkml/20240228161018.14253-1-husc...@linux.ibm.com/

https://lore.kernel.org/lkml/20240228161023.14310-1-husc...@linux.ibm.com/




Re: [PATCHv2 1/2] dt-bindings: usb: typec: anx7688: start a binding document

2024-03-12 Thread Krzysztof Kozlowski
On 11/03/2024 22:22, Pavel Machek wrote:
> Hi!
> 
>> Add binding for anx7688 usb type-c bridge. I don't have a datasheet,
>> but I did the best I could.
>>
>> Signed-off-by: Pavel Machek 
> 
> Any more comments here? The automatic system told me I need to replace one
> character...

Well, I often do not review patches which have build failures, i.e. were not
test-compiled. In terms of bindings, `dt_binding_check` is like the test
compilation of C, so just as no one reviews unbuildable C code, I don't
usually look at "unbuildable" bindings.

Plus, the original v2 patch went into some other email thread because of the
"References" field, so it more or less disappeared from my inbox.

Best regards,
Krzysztof




[PATCH 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a reader of the ring buffer is doing a poll, and waiting for the ring
buffer to hit a specific watermark, there could be a case where it gets
into an infinite ping-pong loop.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

In the poll, the shortest_full needs to be set before the
full_pending_waiters, as once that is set, the writer will compare the
current shortest_full (which is incorrect) to decide to call the irq_work,
which will reset the shortest_full (expecting the readers to update it).

Also move the setting of full_waiters_pending after the check if the ring
buffer has the required percentage filled. There's no reason to tell the
writer to wake up waiters if there are no waiters.
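
For reference, the writer-side check that this ordering pairs with looks
roughly like this (a sketch of rb_wakeups(), not the literal code):

static void rb_wakeups(struct trace_buffer *buffer,
		       struct ring_buffer_per_cpu *cpu_buffer)
{
	struct rb_irq_work *rbwork = &cpu_buffer->irq_work;

	/* ... the !full waiters are handled first ... */

	if (rbwork->full_waiters_pending &&
	    full_hit(buffer, cpu_buffer->cpu, cpu_buffer->shortest_full)) {
		/* The irq_work handler will reset shortest_full back to zero */
		rbwork->full_waiters_pending = false;
		rbwork->wakeup_full = true;
		irq_work_queue(&rbwork->work);
	}
}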

Cc: sta...@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b..adfe603a769b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   rbwork->full_waiters_pending = true;
if (!cpu_buffer->shortest_full ||
cpu_buffer->shortest_full > full)
cpu_buffer->shortest_full = full;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   } else {
-   poll_wait(filp, &rbwork->waiters, poll_table);
-   rbwork->waiters_pending = true;
+   if (full_hit(buffer, cpu, full))
+   return EPOLLIN | EPOLLRDNORM;
+   /*
+* Only allow full_waiters_pending update to be seen after
+* the shortest_full is set. If the writer sees the
+* full_waiters_pending flag set, it will compare the
+* amount in the ring buffer to shortest_full. If the amount
+* in the ring buffer is greater than the shortest_full
+* percent, it will call the irq_work handler to wake up
+* this list. The irq_handler will reset shortest_full
+* back to zero. That's done under the reader_lock, but
+* the below smp_mb() makes sure that the update to
+* full_waiters_pending doesn't leak up into the above.
+*/
+   smp_mb();
+   rbwork->full_waiters_pending = true;
+   return 0;
}
 
+   poll_wait(filp, &rbwork->waiters, poll_table);
+   rbwork->waiters_pending = true;
+
/*
 * There's a tight race between setting the waiters_pending and
 * checking if the ring buffer is empty.  Once the waiters_pending bit
@@ -989,9 +1005,

[PATCH 0/2] ring-buffer: Fix poll wakeup logic

2024-03-12 Thread Steven Rostedt


After making a slight change to wakeups in ring_buffer_wait()
the system would hang. Spending several hours going on a wild goose
chase I found that the change only triggered the bug because it
changed the timings. The bug was there before the update but never
was triggered.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

The race was triggered when running:

  trace-cmd record -p function -m 5000

Which enables function tracing and then creates two files it is writing
into where each is 2500K in size. The -m is a "max file size". When
trace-cmd writes 2500K to one file it then switches to the other, erasing
the old data. To do this, trace-cmd switches between both poll and
the reader using both methods of wake up. The change to the reader wakeup
was able to change the way the poll was woken to trigger this bug.

The second patch is a clean up and also a way to consolidate the logic
of the shortest_full. The read wakeup uses rb_watermark_hit for both
full wakeups and !full wakeups. But since poll uses the same logic for
full wakeups it can just call that function with full set.


Steven Rostedt (Google) (2):
  ring-buffer: Fix full_waiters_pending in poll
  ring-buffer: Reuse rb_watermark_hit() for the poll logic


 kernel/trace/ring_buffer.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)



[PATCH 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The check for knowing if the poll should wait or not is basically the
exact same logic as rb_watermark_hit(). The only difference is that
rb_watermark_hit() also handles the !full case. But for the full case, the
logic is the same. Just call that instead of duplicating the code in
ring_buffer_poll_wait().

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index adfe603a769b..6ef763f57c66 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -963,21 +963,16 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
 
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
-   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   if (!cpu_buffer->shortest_full ||
-   cpu_buffer->shortest_full > full)
-   cpu_buffer->shortest_full = full;
-   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   if (full_hit(buffer, cpu, full))
+   if (rb_watermark_hit(buffer, cpu, full))
return EPOLLIN | EPOLLRDNORM;
/*
 * Only allow full_waiters_pending update to be seen after
-* the shortest_full is set. If the writer sees the
-* full_waiters_pending flag set, it will compare the
-* amount in the ring buffer to shortest_full. If the amount
-* in the ring buffer is greater than the shortest_full
-* percent, it will call the irq_work handler to wake up
-* this list. The irq_handler will reset shortest_full
+* the shortest_full is set (in rb_watermark_hit). If the
+* writer sees the full_waiters_pending flag set, it will
+* compare the amount in the ring buffer to shortest_full.
+* If the amount in the ring buffer is greater than the
+* shortest_full percent, it will call the irq_work handler
+* to wake up this list. The irq_handler will reset 
shortest_full
 * back to zero. That's done under the reader_lock, but
 * the below smp_mb() makes sure that the update to
 * full_waiters_pending doesn't leak up into the above.
-- 
2.43.0





[PATCH v3 1/2] ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Convert ring_buffer_wait() over to wait_event_interruptible(). The default
condition is to execute the wait loop inside __wait_event() just once.

This does not change the ring_buffer_wait() prototype yet, but
restructures the code so that it can take a "cond" and "data" parameter
and will call wait_event_interruptible() with a helper function as the
condition.

The helper function (rb_wait_cond) takes the cond function and data
parameters. It will first check if the buffer hit the watermark defined by
the "full" parameter and then call the passed in condition parameter. If
either are true, it returns true.

If rb_wait_cond() does not return true, it will set the appropriate
"waiters_pending" flag and returns false.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |   1 +
 kernel/trace/ring_buffer.c  | 116 +---
 2 files changed, 69 insertions(+), 48 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..338a33db1577 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,6 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6ef763f57c66..c198ba466853 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -842,43 +842,15 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
return ret;
 }
 
-/**
- * ring_buffer_wait - wait for input to the ring buffer
- * @buffer: buffer to wait on
- * @cpu: the cpu buffer to wait on
- * @full: wait until the percentage of pages are available, if @cpu != 
RING_BUFFER_ALL_CPUS
- *
- * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
- * as data is added to any of the @buffer's cpu buffers. Otherwise
- * it will wait for data to be added to a specific cpu buffer.
- */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+static inline bool
+rb_wait_cond(struct rb_irq_work *rbwork, struct trace_buffer *buffer,
+int cpu, int full, ring_buffer_cond_fn cond, void *data)
 {
-   struct ring_buffer_per_cpu *cpu_buffer;
-   DEFINE_WAIT(wait);
-   struct rb_irq_work *work;
-   int ret = 0;
-
-   /*
-* Depending on what the caller is waiting for, either any
-* data in any cpu buffer, or a specific buffer, put the
-* caller on the appropriate wait queue.
-*/
-   if (cpu == RING_BUFFER_ALL_CPUS) {
-   work = &buffer->irq_work;
-   /* Full only makes sense on per cpu reads */
-   full = 0;
-   } else {
-   if (!cpumask_test_cpu(cpu, buffer->cpumask))
-   return -ENODEV;
-   cpu_buffer = buffer->buffers[cpu];
-   work = &cpu_buffer->irq_work;
-   }
+   if (rb_watermark_hit(buffer, cpu, full))
+   return true;
 
-   if (full)
-   prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-   else
-   prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+   if (cond(data))
+   return true;
 
/*
 * The events can happen in critical sections where
@@ -901,27 +873,75 @@ int ring_buffer_wait(struct trace_buffer *buffer, int 
cpu, int full)
 * a task has been queued. It's OK for spurious wake ups.
 */
if (full)
-   work->full_waiters_pending = true;
+   rbwork->full_waiters_pending = true;
else
-   work->waiters_pending = true;
+   rbwork->waiters_pending = true;
 
-   if (rb_watermark_hit(buffer, cpu, full))
-   goto out;
+   return false;
+}
 
-   if (signal_pending(current)) {
-   ret = -EINTR;
-   goto out;
+/*
+ * The default wait condition for ring_buffer_wait() is to just to exit the
+ * wait loop the first time it is woken up.
+ */
+static bool rb_wait_once(void *data)
+{
+   long *once = data;
+
+   /* wait_event() actually calls this twice before scheduling*/
+   if (*once > 1)
+   return true;
+
+   (*once)++;
+   return false;
+}
+
+/**
+ * ring_buffer_wait - wait for input to the ring buffer
+ * @buffer: buffer to wait on
+ * @cpu: the cpu buffer to 

[PATCH v3 0/2] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-12 Thread Steven Rostedt



A patch was sent to "fix" the wait_index variable that is used to help with
waking of waiters on the ring buffer. The patch was rejected, but I started
looking at associated code. Discussing it on IRC with Mathieu Desnoyers
we discovered a design flaw.

The waiter reads "wait_index" then enters a "wait loop". After adding
itself to the wait queue, if the buffer is filled or a signal is pending
it will exit out. If wait_index is different, it will also exit out.
(Note, I noticed that the check for wait_index was after the schedule()
call and should have been before it, but that's besides the point).

The race is what happens if the waker updates the wait_index,
a new waiter comes in and reads the updated wait_index before it adds
itself to the queue, it will miss the update!

These patches fix the design by converting the ring_buffer_wait()
to use the wait_event_interruptible() interface.

Then the wait_to_pipe() caller will pass its own condition into
the ring_buffer_wait() to be checked by the wait_event_interruptible().

- The first patch restructures the ring_buffer_wait() to use the
  wait_event_interruptible() logic. It does not change the interface,
  but adds a local "cond" condition callback function that will be
  what it defaults to in the second patch if NULL is passed in.

  The default cond function is just a "wait once" function. That
  is, the first time it is called (before the wait_event_interruptible()
  calls schedule) will set the "once" variable to one and return
  false. The next time it is called (after wait_event_interruptible()
  wakes up) it will return true to break out of the loop.

- The second patch changes the ring_buffer_wait() interface to allow
  the caller to define its own "cond" callback. That will be checked
  in wait_event_interruptible() along with checking if the proper
  amount of data in the ring buffer has been hit.

Changes since v2: 
https://lore.kernel.org/all/20240308202402.234176...@goodmis.org/

- Patches 1-3 of v2 have been accepted.

- Instead of breaking ring_buffer_wait() into three functions that do
  a prepare_to_wait(), wait() and finish_wait(), have ring_buffer_wait()
  take a condition callback function and a data parameter.

  The ring_buffer_wait() now uses a wait_event_interruptible() code,
  and added a helper function to do check its own conditions, and also
  to call the passed in condition function and allow the caller to
  specify other conditions to break out of the event loop.

Changes since v1: 
https://lore.kernel.org/lkml/20240308183816.676883...@goodmis.org/

- My tests triggered a warning about calling a mutex_lock() after a
  prepare_to_wait() that changed the task's state. Convert the affected
  mutex over to a spinlock.



Steven Rostedt (Google) (2):
  ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
  tracing/ring-buffer: Fix wait_on_pipe() race


 include/linux/ring_buffer.h  |   4 +-
 include/linux/trace_events.h |   5 +-
 kernel/trace/ring_buffer.c   | 117 +--
 kernel/trace/trace.c |  43 +++-
 4 files changed, 107 insertions(+), 62 deletions(-)



[PATCH v3 2/2] tracing/ring-buffer: Fix wait_on_pipe() race

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0  CPU 1
 -  -
   wait_index++;
   index = wait_index;
   ring_buffer_wake_waiters();
   wait_on_pipe()
 ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.
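
For illustration, the condition callback that wait_on_pipe() passes in could
look roughly like this (sketch; apart from wait_index and closed, the names
here are not necessarily the patch's):

struct pipe_wait {
	struct trace_iterator	*iter;
	int			wait_index;
};

static bool wait_pipe_cond(void *data)
{
	struct pipe_wait *pwait = data;
	struct trace_iterator *iter = pwait->iter;

	/* Break out if the waker bumped wait_index since we sampled it */
	if (atomic_read_acquire(&iter->wait_index) != pwait->wait_index)
		return true;

	/* ... or if the file descriptor has been closed under us */
	return iter->closed;
}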

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 11 -
 kernel/trace/trace.c | 43 ++--
 4 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577..dc5ae4e96aee 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..fc6d0af56bb1 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
unsigned inttemp_size;
char*fmt;   /* modified format holder */
unsigned intfmt_size;
-   longwait_index;
+   atomic_twait_index;
 
/* trace_seq for __print_flags() and __print_symbolic() etc. */
struct trace_seqtmp_seq;
 
cpumask_var_t   started;
 
+   /* Set when the file is closed to prevent new waiters */
+   boolclosed;
+
/* it's true when current open file is snapshot */
boolsnapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c198ba466853..81a5303bdc09 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -906,18 +906,19 @@ static bool rb_wait_once(void *data)
  * as data is added to any of the @buffer's cpu buffers. Otherwise
  * it will wait for data to be added to a specific cpu buffer.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data)
 {
struct ring_buffer_per_cpu *cpu_buffer;
struct wait_queue_head *waitq;
-   ring_buffer_cond_fn cond;
struct rb_irq_work *rbwork;
-   void *data;
long once = 0;
int ret = 0;
 
-   cond = rb_wait_once;
-   data = &once;
+   if (!cond) {
+   cond = rb_wait_once;
+   data = &once;
+   }
 
/*
 * Depending on what the calle

Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Krzysztof Kozlowski
On 11/03/2024 17:27, Tanmay Shah wrote:
>>> +then:
>>> +  patternProperties:
>>> +"^r5f@[0-9a-f]+$":
>>> +  type: object
>>> +
>>> +  properties:
>>> +reg:
>>> +  minItems: 1
>>> +  items:
>>> +- description: ATCM internal memory
>>> +- description: BTCM internal memory
>>> +- description: extra ATCM memory in lockstep mode
>>> +- description: extra BTCM memory in lockstep mode
>>> +
>>> +reg-names:
>>> +  minItems: 1
>>> +  items:
>>> +- const: atcm0
>>> +- const: btcm0
>>> +- const: atcm1
>>> +- const: btcm1
>>
>> Why power domains are flexible?
> 
> User may not want to use all the TCMs. For example, if users want to turn on
> only TCM-A and keep the rest off, then
> 
> they can avoid having power-domains of the other TCMs in the device tree. This
> helps reduce power consumption when needed.
> 
> Hence the flexible power-domains list.
> 

Isn't turning things on/off the driver's job? Sorry, but what is "user" here? DTS
describes bindings, not OS policy.

Also, please wrap your replies to match email style.

> I can certainly mention "items:" under power-domains property.
> 
> 
>>


Best regards,
Krzysztof




Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Krzysztof Kozlowski
On 11/03/2024 19:39, Tanmay Shah wrote:
>>> +
>>> +else:
>>> +  patternProperties:
>>> +"^r5f@[0-9a-f]+$":
>>> +  type: object
>>> +
>>> +  properties:
>>> +reg:
>>> +  minItems: 1
>>> +  items:
>>> +- description: ATCM internal memory
>>> +- description: BTCM internal memory
>>> +
>>> +reg-names:
>>> +  minItems: 1
>>> +  items:
>>> +- const: atcm0
>>> +- const: btcm0
>>> +
>>> +power-domains:
>>> +  maxItems: 3
>>
>> Please list power domains.
> 
> Hello,
> 
> Sent v13 addressing both comments.
> 

And gave me exactly two hours to disagree?

Best regards,
Krzysztof




Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Krzysztof Kozlowski
On 11/03/2024 18:59, Tanmay Shah wrote:
> From: Radhey Shyam Pandey 
> 
> Introduce bindings for the TCM memory address space on the AMD-Xilinx Zynq
> UltraScale+ platform. This helps define the TCM in the device tree
> and makes its access platform agnostic and data-driven.
> 
> Tightly-coupled memories (TCMs) are low-latency memories that provide
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
> 
> The TCM resources (reg, reg-names and power-domains) are documented for
> each TCM in the R5 node. The reg and reg-names are made required
> properties as we don't want to hardcode TCM addresses for future
> platforms, and for the zu+ legacy implementation we will ensure that
> old dts files without reg/reg-names keep working and a stable ABI is
> maintained.
> 
> It also extends the examples for TCM split and lockstep modes.
> 
> Signed-off-by: Radhey Shyam Pandey 
> Signed-off-by: Tanmay Shah 
> ---
> 
> Changes in v13:
>   - Have power-domains property for lockstep case instead of
> keeping it flexible.
>   - Add "items:" list in power-domains property


Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




[PATCH v4 1/2] ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Convert ring_buffer_wait() over to wait_event_interruptible(). The default
condition is to execute the wait loop inside __wait_event() just once.

This does not change the ring_buffer_wait() prototype yet, but
restructures the code so that it can take a "cond" and "data" parameter
and will call wait_event_interruptible() with a helper function as the
condition.

The helper function (rb_wait_cond) takes the cond function and data
parameters. It will first check if the buffer hit the watermark defined by
the "full" parameter and then call the passed in condition parameter. If
either are true, it returns true.

If rb_wait_cond() does not return true, it will set the appropriate
"waiters_pending" flag and returns false.

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |   1 +
 kernel/trace/ring_buffer.c  | 116 +---
 2 files changed, 69 insertions(+), 48 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..338a33db1577 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,6 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
__ring_buffer_alloc((size), (flags), &__key);   \
 })
 
+typedef bool (*ring_buffer_cond_fn)(void *data);
 int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 6ef763f57c66..c198ba466853 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -842,43 +842,15 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
return ret;
 }
 
-/**
- * ring_buffer_wait - wait for input to the ring buffer
- * @buffer: buffer to wait on
- * @cpu: the cpu buffer to wait on
- * @full: wait until the percentage of pages are available, if @cpu != 
RING_BUFFER_ALL_CPUS
- *
- * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
- * as data is added to any of the @buffer's cpu buffers. Otherwise
- * it will wait for data to be added to a specific cpu buffer.
- */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+static inline bool
+rb_wait_cond(struct rb_irq_work *rbwork, struct trace_buffer *buffer,
+int cpu, int full, ring_buffer_cond_fn cond, void *data)
 {
-   struct ring_buffer_per_cpu *cpu_buffer;
-   DEFINE_WAIT(wait);
-   struct rb_irq_work *work;
-   int ret = 0;
-
-   /*
-* Depending on what the caller is waiting for, either any
-* data in any cpu buffer, or a specific buffer, put the
-* caller on the appropriate wait queue.
-*/
-   if (cpu == RING_BUFFER_ALL_CPUS) {
-   work = &buffer->irq_work;
-   /* Full only makes sense on per cpu reads */
-   full = 0;
-   } else {
-   if (!cpumask_test_cpu(cpu, buffer->cpumask))
-   return -ENODEV;
-   cpu_buffer = buffer->buffers[cpu];
-   work = &cpu_buffer->irq_work;
-   }
+   if (rb_watermark_hit(buffer, cpu, full))
+   return true;
 
-   if (full)
-   prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-   else
-   prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+   if (cond(data))
+   return true;
 
/*
 * The events can happen in critical sections where
@@ -901,27 +873,75 @@ int ring_buffer_wait(struct trace_buffer *buffer, int 
cpu, int full)
 * a task has been queued. It's OK for spurious wake ups.
 */
if (full)
-   work->full_waiters_pending = true;
+   rbwork->full_waiters_pending = true;
else
-   work->waiters_pending = true;
+   rbwork->waiters_pending = true;
 
-   if (rb_watermark_hit(buffer, cpu, full))
-   goto out;
+   return false;
+}
 
-   if (signal_pending(current)) {
-   ret = -EINTR;
-   goto out;
+/*
+ * The default wait condition for ring_buffer_wait() is to just to exit the
+ * wait loop the first time it is woken up.
+ */
+static bool rb_wait_once(void *data)
+{
+   long *once = data;
+
+   /* wait_event() actually calls this twice before scheduling*/
+   if (*once > 1)
+   return true;
+
+   (*once)++;
+   return false;
+}
+
+/**
+ * ring_buffer_wait - wait for input to the ring buffer
+ * @buffer: buffer to wait on
+ * @cpu: the cpu buffer to 

[PATCH v4 2/2] tracing/ring-buffer: Fix wait_on_pipe() race

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When the trace_pipe_raw file is closed, there should be no new readers on
the file descriptor. This is mostly handled with the waking and wait_index
fields of the iterator. But there's still a slight race.

 CPU 0  CPU 1
 -  -
   wait_index++;
   index = wait_index;
   ring_buffer_wake_waiters();
   wait_on_pipe()
 ring_buffer_wait();

The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is
that the ring_buffer_wait() needs the logic of:

prepare_to_wait();
if (!condition)
schedule();

Where the missing condition check is the iter->wait_index update.

Have the ring_buffer_wait() take a conditional callback function and a
data parameter that can be used within the wait_event_interruptible() of
the ring_buffer_wait() function.

In wait_on_pipe(), pass a condition function that will check if the
wait_index has been updated, if it has, it will return true to break out
of the wait_event_interruptible() loop.

Create a new field "closed" in the trace_iterator and set it in the
.flush() callback before calling ring_buffer_wake_waiters().
This will keep any new readers from waiting on a closed file descriptor.

Have the wait_on_pipe() condition callback also check the closed field.

Change the wait_index field of the trace_iterator to atomic_t. There's no
reason it needs to be 'long' and making it atomic and using
atomic_read_acquire() and atomic_fetch_inc_release() will provide the
necessary memory barriers.

Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after
one more try to fetch data. That is, if it waited for data and something
woke it up, it should try to collect any new data and then exit back to
user space.
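
For illustration, the release side of that pairing in the .flush() callback
could be roughly as below (sketch; the surrounding function and the
ftrace_buffer_info layout are assumptions, not the patch text):

static int tracing_buffers_flush_sketch(struct file *file, fl_owner_t id)
{
	struct ftrace_buffer_info *info = file->private_data;
	struct trace_iterator *iter = &info->iter;

	iter->closed = true;
	/* The release pairs with atomic_read_acquire() in the waiter */
	atomic_fetch_inc_release(&iter->wait_index);
	/* Make sure any waiters see the update and re-check their condition */
	ring_buffer_wake_waiters(iter->array_buffer->buffer, iter->cpu_file);
	return 0;
}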

Link: 
https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the 
file")
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  |  3 ++-
 include/linux/trace_events.h |  5 -
 kernel/trace/ring_buffer.c   | 13 ++-
 kernel/trace/trace.c | 43 ++--
 4 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 338a33db1577..dc5ae4e96aee 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, 
struct lock_class_key *k
 })
 
 typedef bool (*ring_buffer_cond_fn)(void *data);
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
  struct file *filp, poll_table *poll_table, int full);
 void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..fc6d0af56bb1 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -103,13 +103,16 @@ struct trace_iterator {
unsigned inttemp_size;
char*fmt;   /* modified format holder */
unsigned intfmt_size;
-   longwait_index;
+   atomic_twait_index;
 
/* trace_seq for __print_flags() and __print_symbolic() etc. */
struct trace_seqtmp_seq;
 
cpumask_var_t   started;
 
+   /* Set when the file is closed to prevent new waiters */
+   boolclosed;
+
/* it's true when current open file is snapshot */
boolsnapshot;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c198ba466853..67d8405f4451 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -901,23 +901,26 @@ static bool rb_wait_once(void *data)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
  * @full: wait until the percentage of pages are available, if @cpu != 
RING_BUFFER_ALL_CPUS
+ * @cond: condition function to break out of wait (NULL to run once)
+ * @data: the data to pass to @cond.
  *
  * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
  * as data is added to any of the @buffer's cpu buffers. Otherwise
  * it will wait for data to be added to a specific cpu buffer.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full,
+ring_buffer_cond_fn cond, void *data)
 {
struct ring_buffer_per_cpu *cpu_buffer;
struct wait_queue_hea

[PATCH v4 0/2] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-12 Thread Steven Rostedt
[ Sorry for the spam, but I just noticed kernel-test-robot complained
  about KernelDoc format, which this fixes ]


A patch was sent to "fix" the wait_index variable that is used to help with
waking of waiters on the ring buffer. The patch was rejected, but I started
looking at associated code. Discussing it on IRC with Mathieu Desnoyers
we discovered a design flaw.

The waiter reads "wait_index" then enters a "wait loop". After adding
itself to the wait queue, if the buffer is filled or a signal is pending
it will exit out. If wait_index is different, it will also exit out.
(Note, I noticed that the check for wait_index was after the schedule()
call and should have been before it, but that's besides the point).

The race is what happens if the waker updates the wait_index,
a new waiter comes in and reads the updated wait_index before it adds
itself to the queue, it will miss the update!

These patches fix the design by converting the ring_buffer_wait()
to use the wait_event_interruptible() interface.

Then the wait_to_pipe() caller will pass its own condition into
the ring_buffer_wait() to be checked by the wait_event_interruptible().

- The first patch restructures the ring_buffer_wait() to use the
  wait_event_interruptible() logic. It does not change the interface,
  but adds a local "cond" condition callback function that will be
  what it defaults to in the second patch if NULL is passed in.

  The default cond function is just a "wait once" function. That
  is, the first time it is called (before the wait_event_interruptible()
  calls schedule) will set the "once" variable to one and return
  false. The next time it is called (after wait_event_interruptible()
  wakes up) it will return true to break out of the loop.

- The second patch changes the ring_buffer_wait() interface to allow
  the caller to define its own "cond" callback. That will be checked
  in wait_event_interruptible() along with checking if the proper
  amount of data in the ring buffer has been hit.

Changes since v3: 
https://lore.kernel.org/all/20240312120252.998983...@goodmis.org/

- I should have checked before sending v2, but after I did, I noticed
  that kernel-test-robot reported that the cond and data parameters
  added to ring_buffer_wait() were not added to the kerneldoc above it.

Changes since v2: 
https://lore.kernel.org/all/20240308202402.234176...@goodmis.org/

- Patches 1-3 of v2 have been accepted.

- Instead of breaking ring_buffer_wait() into three functions that do
  a prepare_to_wait(), wait() and finish_wait(), have ring_buffer_wait()
  take a condition callback function and a data parameter.

  The ring_buffer_wait() now uses a wait_event_interruptible() code,
  and added a helper function to do check its own conditions, and also
  to call the passed in condition function and allow the caller to
  specify other conditions to break out of the event loop.

Changes since v1: 
https://lore.kernel.org/lkml/20240308183816.676883...@goodmis.org/

- My tests triggered a warning about calling a mutex_lock() after a
  prepare_to_wait() that changed the task's state. Convert the affected
  mutex over to a spinlock.


Steven Rostedt (Google) (2):
  ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
  tracing/ring-buffer: Fix wait_on_pipe() race


 include/linux/ring_buffer.h  |   4 +-
 include/linux/trace_events.h |   5 +-
 kernel/trace/ring_buffer.c   | 119 ++-
 kernel/trace/trace.c |  43 +++-
 4 files changed, 109 insertions(+), 62 deletions(-)



[PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The check for knowing if the poll should wait or not is basically the
exact same logic as rb_watermark_hit(). The only difference is that
rb_watermark_hit() also handles the !full case. But for the full case, the
logic is the same. Just call that instead of duplicating the code in
ring_buffer_poll_wait().

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index adfe603a769b..857803e8cf07 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -959,25 +959,18 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
}
 
if (full) {
-   unsigned long flags;
-
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
-   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   if (!cpu_buffer->shortest_full ||
-   cpu_buffer->shortest_full > full)
-   cpu_buffer->shortest_full = full;
-   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   if (full_hit(buffer, cpu, full))
+   if (rb_watermark_hit(buffer, cpu, full))
return EPOLLIN | EPOLLRDNORM;
/*
 * Only allow full_waiters_pending update to be seen after
-* the shortest_full is set. If the writer sees the
-* full_waiters_pending flag set, it will compare the
-* amount in the ring buffer to shortest_full. If the amount
-* in the ring buffer is greater than the shortest_full
-* percent, it will call the irq_work handler to wake up
-* this list. The irq_handler will reset shortest_full
+* the shortest_full is set (in rb_watermark_hit). If the
+* writer sees the full_waiters_pending flag set, it will
+* compare the amount in the ring buffer to shortest_full.
+* If the amount in the ring buffer is greater than the
+* shortest_full percent, it will call the irq_work handler
+* to wake up this list. The irq_handler will reset 
shortest_full
 * back to zero. That's done under the reader_lock, but
 * the below smp_mb() makes sure that the update to
 * full_waiters_pending doesn't leak up into the above.
-- 
2.43.0





[PATCH v2 0/2] ring-buffer: Fix poll wakeup logic

2024-03-12 Thread Steven Rostedt


After making a slight change to wakeups in ring_buffer_wait()
the system would hang. Spending several hours going on a wild goose
chase I found that the change only triggered the bug because it
changed the timings. The bug was there before the update but never
was triggered.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

The race was triggered when running:

  trace-cmd record -p function -m 5000

Which enables function tracing and then creates two files it is writing
into where each is 2500K in size. The -m is a "max file size". When
trace-cmd writes 2500K to one file it then switches to the other, erasing
the old data. To do this, trace-cmd switches between both poll and
the reader using both methods of wake up. The change to the reader wakeup
was able to change the way the poll was woken to trigger this bug.

The second patch is a clean up and also a way to consolidate the logic
of the shortest_full. The read wakeup uses rb_watermark_hit for both
full wakeups and !full wakeups. But since poll uses the same logic for
full wakeups it can just call that function with full set.

Changes since v1: 
https://lore.kernel.org/all/20240312115455.666920...@goodmis.org/

- Removed unused 'flags' in ring_buffer_poll_wait() as the spin_lock
  is now in rb_watermark_hit().


Steven Rostedt (Google) (2):
  ring-buffer: Fix full_waiters_pending in poll
  ring-buffer: Reuse rb_watermark_hit() for the poll logic


 kernel/trace/ring_buffer.c | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)



[PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If a reader of the ring buffer is doing a poll, and waiting for the ring
buffer to hit a specific watermark, there could be a case where it gets
into an infinite ping-pong loop.

The poll code has:

  rbwork->full_waiters_pending = true;
  if (!cpu_buffer->shortest_full ||
  cpu_buffer->shortest_full > full)
 cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

CPU 0   CPU 1
-   -
 [ Poll ]
   [ shortest_full = 0 ]
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]
   if ([ buffer percent ] > full)
  break;
   rbwork->full_waiters_pending = true;
  if (rbwork->full_waiters_pending &&
  [ buffer percent ] > 
shortest_full) {
 rbwork->wakeup_full = true;
 [ queue_irqwork ]

   cpu_buffer->shortest_full = full;

  [ IRQ work ]
  if (rbwork->wakeup_full) {
cpu_buffer->shortest_full = 0;
wakeup poll waiters;
  [woken]

 [ Wash, rinse, repeat! ]

In the poll, the shortest_full needs to be set before the
full_pending_waiters, as once that is set, the writer will compare the
current shortest_full (which is incorrect) to decide to call the irq_work,
which will reset the shortest_full (expecting the readers to update it).

Also move the setting of full_waiters_pending after the check if the ring
buffer has the required percentage filled. There's no reason to tell the
writer to wake up waiters if there are no waiters.

Cc: sta...@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 27 ---
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa332ace108b..adfe603a769b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
*buffer, int cpu,
poll_wait(filp, &rbwork->full_waiters, poll_table);
 
raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
-   rbwork->full_waiters_pending = true;
if (!cpu_buffer->shortest_full ||
cpu_buffer->shortest_full > full)
cpu_buffer->shortest_full = full;
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
-   } else {
-   poll_wait(filp, &rbwork->waiters, poll_table);
-   rbwork->waiters_pending = true;
+   if (full_hit(buffer, cpu, full))
+   return EPOLLIN | EPOLLRDNORM;
+   /*
+* Only allow full_waiters_pending update to be seen after
+* the shortest_full is set. If the writer sees the
+* full_waiters_pending flag set, it will compare the
+* amount in the ring buffer to shortest_full. If the amount
+* in the ring buffer is greater than the shortest_full
+* percent, it will call the irq_work handler to wake up
+* this list. The irq_handler will reset shortest_full
+* back to zero. That's done under the reader_lock, but
+* the below smp_mb() makes sure that the update to
+* full_waiters_pending doesn't leak up into the above.
+*/
+   smp_mb();
+   rbwork->full_waiters_pending = true;
+   return 0;
}
 
+   poll_wait(filp, &rbwork->waiters, poll_table);
+   rbwork->waiters_pending = true;
+
/*
 * There's a tight race between setting the waiters_pending and
 * checking if the ring buffer is empty.  Once the waiters_pending bit
@@ -989,9 +1005,

Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-12 Thread Mark Rutland
Hi Bjorn

(apologies, my corporate mail server has butchered your name here).

There's a big info dump below; I realise this sounds like a sales pitch for
CALL_OPS, but my intent is more to say "here are some dragons you may not have
spotted".

On Thu, Mar 07, 2024 at 08:27:40PM +0100, Björn Töpel wrote:
> Puranjay!
> 
> Puranjay Mohan  writes:
> 
> > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
> > This allows each ftrace callsite to provide an ftrace_ops to the common
> > ftrace trampoline, allowing each callsite to invoke distinct tracer
> > functions without the need to fall back to list processing or to
> > allocate custom trampolines for each callsite. This significantly speeds
> > up cases where multiple distinct trace functions are used and callsites
> > are mostly traced by a single tracer.
> >
> > The idea and most of the implementation is taken from the ARM64's
> > implementation of the same feature. The idea is to place a pointer to
> > the ftrace_ops as a literal at a fixed offset from the function entry
> > point, which can be recovered by the common ftrace trampoline.
> 
> Not really a review, but some more background; Another rationale (on-top
> of the improved per-call performance!) for CALL_OPS was to use it to
> build ftrace direct call support (which BPF uses a lot!). Mark, please
> correct me if I'm lying here!

Yep; it gives us the ability to do a number of per-callsite things, including
direct calls.

> On Arm64, CALL_OPS makes it possible to implement direct calls, while
> only patching one BL instruction -- nice!

The key thing here isn't that we patch a single instruction (since we have to
patch the ops pointer too!); it's that we can safely patch either of the ops
pointer or BL/NOP at any time while threads are concurrently executing.

If you have a multi-instruction sequence, then threads can be preempted
mid-sequence, and it's very painful/complex to handle all of the races that
entails.

For example, if your callsites use a sequence:

	AUIPC <tmp>, <funcptr>
	JALR <link>, <funcptr>(<tmp>)

Using stop_machine() won't allow you to patch that safely as some threads
could be stuck mid-sequence, e.g.

	AUIPC <tmp>, <funcptr>
	[ preempted here ]
	JALR <link>, <funcptr>(<tmp>)

... and you can't update the JALR to use a new funcptr immediate until those
have completed the sequence.

There are ways around that, but they're complicated and/or expensive, e.g.

* Use a sequence of multiple patches, starting with replacing the JALR with an
  exception-generating instruction with a fixup handler, which is sort-of what
  x86 does with UD2. This may require multiple passes with
  synchronize_rcu_tasks() to make sure all threads have seen the latest
  instructions, and that cannot be done under stop_machine(), so if you need
  stop_machine() for CMODx reasons, you may need to use that several times with
  intervening calls to synchronize_rcu_tasks().

* Have the patching logic manually go over each thread and fix up the pt_regs
  for the interrupted thread. This is pretty horrid since you could have nested
  exceptions and a task could have several pt_regs which might require
  updating.

The CALL_OPS approach is a bit easier to deal with as we can patch the
per-callsite pointer atomically, then we can (possibly) enable/disable the
callsite's branch, then wait for threads to drain once. 
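
To illustrate the recovery step in C (a rough sketch only -- the offset
constant and helper name are invented here and do not match the arm64 or
RISC-V code): the ops pointer lives as a literal at a fixed offset before
the patched callsite, so the trampoline can load it from the callsite
address alone, with no list processing.

	#include <stdint.h>

	struct ftrace_ops;	/* opaque for the sketch */

	/* Hypothetical distance from the patched branch back to the literal. */
	#define CALL_OPS_LITERAL_OFFSET	16

	static struct ftrace_ops *callsite_ops(uintptr_t callsite)
	{
		/* Load the per-callsite ops pointer placed by the patching code. */
		return *(struct ftrace_ops **)(callsite - CALL_OPS_LITERAL_OFFSET);
	}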

As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE
generally in this area (CALL_OPS happens to side-step those, but trampoline
usage is currently affected):

  https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/

... I'm looking into fixing that at the moment, and it looks like that's likely
to require some per-architecture changes.

> On RISC-V we cannot use the same ideas as Arm64 straight off,
> because the range of jal (compare to BL) is simply too short (+/-1M).
> So, on RISC-V we need to use a full auipc/jal pair (the text patching
> story is another chapter, but let's leave that aside for now). Since we
> have to patch multiple instructions, the cmodx situation doesn't really
> improve with CALL_OPS.

The branch range thing is annoying, but I think this boils down to the same
problem as arm64 has with needing a "MOV <tmp>, LR" instruction that we have to
patch in once at boot time. You could do the same and patch in the AUIPC once,
e.g. have

|   NOP
|   NOP 
| func:
|   AUIPC <tmp>, <offset>
|   JALR <tmp>, <offset>(<tmp>) // patched with NOP

... which'd look very similar to arm64's sequence:

|   NOP
|   NOP
| func:
|   MOV X9, LR
|   BL ftrace_caller // patched with NOP

... which I think means it *might* be better from a cmodx perspective?

> Let's say that we continue building on your patch and implement direct
> calls on CALL_OPS for RISC-V as well.
> 
> From Florent's commit message for direct calls:
> 
>   |There are a few cases to distinguish:
>   |- If a direct call ops is the only one tracing a function:
>   |  - If the direct called trampoline is within

Re: [PATCHv2 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-03-12 Thread Heikki Krogerus
Hi Pavel,

I'm sorry to keep you waiting.

On Fri, Feb 23, 2024 at 10:28:49PM +0100, Pavel Machek wrote:
> From: Ondrej Jirman 
> 
> This is a driver for ANX7688 USB-C HDMI, with flashing and debugging
> features removed. ANX7688 is a rather critical piece on PinePhone,
> there's no display and no battery charging without it.
> 
> There's likely more work to be done here, but having basic support
> in mainline is needed to be able to work on the other stuff
> (networking, cameras, power management).
> 
> Signed-off-by: Ondrej Jirman 
> Co-developed-by: Martijn Braam 
> Co-developed-by: Samuel Holland 
> Signed-off-by: Pavel Machek 
> 
> ---
> 
> v2: Fix checkpatch stuff. Some cleanups, adapt to dts format in 1/2.
> 
> diff --git a/drivers/usb/typec/Kconfig b/drivers/usb/typec/Kconfig
> index 2f80c2792dbd..c9043ae61546 100644
> --- a/drivers/usb/typec/Kconfig
> +++ b/drivers/usb/typec/Kconfig
> @@ -64,6 +64,17 @@ config TYPEC_ANX7411
> If you choose to build this driver as a dynamically linked module, the
> module will be called anx7411.ko.
>  
> +config TYPEC_ANX7688
> + tristate "Analogix ANX7688 Type-C DRP Port controller and mux driver"
> + depends on I2C
> + depends on USB_ROLE_SWITCH
> + help
> +   Say Y or M here if your system has Analogix ANX7688 Type-C Bridge
> +   controller driver.
> +
> +   If you choose to build this driver as a dynamically linked module, the
> +   module will be called anx7688.ko.
> +
>  config TYPEC_RT1719
>   tristate "Richtek RT1719 Sink Only Type-C controller driver"
>   depends on USB_ROLE_SWITCH || !USB_ROLE_SWITCH
> diff --git a/drivers/usb/typec/Makefile b/drivers/usb/typec/Makefile
> index 7a368fea61bc..3f8ff94ad294 100644
> --- a/drivers/usb/typec/Makefile
> +++ b/drivers/usb/typec/Makefile
> @@ -7,6 +7,7 @@ obj-$(CONFIG_TYPEC_TCPM)  += tcpm/
>  obj-$(CONFIG_TYPEC_UCSI) += ucsi/
>  obj-$(CONFIG_TYPEC_TPS6598X) += tipd/
>  obj-$(CONFIG_TYPEC_ANX7411)  += anx7411.o
> +obj-$(CONFIG_TYPEC_ANX7688)  += anx7688.o
>  obj-$(CONFIG_TYPEC_HD3SS3220)+= hd3ss3220.o
>  obj-$(CONFIG_TYPEC_STUSB160X)+= stusb160x.o
>  obj-$(CONFIG_TYPEC_RT1719)   += rt1719.o
> diff --git a/drivers/usb/typec/anx7688.c b/drivers/usb/typec/anx7688.c
> new file mode 100644
> index ..14d033bbc0c7
> --- /dev/null
> +++ b/drivers/usb/typec/anx7688.c
> @@ -0,0 +1,1927 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ANX7688 USB-C HDMI bridge/PD driver
> + *
> + * Warning, this driver is somewhat PinePhone specific.
> + *
> + * How this works:
> + * - this driver allows to program firmware into ANX7688 EEPROM, and
> + *   initialize it
> + * - it then communicates with the firmware running on the OCM (on-chip
> + *   microcontroller)
> + * - it detects whether there is cable plugged in or not and powers
> + *   up or down the ANX7688 based on that
> + * - when the cable is connected the firmware on the OCM will handle
> + *   the detection of the nature of the device on the other end
> + *   of the USB-C cable
> + * - this driver then communicates with the USB phy to let it swap
> + *   data roles accordingly
> + * - it also enables VBUS and VCONN regulators as appropriate
> + * - USB phy driver (Allwinner) needs to know whether to switch to
> + *   device or host mode, or whether to turn off
> + * - when the firmware detects SRC.1.5A or SRC.3.0A via CC pins
> + *   or something else via PD, it notifies this driver via software
> + *   interrupt and this driver will determine how to update the TypeC
> + *   port status and what input current limit is appropriate
> + * - input current limit determination happens 500ms after cable
> + *   insertion or hard reset (delay is necessary to determine whether
> + *   the remote end is PD capable or not)
> + * - this driver tells to the PMIC driver that the input current limit
> + *   needs to be changed
> + * - this driver also monitors PMIC status and re-sets the input current
> + *   limit if it changes for some reason (due to PMIC internal decision
> + *   making) (this is disabled for now)
> + *
> + * ANX7688 FW behavior as observed:
> + *
> + * - DO NOT SET MORE THAN 1 SINK CAPABILITY! Firmware will ignore what
> + *   you set and send hardcoded PDO_BATT 5-21V 30W message!
> + *
> + * Product brief:
> + * 
> https://www.analogix.com/en/system/files/AA-002281-PB-6-ANX7688_Product_Brief_0.pdf
> + * Development notes:
> + * https://xnux.eu/devices/feature/anx7688.html
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/* firmware regs */
> +
> +#define ANX7688_REG_VBUS_OFF_DELAY_TIME 0x22
> +#define ANX7688_REG_FEATURE_CTRL0x27
> +#define ANX7688_REG_EEPROM_LOAD_STATUS1 0x11
> +#define ANX7688_REG_EEPROM_LOAD_STATUS0 0x12
> +#define ANX7688_REG_FW_VERSION1 0x15
> +#define ANX7688_REG_

Re: [PATCH 3/3] ARM: dts: qcom: Add Sony Xperia Z3 smartphone

2024-03-12 Thread Konrad Dybcio




On 3/10/24 12:41, Luca Weiss wrote:

Add the dts for the Xperia Z3 smartphone which is based on Sony's
shinano platform, so at the moment there's little device-specific dts to
add on top of the common parts.

Signed-off-by: Luca Weiss 
---


modulo missing Makefile, this looks good



Re: [PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Google
On Tue, 12 Mar 2024 09:19:20 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> If a reader of the ring buffer is doing a poll, and waiting for the ring
> buffer to hit a specific watermark, there could be a case where it gets
> into an infinite ping-pong loop.
> 
> The poll code has:
> 
>   rbwork->full_waiters_pending = true;
>   if (!cpu_buffer->shortest_full ||
>   cpu_buffer->shortest_full > full)
>  cpu_buffer->shortest_full = full;
> 
> The writer will see full_waiters_pending and check if the ring buffer is
> filled over the percentage of the shortest_full value. If it is, it calls
> an irq_work to wake up all the waiters.
> 
> But the code could get into a circular loop:
> 
>   CPU 0   CPU 1
>   -   -
>  [ Poll ]
>[ shortest_full = 0 ]
>rbwork->full_waiters_pending = true;
> if (rbwork->full_waiters_pending &&
> [ buffer percent ] > 
> shortest_full) {
>rbwork->wakeup_full = true;
>[ queue_irqwork ]

Oh, so `[ buffer percent ] > shortest_full` does not work because
if this happens in this order, shortest_full may be 0.

> 
>cpu_buffer->shortest_full = full;
> 
> [ IRQ work ]
> if (rbwork->wakeup_full) {
>   cpu_buffer->shortest_full = 0;
>   wakeup poll waiters;
>   [woken]
>if ([ buffer percent ] > full)
>   break;
>rbwork->full_waiters_pending = true;
> if (rbwork->full_waiters_pending &&
> [ buffer percent ] > 
> shortest_full) {
>rbwork->wakeup_full = true;
>[ queue_irqwork ]
> 
>cpu_buffer->shortest_full = full;
> 
> [ IRQ work ]
> if (rbwork->wakeup_full) {
>   cpu_buffer->shortest_full = 0;
>   wakeup poll waiters;
>   [woken]
> 
>  [ Wash, rinse, repeat! ]
> 
> In the poll, the shortest_full needs to be set before the
> full_pending_waiters, as once that is set, the writer will compare the
> current shortest_full (which is incorrect) to decide to call the irq_work,
> which will reset the shortest_full (expecting the readers to update it).
> 
> Also move the setting of full_waiters_pending after the check if the ring
> buffer has the required percentage filled. There's no reason to tell the
> writer to wake up waiters if there are no waiters.
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,


> Cc: sta...@vger.kernel.org
> Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 27 ---
>  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index aa332ace108b..adfe603a769b 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
> *buffer, int cpu,
>   poll_wait(filp, &rbwork->full_waiters, poll_table);
>  
>   raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> - rbwork->full_waiters_pending = true;
>   if (!cpu_buffer->shortest_full ||
>   cpu_buffer->shortest_full > full)
>   cpu_buffer->shortest_full = full;
>   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> - } else {
> - poll_wait(filp, &rbwork->waiters, poll_table);
> - rbwork->waiters_pending = true;
> + if (full_hit(buffer, cpu, full))
> + return EPOLLIN | EPOLLRDNORM;
> + /*
> +  * Only allow full_waiters_pending update to be seen after
> +  * the shortest_full is set. If the writer sees the
> +  * full_waiters_pending flag set, it will compare the
> +  * amount in the ring buffer to shortest_full. If the amount
> +  * in the ring buffer is greater than the shortest_full
> +  * percent, it will call the irq_work handler to wake up
> +  * this list. The irq_handler will reset shortest_full
> +  * back to zero. That's done under the reader_lock, but
> +  * the below smp_mb() makes sure that the update to
> +  * full_waiters_pending doesn't leak up into the above.
> +  */
> +   

[PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The WARN_ON() check in __assign_str() to catch where the source variable
to the macro doesn't match the source variable to __string() gives an
error in clang:

>> include/trace/events/sunrpc.h:703:4: warning: result of comparison against a 
>> string literal is unspecified (use an explicit string comparison function 
>> instead) [-Wstring-compare]
 670 | __assign_str(progname, "unknown");

That's because the __assign_str() macro has:

   WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);

Where "src" is a string literal. Clang warns when comparing a string
literal directly, as the result of such a pointer comparison is unspecified.

Since this is still to make sure the string that goes to __string() is the
same as the one passed to __assign_str(), detect when the source is a string
literal and use strcmp() in those cases.

Note that this depends on commit 51270d573a8d ("tracing/net_sched: Fix
tracepoints that save qdisc_dev() as a string") being applied, as this was
what found that bug.

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202402292111.kidexylu-...@intel.com/
Fixes: 433e1d88a3be ("tracing: Add warning if string in __assign_str() does not 
match __string()")
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage6_event_callback.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index a0c15f67eabe..83da83a0c14f 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,7 +35,9 @@
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
-   WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);   \
+   WARN_ON_ONCE(__builtin_constant_p(src) ?\
+strcmp((src), __data_offsets.dst##_ptr_) : \
+(src) != __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\
-- 
2.43.0




Re: [PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll

2024-03-12 Thread Steven Rostedt
On Wed, 13 Mar 2024 00:22:10 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 12 Mar 2024 09:19:20 -0400
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > If a reader of the ring buffer is doing a poll, and waiting for the ring
> > buffer to hit a specific watermark, there could be a case where it gets
> > into an infinite ping-pong loop.
> > 
> > The poll code has:
> > 
> >   rbwork->full_waiters_pending = true;
> >   if (!cpu_buffer->shortest_full ||
> >   cpu_buffer->shortest_full > full)
> >  cpu_buffer->shortest_full = full;
> > 
> > The writer will see full_waiters_pending and check if the ring buffer is
> > filled over the percentage of the shortest_full value. If it is, it calls
> > an irq_work to wake up all the waiters.
> > 
> > But the code could get into a circular loop:
> > 
> > CPU 0   CPU 1
> > -   -
> >  [ Poll ]
> >[ shortest_full = 0 ]
> >rbwork->full_waiters_pending = true;
> >   if (rbwork->full_waiters_pending &&
> >   [ buffer percent ] > 
> > shortest_full) {
> >  rbwork->wakeup_full = true;
> >  [ queue_irqwork ]  
> 
> Oh, so `[ buffer percent ] > shortest_full` does not work because
> if this happens in this order, shortest_full may be 0.

Exactly!

> 
> > 
> >cpu_buffer->shortest_full = full;
> > 
> >   [ IRQ work ]
> >   if (rbwork->wakeup_full) {
> > cpu_buffer->shortest_full = 0;

And here shortest_full gets set back to zero! (But that's not the bug).

> > wakeup poll waiters;
> >   [woken]
> >if ([ buffer percent ] > full)
> >   break;
> >rbwork->full_waiters_pending = true;

The bug is setting full_waiters_pending before updating the shortest_full.

> >   if (rbwork->full_waiters_pending &&
> >   [ buffer percent ] > 
> > shortest_full) {
> >  rbwork->wakeup_full = true;
> >  [ queue_irqwork ]
> > 
> >cpu_buffer->shortest_full = full;
> > 
> >   [ IRQ work ]
> >   if (rbwork->wakeup_full) {
> > cpu_buffer->shortest_full = 0;
> > wakeup poll waiters;
> >   [woken]
> > 
> >  [ Wash, rinse, repeat! ]
> > 
> > In the poll, the shortest_full needs to be set before the
> > full_pending_waiters, as once that is set, the writer will compare the
> > current shortest_full (which is incorrect) to decide to call the irq_work,
> > which will reset the shortest_full (expecting the readers to update it).
> > 
> > Also move the setting of full_waiters_pending after the check if the ring
> > buffer has the required percentage filled. There's no reason to tell the
> > writer to wake up waiters if there are no waiters.
> >   
> 
> Looks good to me.
> 
> Reviewed-by: Masami Hiramatsu (Google) 

Thanks!

I'm running it through my tests and when they finish, I'll be posting the
for-linus patches.

-- Steve



Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Tanmay Shah


On 3/12/24 7:10 AM, Krzysztof Kozlowski wrote:
> On 11/03/2024 19:39, Tanmay Shah wrote:
> >>> +
> >>> +else:
> >>> +  patternProperties:
> >>> +"^r5f@[0-9a-f]+$":
> >>> +  type: object
> >>> +
> >>> +  properties:
> >>> +reg:
> >>> +  minItems: 1
> >>> +  items:
> >>> +- description: ATCM internal memory
> >>> +- description: BTCM internal memory
> >>> +
> >>> +reg-names:
> >>> +  minItems: 1
> >>> +  items:
> >>> +- const: atcm0
> >>> +- const: btcm0
> >>> +
> >>> +power-domains:
> >>> +  maxItems: 3
> >>
> >> Please list power domains.
> > 
> > Hello,
> > 
> > Sent v13 addressing both comments.
> > 
>
> And gave me exactly two hours to disagree?

I am sorry, I thought I was addressing the right comments.

It was a minor change that tempted me to push a new revision; I will wait longer next time.

>
> Best regards,
> Krzysztof
>



Re: [PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Google
On Tue, 12 Mar 2024 09:19:21 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The check for knowing if the poll should wait or not is basically the
> exact same logic as rb_watermark_hit(). The only difference is that
> rb_watermark_hit() also handles the !full case. But for the full case, the
> logic is the same. Just call that instead of duplicating the code in
> ring_buffer_poll_wait().
> 

This changes a bit (e.g. adding pagebusy check) but basically that should
be there. And new version appears to be consistent between ring_buffer_wait()
and ring_buffer_poll_wait(). So looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 21 +++--
>  1 file changed, 7 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index adfe603a769b..857803e8cf07 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -959,25 +959,18 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer 
> *buffer, int cpu,
>   }
>  
>   if (full) {
> - unsigned long flags;
> -
>   poll_wait(filp, &rbwork->full_waiters, poll_table);
>  
> - raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> - if (!cpu_buffer->shortest_full ||
> - cpu_buffer->shortest_full > full)
> - cpu_buffer->shortest_full = full;
> - raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> - if (full_hit(buffer, cpu, full))
> + if (rb_watermark_hit(buffer, cpu, full))
>   return EPOLLIN | EPOLLRDNORM;
>   /*
>* Only allow full_waiters_pending update to be seen after
> -  * the shortest_full is set. If the writer sees the
> -  * full_waiters_pending flag set, it will compare the
> -  * amount in the ring buffer to shortest_full. If the amount
> -  * in the ring buffer is greater than the shortest_full
> -  * percent, it will call the irq_work handler to wake up
> -  * this list. The irq_handler will reset shortest_full
> +  * the shortest_full is set (in rb_watermark_hit). If the
> +  * writer sees the full_waiters_pending flag set, it will
> +  * compare the amount in the ring buffer to shortest_full.
> +  * If the amount in the ring buffer is greater than the
> +  * shortest_full percent, it will call the irq_work handler
> +  * to wake up this list. The irq_handler will reset 
> shortest_full
>* back to zero. That's done under the reader_lock, but
>* the below smp_mb() makes sure that the update to
>* full_waiters_pending doesn't leak up into the above.
> -- 
> 2.43.0
> 
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic

2024-03-12 Thread Steven Rostedt
On Wed, 13 Mar 2024 00:38:42 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 12 Mar 2024 09:19:21 -0400
> Steven Rostedt  wrote:
> 
> > From: "Steven Rostedt (Google)" 
> > 
> > The check for knowing if the poll should wait or not is basically the
> > exact same logic as rb_watermark_hit(). The only difference is that
> > rb_watermark_hit() also handles the !full case. But for the full case, the
> > logic is the same. Just call that instead of duplicating the code in
> > ring_buffer_poll_wait().
> >   
> 
> This changes a bit (e.g. adding pagebusy check) but basically that should
> be there. And new version appears to be consistent between ring_buffer_wait()
> and ring_buffer_poll_wait(). So looks good to me.

The pagebusy check is an optimization. If it is true, it means the
writer is still on the reader_page and there are no sub-buffers available. It
just prevents having to do the calculation of the buffer-percentage filled
(what's done by the full_hit() logic).

> 
> Reviewed-by: Masami Hiramatsu (Google) 
>

Thanks!

-- Steve



Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Tanmay Shah


On 3/12/24 7:10 AM, Krzysztof Kozlowski wrote:
> On 11/03/2024 17:27, Tanmay Shah wrote:
> >>> +then:
> >>> +  patternProperties:
> >>> +"^r5f@[0-9a-f]+$":
> >>> +  type: object
> >>> +
> >>> +  properties:
> >>> +reg:
> >>> +  minItems: 1
> >>> +  items:
> >>> +- description: ATCM internal memory
> >>> +- description: BTCM internal memory
> >>> +- description: extra ATCM memory in lockstep mode
> >>> +- description: extra BTCM memory in lockstep mode
> >>> +
> >>> +reg-names:
> >>> +  minItems: 1
> >>> +  items:
> >>> +- const: atcm0
> >>> +- const: btcm0
> >>> +- const: atcm1
> >>> +- const: btcm1
> >>
> >> Why power domains are flexible?
> > 
> > User may not want to use all the TCMs. For example, if users want to 
> > turn-on only TCM-A and rest of them want to keep off, then
> > 
> > they can avoid having power-domains of other TCMs in the device-tree. This 
> > helps with less power-consumption when needed.
> > 
> > Hence flexible list of power-domains list.
> > 
>
> Isn't turning on/off driver's job? Sorry, but what is "user" here? DTS
> describes bindings, not OS policy.

Thanks for reviews.

Correct, the driver turns the TCM on and off. However, system designers
(users) have the option to not include a TCM that is not needed in the
device tree, so the power-domains list is flexible, the same as reg and
reg-names. ATCM is always needed because the vector table is in ATCM; the
R5 core power domain and the ATCM power-domain for each core are always
required, hence minItems 2.



> Also, please wrap your replies to match email style.
>
> > I can certainly mention "items:" under power-domains property.
> > 
> > 
> >>
>
>
> Best regards,
> Krzysztof
>



[PATCH] ring-buffer: Do not set shortest_full when full target is hit

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The rb_watermark_hit() checks if the amount of data in the ring buffer is
above the percentage level passed in by the "full" variable. If it is, it
returns true.

But it also sets the "shortest_full" field of the cpu_buffer that informs
writers that it needs to call the irq_work if the amount of data on the
ring buffer is above the requested amount.

The rb_watermark_hit() always sets the shortest_full even if the amount in
the ring buffer is what it wants. As it is not going to wait, because it
has what it wants, there's no reason to set shortest_full.

Cc: sta...@vger.kernel.org
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9b887d44b8d9..350607cce869 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, 
int cpu, int full)
pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
ret = !pagebusy && full_hit(buffer, cpu, full);
 
-   if (!cpu_buffer->shortest_full ||
-   cpu_buffer->shortest_full > full)
-   cpu_buffer->shortest_full = full;
+   if (!ret && (!cpu_buffer->shortest_full ||
+cpu_buffer->shortest_full > full)) {
+   cpu_buffer->shortest_full = full;
+   }
raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
}
return ret;
-- 
2.43.0




Re: [PATCH] ring-buffer: Do not set shortest_full when full target is hit

2024-03-12 Thread Google
On Tue, 12 Mar 2024 11:56:41 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The rb_watermark_hit() checks if the amount of data in the ring buffer is
> above the percentage level passed in by the "full" variable. If it is, it
> returns true.
> 
> But it also sets the "shortest_full" field of the cpu_buffer that informs
> writers that it needs to call the irq_work if the amount of data on the
> ring buffer is above the requested amount.
> 
> The rb_watermark_hit() always sets the shortest_full even if the amount in
> the ring buffer is what it wants. As it is not going to wait, because it
> has what it wants, there's no reason to set shortest_full.
> 

Yeah, it should avoid setting if !ret. Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> Cc: sta...@vger.kernel.org
> Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 9b887d44b8d9..350607cce869 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer 
> *buffer, int cpu, int full)
>   pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
>   ret = !pagebusy && full_hit(buffer, cpu, full);
>  
> - if (!cpu_buffer->shortest_full ||
> - cpu_buffer->shortest_full > full)
> - cpu_buffer->shortest_full = full;
> + if (!ret && (!cpu_buffer->shortest_full ||
> +  cpu_buffer->shortest_full > full)) {
> + cpu_buffer->shortest_full = full;
> + }
>   raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>   }
>   return ret;
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-12 Thread Mathieu Poirier
On Mon, 11 Mar 2024 at 05:44, Abdellatif El Khlifi
 wrote:
>
> Hi Mathieu,
>
> On Fri, Mar 08, 2024 at 09:44:26AM -0700, Mathieu Poirier wrote:
> > On Thu, 7 Mar 2024 at 12:40, Abdellatif El Khlifi
> >  wrote:
> > >
> > > Hi Mathieu,
> > >
> > > > > +   do {
> > > > > +   state_reg = readl(priv->reset_cfg.state_reg);
> > > > > +   *rst_ack = EXTSYS_RST_ST_RST_ACK(state_reg);
> > > > > +
> > > > > +   if (*rst_ack == EXTSYS_RST_ACK_RESERVED) {
> > > > > +   dev_err(dev, "unexpected RST_ACK value: 0x%x\n",
> > > > > +   *rst_ack);
> > > > > +   return -EINVAL;
> > > > > +   }
> > > > > +
> > > > > +   /* expected ACK value read */
> > > > > +   if ((*rst_ack & exp_ack) || (*rst_ack == exp_ack))
> > > >
> > > > I'm not sure why the second condition in this if() statement is needed. 
> > > >  As far
> > > > as I can tell the first condition will trigger and the second one won't 
> > > > be
> > > > reached.
> > >
> > > The second condition takes care of the following: exp_ack and  *rst_ack 
> > > are both 0.
> > > This case happens when RST_REQ bit is cleared (meaning: No reset 
> > > requested) and
> > > we expect the RST_ACK to be 00 afterwards.
> > >
> >
> > This is the kind of conditions that definitely deserve documentation.
> > Please split the conditions in two different if() statements and add a
> > comment to explain what is going on.
>
> Thanks, I'll address that.
>
> >
> > > > > +/**
> > > > > + * arm_rproc_load() - Load firmware to memory function for rproc_ops
> > > > > + * @rproc: pointer to the remote processor object
> > > > > + * @fw: pointer to the firmware
> > > > > + *
> > > > > + * Does nothing currently.
> > > > > + *
> > > > > + * Return:
> > > > > + *
> > > > > + * 0 for success.
> > > > > + */
> > > > > +static int arm_rproc_load(struct rproc *rproc, const struct firmware 
> > > > > *fw)
> > > > > +{
> > > >
> > > > What is the point of doing rproc_of_parse_firmware() if the firmware 
> > > > image is
> > > > not loaded to memory?  Does the remote processor have some kind of 
> > > > default ROM
> > > > image to run if it doesn't find anything in memory?
> > >
> > > Yes, the remote processor has a default FW image already loaded by 
> > > default.
> > >
> >
> > That too would have mandated a comment - otherwise people looking at
> > the code are left wondering, as I did.
> >
> > > rproc_boot() [1] and _request_firmware() [2] fail if there is no FW file 
> > > in the filesystem or a filename
> > > provided.
> > >
> > > Please correct me if I'm wrong.
> >
> > You are correct, the remoteproc subsystem expects a firmware image to
> > be provided _and_ loaded into memory.  Providing a dummy image just to
> > get the remote processor booted is a hack, but simply because the
> > subsystem isn't tailored to handle this use case.  So I am left
> > wondering what the plans are for this driver, i.e is this a real
> > scenario that needs to be addressed or just an initial patchset to get
> > a foundation for the driver.
> >
> > In the former case we need to start talking about refactoring the
> > subsystem so that it properly handles remote processors that don't
> > need a firmware image.  In the latter case I'd rather see a patchset
> > where the firmware image is loaded into RAM.
>
> This is an initial patchset for allowing to turn on and off the remote 
> processor.
> The FW is already loaded before the Corstone-1000 SoC is powered on and this
> is done through the FPGA board bootloader in case of the FPGA target. Or by 
> the Corstone-1000 FVP model
> (emulator).
>

From the above I take it that booting with a preloaded firmware is a
scenario that needs to be supported and not just a temporary stage.

> The plan for the driver is as follows:
>
> Step 1: provide a foundation driver capable of turning the core on/off
>
> Step 2: provide mailbox support for comms
>
> Step 3: provide FW reload capability
>

What happens when a user wants to boot the remote processor with the
firmware provided on the file system rather than the one preloaded
into memory?  Furthermore, how do we account for scenarios where the
remote processor goes from running a firmware image on the file system
to the firmware image loaded by an external entity?  Is this a valid
scenario?

> Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) can 
> share memory with
> the remote core.
>
> I'm happy to provide more explanation in the commit log to reflect this 
> status.
>
> Is it OK that we go with step 1 as a foundation please ?
>

First let's clarify all the scenarios that need to be supported.  From
there I will advise on how to proceed and what modifications to the
subsystem's core should be made, if need be.

> Cheers
> Abdellatif



Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-12 Thread David Woodhouse
On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> On 08.03.24 13:33, David Woodhouse wrote:
> > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > (sometimes) I still don't actually know what the time is, because some
> > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > to provide such?
> > > > 
> > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > expose incomplete information.
> > > > 
> > > 
> > > Hi David,
> > > 
> > > (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> > > 
> > > thank you for your insightful comments. I think I take a broadly similar
> > > view. The reason why the current spec and driver is like this is that I
> > > took a pragmatic approach at first and only included features which work
> > > out-of-the-box for the current Linux ecosystem.
> > > 
> > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > can work out-of-the-box with time sync daemons such as chrony.
> > > 
> > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > as well, I am afraid that
> > > 
> > > - in some (embedded) scenarios, the TAI clock may not be available
> > > 
> > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > 
> > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > offset to each readout. I don't know user-space software which would
> > > leverage this already (at least not through the PTP clock interface).
> > > And why would such software not go straight for the TAI clock instead?
> > > 
> > > How about adding a requirement to the spec that the virtio-rtc device
> > > SHOULD expose the TAI clock whenever it is available - would this
> > > address your concerns?
> > 
> > I think that would be too easy for implementors to miss, or decide not
> > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > putting UTC in it.
> > 
> > I think I prefer to mandate the tai_offset field with the UTC clock.
> > Crappy implementations will just set it to zero, but at least that
> > gives a clear signal to the guests that it's *their* problem to
> > resolve.
> 
> To me there are some open questions regarding how this would work. Is there
> a use case for this with the v3 clock reading methods, or would it be
> enough to address this with the Virtio timekeeper?
> 
> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> best alongside some additional information about leap seconds. I am not
> aware about any user-space user. In addition, leap second smearing should
> also be addressed.
> 

Is there even a standard yet for leap-smearing? Will it be linear over
1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
think is what Google does? Meta does something different again, don't
they?

Exposing UTC as the only clock reference is bad enough; when leap
seconds happen there's a whole second during which you don't *know*
which second it is. It seems odd to me, for a precision clock to be
deliberately ambiguous about what the time is!

But if the virtio-rtc clock is defined as UTC and something *different* is
then exposed in it, that's even worse. You potentially end up providing
inaccurate time for a whole *day* leading up to the leap second.

I think you're right that leap second smearing should be addressed. At
the very least, by making it clear that the virtio-rtc clock which
advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
yet-to-be-defined variant.

Please make it explicit that any hypervisor which wants to advertise a
smeared clock shall define a new type which specifies the precise
smearing algorithm and cannot be conflated with the one you're defining
here.
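
To make the ambiguity concrete, here is a toy sketch of a linear smear in
the style of UTC-SLS (purely illustrative; not taken from the virtio-rtc
spec or from any implementation discussed here), spreading a positive leap
second over the 1000 seconds leading up to it:

	#include <stdio.h>

	/*
	 * Toy model: 'tai' and 'leap_tai' are seconds on a continuous TAI
	 * timeline, 'tai_utc' is the TAI-UTC offset in effect before the
	 * leap.  During the last 1000 s before the leap the smeared clock
	 * runs slow by 1/1000, ending exactly one second behind, which is
	 * where UTC lands after the (positive) leap second.
	 */
	static double smeared_utc(double tai, double leap_tai, int tai_utc)
	{
		double utc = tai - tai_utc;

		if (tai >= leap_tai)
			return utc - 1.0;
		if (tai >= leap_tai - 1000.0)
			utc -= (tai - (leap_tai - 1000.0)) / 1000.0;
		return utc;
	}

	int main(void)
	{
		/* Halfway through the window the smear is 0.5 s off true UTC. */
		printf("%.3f\n", smeared_utc(500.0, 1000.0, 37) - (500.0 - 37.0));
		return 0;
	}

Halfway through such a window a guest that treats the value as true UTC is
half a second out, which is exactly the conflation being warned about.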

> > One other thing to note is I think we're being very naïve about the TSC
> > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > different frequency, and even if they run at the same frequency they
> > might be offset from each other. I'm happy to be naïve but I think we
> > should be *explicitly* so, and just say for example that it's defined
> > against vCPU0 so if other vCPUs are different then all bets are off.
> 
> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> have an opinion on how to represent this in a platform-independent way.

Well, it doesn't have a notion of TSCs either; you include that by
implicit reference don't you?



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-12 Thread Abdellatif El Khlifi
Hi Mathieu,

On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > This is an initial patchset for allowing to turn on and off the remote 
> > processor.
> > The FW is already loaded before the Corstone-1000 SoC is powered on and this
> > is done through the FPGA board bootloader in case of the FPGA target. Or by 
> > the Corstone-1000 FVP model
> > (emulator).
> >
> From the above I take it that booting with a preloaded firmware is a
> scenario that needs to be supported and not just a temporary stage.

The current status of the Corstone-1000 SoC requires that there is
a preloaded firmware for the external core. Preloading is done externally
either through the FPGA bootloader or the emulator (FVP) before powering
on the SoC.

Corstone-1000 will be upgraded in a way that the A core running Linux is able
to share memory with the remote core and also to access the remote core's
memory, so Linux can copy the firmware to it. These HW changes are still
under development.

This is why this patchset is relying on a preloaded firmware, and it's step 1
of adding remoteproc support for Corstone.

When the HW is ready, we will be able to avoid preloading the firmware
and the user can do the following:

1) Use the default firmware filename stated in the DT (firmware-name property);
that's the one the remoteproc subsystem will use initially to load the firmware
file and start the remote core.

2) Then, the user can choose to use another firmware file:

echo stop >/sys/class/remoteproc/remoteproc0/state
echo -n new_firmware.elf > /sys/class/remoteproc/remoteproc0/firmware
echo start >/sys/class/remoteproc/remoteproc0/state

> > The plan for the driver is as follows:
> >
> > Step 1: provide a foundation driver capable of turning the core on/off
> >
> > Step 2: provide mailbox support for comms
> >
> > Step 3: provide FW reload capability
> >
> What happens when a user wants to boot the remote processor with the
> firmware provided on the file system rather than the one preloaded
> into memory?

We will support this scenario when the HW is upgraded and copying the firmware
to the remote core memory becomes possible.

> Furthermore, how do we account for scenarios where the
> remote processor goes from running a firmware image on the file system
> to the firmware image loaded by an external entity?  Is this a valid
> scenario?

No, this scenario won't apply when we get the HW upgrade. No need for an
external entity anymore. The firmware(s) will all be files in the Linux
filesystem.

> > Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) 
> > can share memory with
> > the remote core.
> >
> > I'm happy to provide more explanation in the commit log to reflect this 
> > status.
> >
> > Is it OK that we go with step 1 as a foundation please ?
> >
> 
> First let's clarify all the scenarios that need to be supported.  From
> there I will advise on how to proceed and what modifications to the
> subsystem's core should be made, if need be.

Thanks, I hope the answers above provide the information needed.

Cheers
Abdellatif



Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Tanmay Shah



On 3/12/24 7:13 AM, Krzysztof Kozlowski wrote:
> On 11/03/2024 18:59, Tanmay Shah wrote:
>> From: Radhey Shyam Pandey 
>> 
>> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
>> UltraScale+ platform. It will help in defining TCM in device-tree
>> and make its access platform agnostic and data-driven.
>> 
>> Tightly-coupled memories(TCMs) are low-latency memory that provides
>> predictable instruction execution and predictable data load/store
>> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
>> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
>> 
>> The TCM resources(reg, reg-names and power-domain) are documented for
>> each TCM in the R5 node. The reg and reg-names are made as required
>> properties as we don't want to hardcode TCM addresses for future
>> platforms and for zu+ legacy implementation will ensure that the
>> old dts w/o reg/reg-names works and stable ABI is maintained.
>> 
>> It also extends the examples for TCM split and lockstep modes.
>> 
>> Signed-off-by: Radhey Shyam Pandey 
>> Signed-off-by: Tanmay Shah 
>> ---
>> 
>> Changes in v13:
>>   - Have power-domains property for lockstep case instead of
>> keeping it flexible.
>>   - Add "items:" list in power-domains property
> 
> 
> Reviewed-by: Krzysztof Kozlowski 

Hi Krzysztof,

Thanks for the RB. I provided an explanation of the flexible power-domains in
the previous patchset. I am happy to send a new revision removing
minItems if you disagree.

Thanks.

> 
> Best regards,
> Krzysztof
> 




Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-12 Thread Krzysztof Kozlowski
On 12/03/2024 18:42, Tanmay Shah wrote:
> 
> 
> On 3/12/24 7:13 AM, Krzysztof Kozlowski wrote:
>> On 11/03/2024 18:59, Tanmay Shah wrote:
>>> From: Radhey Shyam Pandey 
>>>
>>> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
>>> UltraScale+ platform. It will help in defining TCM in device-tree
>>> and make its access platform agnostic and data-driven.
>>>
>>> Tightly-coupled memories(TCMs) are low-latency memory that provides
>>> predictable instruction execution and predictable data load/store
>>> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
>>> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
>>>
>>> The TCM resources(reg, reg-names and power-domain) are documented for
>>> each TCM in the R5 node. The reg and reg-names are made as required
>>> properties as we don't want to hardcode TCM addresses for future
>>> platforms and for zu+ legacy implementation will ensure that the
>>> old dts w/o reg/reg-names works and stable ABI is maintained.
>>>
>>> It also extends the examples for TCM split and lockstep modes.
>>>
>>> Signed-off-by: Radhey Shyam Pandey 
>>> Signed-off-by: Tanmay Shah 
>>> ---
>>>
>>> Changes in v13:
>>>   - Have power-domains property for lockstep case instead of
>>> keeping it flexible.
>>>   - Add "items:" list in power-domains property
>>
>>
>> Reviewed-by: Krzysztof Kozlowski 
> 
> Hi Krzysztof,
> 
> Thanks for RB. I provided explanation of flexible power-domains in
> previous patchset. I am happy to send new revision removing
> minItems if you dis-agree.

Thanks for the explanation, it sounds fine, thus patch LGTM.

Best regards,
Krzysztof




Re: [PATCH 1/3] ARM: dts: qcom: msm8974-sony-castor: Split into shinano-common

2024-03-12 Thread Konrad Dybcio




On 3/10/24 12:41, Luca Weiss wrote:

In preparation for adding the Sony Xperia Z3 smartphone, split the
common parts into shinano-common.dtsi.

No functional change intended.

Signed-off-by: Luca Weiss 
---


I give you the green light to also add newlines between subsequent
subnodes (e.g. under blsp2_i2c5) while at it ;)

Reviewed-by: Konrad Dybcio 

Konrad



Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem

2024-03-12 Thread Dan Williams
Vishal Verma wrote:
> The dax driver incorrectly used driver-core device locks to protect
> internal dax region and dax device configuration structures. Replace the
> device lock usage with a local rwsem, one each for dax region
> configuration and dax device configuration. As a result of this
> conversion, no device_lock() usage remains in dax/bus.c.
> 
> Cc: Dan Williams 
> Reported-by: Greg Kroah-Hartman 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 220 
> ++
>  1 file changed, 157 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 1ff1ab5fa105..cb148f74ceda 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -12,6 +12,18 @@
>  
>  static DEFINE_MUTEX(dax_bus_lock);
>  
> +/*
> + * All changes to the dax region configuration occur with this lock held
> + * for write.
> + */
> +DECLARE_RWSEM(dax_region_rwsem);
> +
> +/*
> + * All changes to the dax device configuration occur with this lock held
> + * for write.
> + */
> +DECLARE_RWSEM(dax_dev_rwsem);
> +
>  #define DAX_NAME_LEN 30
>  struct dax_id {
>   struct list_head list;
> @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
>   u64 size = 0;
>   int i;
>  
> - device_lock_assert(&dev_dax->dev);
> + WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));

Apologies for the late review, but...

All of these WARN_ON_ONCE() usages should be replaced with
lockdep_assert_held() and lockdep_assert_held_write() where appropriate.
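
For instance (a sketch of the suggested direction only, not a submitted
patch), the assertion quoted above would become something like:

	/* dev_dax_size() must run with dax_dev_rwsem held (read or write): */
	lockdep_assert_held(&dax_dev_rwsem);

	/* and writers to the region config would assert the write side: */
	lockdep_assert_held_write(&dax_region_rwsem);

lockdep_assert_held() accepts either read or write ownership of the rwsem,
lockdep_assert_held_write() requires the write side, and both compile away
when lockdep is disabled.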



Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem

2024-03-12 Thread Verma, Vishal L
On Tue, 2024-03-12 at 13:07 -0700, Dan Williams wrote:
> Vishal Verma wrote:
> > The dax driver incorrectly used driver-core device locks to protect
> > internal dax region and dax device configuration structures. Replace the
> > device lock usage with a local rwsem, one each for dax region
> > configuration and dax device configuration. As a result of this
> > conversion, no device_lock() usage remains in dax/bus.c.
> > 
> > Cc: Dan Williams 
> > Reported-by: Greg Kroah-Hartman 
> > Signed-off-by: Vishal Verma 
> > ---
> >  drivers/dax/bus.c | 220 
> > ++
> >  1 file changed, 157 insertions(+), 63 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 1ff1ab5fa105..cb148f74ceda 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -12,6 +12,18 @@
> >  
> >  static DEFINE_MUTEX(dax_bus_lock);
> >  
> > +/*
> > + * All changes to the dax region configuration occur with this lock held
> > + * for write.
> > + */
> > +DECLARE_RWSEM(dax_region_rwsem);
> > +
> > +/*
> > + * All changes to the dax device configuration occur with this lock held
> > + * for write.
> > + */
> > +DECLARE_RWSEM(dax_dev_rwsem);
> > +
> >  #define DAX_NAME_LEN 30
> >  struct dax_id {
> >     struct list_head list;
> > @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
> >     u64 size = 0;
> >     int i;
> >  
> > -   device_lock_assert(&dev_dax->dev);
> > +   WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem));
> 
> Apologies for the late review, but...
> 
> All of these WARN_ON_ONCE() usages should be replaced with
> lockdep_assert_held() and lockdep_assert_held_write() where appropriate.

Makes sense - I can send a patch post -rc1 to change these if that's okay 
Andrew?


Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem

2024-03-12 Thread Andrew Morton
On Tue, 12 Mar 2024 20:20:17 + "Verma, Vishal L"  
wrote:

> > All of these WARN_ON_ONCE() usages should be replaced with
> > lockdep_assert_held() and lockdep_assert_held_write() where appropriate.
> 
> Makes sense - I can send a patch post -rc1 to change these if that's okay 
> Andrew?

Please do.



[PATCH bpf-next 0/3] uprobes: two common case speed ups

2024-03-12 Thread Andrii Nakryiko
This patch set implements two speed ups for uprobe/uretprobe runtime execution
path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
system-wide (non-PID-specific) uprobes (patch #3). Please see individual
patches for details.

Given I haven't worked with uprobe code before, I'm unfamiliar with
conventions in this subsystem, including which kernel tree patches should be
sent to. For now I based all the changes on top of bpf-next/master, which is
where I tested and benchmarked everything anyways. Please advise what should
I use as a base for subsequent revision. Thanks.

Andrii Nakryiko (3):
  uprobes: encapsulate preparation of uprobe args buffer
  uprobes: prepare uprobe args buffer lazily
  uprobes: add speculative lockless system-wide uprobe filter check

 kernel/trace/trace_uprobe.c | 103 ++--
 1 file changed, 63 insertions(+), 40 deletions(-)

-- 
2.43.0




[PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-12 Thread Andrii Nakryiko
Move the logic of fetching temporary per-CPU uprobe buffer and storing
uprobes args into it to a new helper function. Store data size as part
of this buffer, simplifying interfaces a bit, as now we only pass single
uprobe_cpu_buffer reference around, instead of pointer + dsize.

This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher,
and now will be centralized. All this is also in preparation to make
this uprobe_cpu_buffer handling logic optional in the next patch.

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 75 -
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a84b85d8aac1..a0f60bb10158 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = {
 struct uprobe_cpu_buffer {
struct mutex mutex;
void *buf;
+   int dsize;
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
@@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer 
*ucb)
mutex_unlock(&ucb->mutex);
 }
 
+static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
+  struct pt_regs *regs)
+{
+   struct uprobe_cpu_buffer *ucb;
+   int dsize, esize;
+
+   esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
+   dsize = __get_data_size(&tu->tp, regs);
+
+   ucb = uprobe_buffer_get();
+   ucb->dsize = dsize;
+
+   store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
+
+   return ucb;
+}
+
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb, int dsize,
+   struct uprobe_cpu_buffer *ucb,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
@@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
WARN_ON(call != trace_file->event_call);
 
-   if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE))
+   if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE))
return;
 
if (trace_trigger_soft_disabled(trace_file))
return;
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-   size = esize + tu->tp.size + dsize;
+   size = esize + tu->tp.size + ucb->dsize;
entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
if (!entry)
return;
@@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
data = DATAOF_TRACE_ENTRY(entry, false);
}
 
-   memcpy(data, ucb->buf, tu->tp.size + dsize);
+   memcpy(data, ucb->buf, tu->tp.size + ucb->dsize);
 
trace_event_buffer_commit(&fbuffer);
 }
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
@@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
rcu_read_unlock();
 
return 0;
@@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, func, regs, ucb, link->file);
rcu_read_unlock();
 }
 
@@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
 
 static void __uprobe_perf_func(struct trace_uprobe *tu,
   unsigned long func, struct pt_regs *regs,
-  struct uprobe_cpu_buffer *ucb, int dsize)
+  struct uprobe_cpu_buffer *ucb)
 {
struct trace_event_call *call = trace_probe_event_call(&tu->tp);
struct uprobe_trace_entry_head *entry;
@@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
 
-   size = esize + tu->tp.size + d

[PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-12 Thread Andrii Nakryiko
uprobe_cpu_buffer and corresponding logic to store uprobe args into it
are used for uprobes/uretprobes that are created through tracefs or
perf events.

BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't
need uprobe_cpu_buffer and associated data. For BPF-only use cases this
buffer handling and preparation is a pure overhead. At the same time,
BPF-only uprobe/uretprobe usage is very common in practice. Also, for
a lot of cases applications are very sensitive to performance overheads,
as they might be tracing very high frequency functions like
malloc()/free(), so every bit of performance improvement matters.

All that is to say that this uprobe_cpu_buffer preparation is an
unnecessary overhead that each BPF user of uprobes/uretprobe has to pay.
This patch is changing this by making uprobe_cpu_buffer preparation
optional. It will happen only if either tracefs-based or perf event-based
uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For
BPF-only use cases this step will be skipped.

We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0])
to estimate the improvements. We have 3 uprobe and 3 uretprobe
scenarios, which vary the instruction that is replaced by the uprobe: nop
(fastest uprobe case), `push rbp` (typical case), and non-simulated
`ret` instruction (slowest case). Benchmark thread is constantly calling
user space function in a tight loop. User space function has attached
BPF uprobe or uretprobe program doing nothing but atomic counter
increments to count number of triggering calls. Benchmark emits
throughput in millions of executions per second.

BEFORE these changes

uprobe-nop :2.657 ± 0.024M/s
uprobe-push:2.499 ± 0.018M/s
uprobe-ret :1.100 ± 0.006M/s
uretprobe-nop  :1.356 ± 0.004M/s
uretprobe-push :1.317 ± 0.019M/s
uretprobe-ret  :0.785 ± 0.007M/s

AFTER these changes
===
uprobe-nop :2.732 ± 0.022M/s (+2.8%)
uprobe-push:2.621 ± 0.016M/s (+4.9%)
uprobe-ret :1.105 ± 0.007M/s (+0.5%)
uretprobe-nop  :1.396 ± 0.007M/s (+2.9%)
uretprobe-push :1.347 ± 0.008M/s (+2.3%)
uretprobe-ret  :0.800 ± 0.006M/s (+1.9)

So the improvements on this particular machine seem to be between 2% and 5%.

  [0] 
https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 56 ++---
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a0f60bb10158..f2875349d124 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -963,15 +963,22 @@ static struct uprobe_cpu_buffer 
*prepare_uprobe_buffer(struct trace_uprobe *tu,
 
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb,
+   struct uprobe_cpu_buffer **ucbp,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
struct trace_event_buffer fbuffer;
+   struct uprobe_cpu_buffer *ucb;
void *data;
int size, esize;
struct trace_event_call *call = trace_probe_event_call(&tu->tp);
 
+   ucb = *ucbp;
+   if (!ucb) {
+   ucb = prepare_uprobe_buffer(tu, regs);
+   *ucbp = ucb;
+   }
+
WARN_ON(call != trace_file->event_call);
 
if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE))
@@ -1002,7 +1009,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb)
+struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
 
@@ -1011,7 +1018,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucbp, link->file);
rcu_read_unlock();
 
return 0;
@@ -1019,13 +1026,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb)
+struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucb, link->file);
+   __uprobe_trace_func(tu, func, regs, ucbp, link->file);

[PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-12 Thread Andrii Nakryiko
It's very common with BPF-based uprobe/uretprobe use cases to have
system-wide (not PID-specific) probes. In this case the uprobe's
trace_uprobe_filter->nr_systemwide counter is bumped at registration
time, and the actual filtering is short-circuited at the time the
uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that, to even
get to checking this counter, the uprobe subsystem takes the read side of
trace_uprobe_filter->rwlock. This is actually noticeable in profiles and
is just another point of contention when the uprobe is triggered on
multiple CPUs simultaneously.

This patch adds a speculative check before grabbing that rwlock. If
nr_systemwide is non-zero, the lock is skipped and the event is passed
through. From examining the existing logic this looks correct and safe to
do. If nr_systemwide is being modified under the rwlock in parallel, we
have to consider basically just one important race condition: the case
when nr_systemwide is dropped from one to zero (from
trace_uprobe_filter_remove()) under filter->rwlock, but
uprobe_perf_filter() raced and saw it as >0.

In this case, we'll proceed with uprobe/uretprobe execution, while
uprobe_perf_close() and uprobe_apply() will be blocked trying to grab
uprobe->register_rwsem as a writer. They will be blocked because
uprobe_dispatcher() (and, similarly, uretprobe_dispatcher()) runs with
uprobe->register_rwsem taken as a reader. So there is no real race beyond
the fact that the uprobe/uretprobe might execute one last time before it's
removed, which is fine because, from the user space perspective, the
uprobe/uretprobe hasn't yet been deactivated.

In case we speculatively read nr_systemwide as zero while it was
incremented in parallel, we'll proceed to grabbing filter->rwlock and
re-doing the check, this time in a lock-protected and non-racy way.

As such, it looks safe to do a quick short-circuiting check and save some
performance in the very common system-wide case, without sacrificing hot
path performance for the much rarer possibility of registration or
unregistration of uprobes.
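
A self-contained user-space sketch of the speculative-check pattern,
using a pthreads rwlock and a C11 relaxed atomic load as stand-ins for
the kernel's rwlock and READ_ONCE() (hypothetical struct filter and
perf_filter() names; not the kernel code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct filter {
	pthread_rwlock_t rwlock;
	atomic_int nr_systemwide;
};

/* Stand-in for __uprobe_perf_filter(): the per-PID walk would go here. */
static bool slow_path_filter(struct filter *f)
{
	return atomic_load(&f->nr_systemwide) > 0;
}

static bool perf_filter(struct filter *f)
{
	bool ret;

	/* speculative check: common system-wide case, no lock taken */
	if (atomic_load_explicit(&f->nr_systemwide, memory_order_relaxed))
		return true;

	/* rare case: re-do the check under the rwlock, non-racy */
	pthread_rwlock_rdlock(&f->rwlock);
	ret = slow_path_filter(f);
	pthread_rwlock_unlock(&f->rwlock);
	return ret;
}

int main(void)
{
	struct filter f = { .rwlock = PTHREAD_RWLOCK_INITIALIZER };

	atomic_store(&f.nr_systemwide, 1);
	printf("system-wide: %d\n", perf_filter(&f));	/* fast path, prints 1 */

	atomic_store(&f.nr_systemwide, 0);
	printf("filtered:    %d\n", perf_filter(&f));	/* slow path, prints 0 */
	return 0;
}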

Again, this is confirmed with the BPF selftests-based benchmarks.

BEFORE (based on changes in previous patch)
===========================================
uprobe-nop :2.732 ± 0.022M/s
uprobe-push:2.621 ± 0.016M/s
uprobe-ret :1.105 ± 0.007M/s
uretprobe-nop  :1.396 ± 0.007M/s
uretprobe-push :1.347 ± 0.008M/s
uretprobe-ret  :0.800 ± 0.006M/s

AFTER
=====
uprobe-nop :2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push:2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret :1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop  :1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push :1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret  :0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, the first percentage value is relative to the previous
patch (lazy uprobe buffer optimization), while the "total" percentage is
relative to the kernel without any of the changes in this patch set.

As can be seen, we get about a 4% - 10% speedup, in total, with both the
lazy uprobe buffer and speculative filter check optimizations.

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f2875349d124..be28e6d0578e 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
tu = container_of(uc, struct trace_uprobe, consumer);
filter = tu->tp.event->filter;
 
+   /* speculative check */
+   if (READ_ONCE(filter->nr_systemwide))
+   return true;
+
read_lock(&filter->rwlock);
ret = __uprobe_perf_filter(filter, mm);
read_unlock(&filter->rwlock);
-- 
2.43.0




[PATCH v4] ring-buffer: Have mmapped ring buffer keep track of missed events

2024-03-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

While testing libtracefs on the mmapped ring buffer, the test that checks
if missed events are accounted for failed when using the mapped buffer.
This is because the mapped page does not update the missed events that
were dropped because the writer filled up the ring buffer before the
reader could catch up.

Add the missed events to the reader page/sub-buffer when the IOCTL is done
and a new reader page is acquired.

Note that all accesses to the reader_page via rb_page_commit() had to be
switched to rb_page_size(); rb_page_size() was just a copy of
rb_page_commit(), but it now masks out the RB_MISSED bits. This is needed
because the mapped reader page is still active in the ring buffer code,
and wherever it reads the commit field of the bpage for the size, it must
now mask it, otherwise the missed bits that are now set will corrupt the
size returned.
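
A small user-space sketch of that masking, reusing the RB_MISSED_* bit
layout implied by the patch below but with otherwise hypothetical names
(not the actual ring buffer code):

#include <stdio.h>

#define RB_MISSED_EVENTS	(1u << 31)	/* events were missed */
#define RB_MISSED_STORED	(1u << 30)	/* missed count stored at end */
#define RB_MISSED_MASK		(3u << 30)

/* Equivalent of rb_page_size(): strip the missed-event flag bits. */
static unsigned int page_size(unsigned int commit)
{
	return commit & ~RB_MISSED_MASK;
}

int main(void)
{
	unsigned int commit = 4000;		/* bytes committed on the page */

	commit |= RB_MISSED_EVENTS;		/* writer flags dropped events */

	printf("raw commit: %#x\n", commit);	/* unusable as a size now */
	printf("size:       %u\n", page_size(commit));
	return 0;
}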

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v3: 
https://lore.kernel.org/linux-trace-kernel/20240109212702.5b8e6...@gandalf.local.home

- Almost forgot to add this, but one of my libtracefs tests failed due to it.

- Rebased against the latest code.

- Removed the zeroing out at the end of the page as the ring buffer is now
  zeroed out when allocated. Not too worried about stale tracing data, but
  having non-tracing data would be a problem.

 kernel/trace/ring_buffer.c | 53 +-
 1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 13250856a3a8..abe21c47fb49 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -313,6 +313,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event *event)
 /* Missed count stored at end */
 #define RB_MISSED_STORED   (1 << 30)
 
+#define RB_MISSED_MASK (3 << 30)
+
 struct buffer_data_page {
u64  time_stamp;/* page time stamp */
local_t  commit;/* write committed index */
@@ -2274,7 +2276,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 /* Size is determined by what has been committed */
 static __always_inline unsigned rb_page_size(struct buffer_page *bpage)
 {
-   return rb_page_commit(bpage);
+   return rb_page_commit(bpage) & ~RB_MISSED_MASK;
 }
 
 static __always_inline unsigned
@@ -3901,7 +3903,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer)
return true;
 
/* Reader should exhaust content in reader page */
-   if (reader->read != rb_page_commit(reader))
+   if (reader->read != rb_page_size(reader))
return false;
 
/*
@@ -4372,7 +4374,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter)
return ((iter->head_page == commit_page && iter->head >= commit) ||
(iter->head_page == reader && commit_page == head_page &&
 head_page->read == commit &&
-iter->head == rb_page_commit(cpu_buffer->reader_page)));
+iter->head == rb_page_size(cpu_buffer->reader_page)));
 }
 EXPORT_SYMBOL_GPL(ring_buffer_iter_empty);
 
@@ -5701,7 +5703,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
event = rb_reader_event(cpu_buffer);
 
read = reader->read;
-   commit = rb_page_commit(reader);
+   commit = rb_page_size(reader);
 
/* Check if any events were dropped */
missed_events = cpu_buffer->lost_events;
@@ -5778,7 +5780,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
} else {
/* update the entry counter */
cpu_buffer->read += rb_page_entries(reader);
-   cpu_buffer->read_bytes += rb_page_commit(reader);
+   cpu_buffer->read_bytes += rb_page_size(reader);
 
/* swap the pages */
rb_init_page(bpage);
@@ -6331,6 +6333,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
 int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *reader;
+   unsigned long missed_events;
unsigned long reader_size;
unsigned long flags;
 
@@ -6356,9 +6360,46 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu)
goto out;
}
 
-   if (WARN_ON(!rb_get_reader_page(cpu_buffer)))
+   reader = rb_get_reader_page(cpu_buffer);
+   if (WARN_ON(!reader))
goto out;
 
+   /* Check if any events were dropped */
+   missed_events = cpu_buffer->lost_events;
+
+   if (cpu_buffer->reader_page != cpu_buffer->commit_page) {
+   if (missed_events) {
+   struct buffer_data_page *bpage = reader->page;
+   unsigned int commit;
+   /*
+* Use the real_end for the data size,
+* This gives us a chance to

Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-03-12 Thread Luis Chamberlain
On Tue, Feb 13, 2024 at 02:10:45PM +0300, Andrew Kanner wrote:
> On Thu, Feb 01, 2024 at 10:13:54AM -0800, Luis Chamberlain wrote:
> > 
> > While you're at it, if you want to try it, you could see if you can
> > improve the situation more by looking at symbol_get() users that remain
> > and seeing if you can instead fix it with proper Kconfig dependency and
> > at build time. Then we can just remove it as well.
> > 
> >   Luis
> 
> Sorry for the late reply.
> 
> Luis, can you give more details of your idea? I re-read it once, then
> came back and still don't understand.
> 
> I see that there are ~10 users for symbol_get() currently. Do you want
> to stringify symbol names at build time to completely remove
> symbol_get() from module.h? Correct me if I'm wrong since using a
> function which is not declared anywhere sounds confusing.

As an example, look at the code and see if there's a sensible way to make
some calls built-in instead of part of the module; then the module can
have a Kconfig builtin option that adds to the built-in code, which
means you don't need the symbol_get().
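
A rough user-space sketch of that idea, with a hypothetical
CONFIG_FOO_HELPER symbol standing in for a real Kconfig option (an
assumption for illustration, not an existing config): when the helper is
built in, the call is resolved at build time and no runtime
symbol_get()-style lookup is needed.

#include <stdio.h>

#ifdef CONFIG_FOO_HELPER
/* Built-in implementation, selected via the (hypothetical) Kconfig option. */
static int foo_helper(int x)
{
	return x * 2;
}
#else
/* Stub used when the helper is not built in. */
static inline int foo_helper(int x)
{
	(void)x;
	return -1;
}
#endif

int main(void)
{
	/* Built with -DCONFIG_FOO_HELPER this prints 84, otherwise -1. */
	printf("%d\n", foo_helper(42));
	return 0;
}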

For some other pieces of code it may require other strategies.

  Luis



Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-03-12 Thread Luis Chamberlain
On Tue, Mar 12, 2024 at 03:25:27PM -0700, Luis Chamberlain wrote:
> On Tue, Feb 13, 2024 at 02:10:45PM +0300, Andrew Kanner wrote:
> > On Thu, Feb 01, 2024 at 10:13:54AM -0800, Luis Chamberlain wrote:
> > > 
> > > While you're at it, if you want to try it, you could see if you can
> > > improve the situation more by looking at symbol_get() users that remain
> > > and seeing if you can instead fix it with proper Kconfig dependency and
> > > at build time. Then we can just remove it as well.
> > > 
> > >   Luis
> > 
> > Sorry for the late reply.
> > 
> > Luis, can you give more details of your idea? I re-read it once, then
> > came back and still don't understand.
> > 
> > I see that there are ~10 users for symbol_get() currently. Do you want
> > to stringify symbol names at build time to completely remove
> > symbol_get() from module.h? Correct me if I'm wrong since using a
> > function which is not declared anywhere sounds confusing.
> 
> As an example, look at the code and see if there's a sensible way to make
> some calls built-in instead of part of the module; then the module can
> have a Kconfig builtin option that adds to the built-in code, which
> means you don't need the symbol_get().
> 
> For some other pieces of code it may require other strategies.

An example is FW_LOADER_USER_HELPER, which is bool only and is selected
by users. It didn't use symbol_get() before; however, it's an example of
how, through Kconfig, you can align requirements and define built-in
components, even if they do come from a module.

  Luis



[GIT PULL] Modules changes for v6.9-rc1

2024-03-12 Thread Luis Chamberlain
The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:

  Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/modules-6.9-rc1

for you to fetch changes up to d1909c0221739356f31c721de4743e7d219a56cc:

  module: Don't ignore errors from set_memory_XX() (2024-02-16 11:30:43 -0800)


Modules changes for v6.9-rc1

Christophe Leroy did most of the work on this release, first with a few
cleanups on CONFIG_STRICT_KERNEL_RWX and ending with error handling for
when set_memory_XX() can fail. This is part of a larger effort to clean
up all these callers which can fail; modules is just part of it.

This has been sitting on linux-next for about a month without issues.


Christophe Leroy (6):
  module: Use set_memory_rox()
  module: Change module_enable_{nx/x/ro}() to more explicit names
  init: Declare rodata_enabled and mark_rodata_ro() at all time
  modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled
  powerpc: Simplify strict_kernel_rwx_enabled()
  module: Don't ignore errors from set_memory_XX()

Randy Dunlap (1):
  lib/test_kmod: fix kernel-doc warnings

 arch/powerpc/include/asm/mmu.h |  9 +-
 include/linux/init.h   |  4 ---
 init/main.c| 21 +-
 kernel/module/internal.h   |  6 ++--
 kernel/module/main.c   | 20 +++---
 kernel/module/strict_rwx.c | 63 +++---
 lib/test_kmod.c|  6 +++-
 7 files changed, 73 insertions(+), 56 deletions(-)