Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info
"Ho-Ren (Jack) Chuang" writes: > The current implementation treats emulated memory devices, such as > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory > (E820_TYPE_RAM). However, these emulated devices have different > characteristics than traditional DRAM, making it important to > distinguish them. Thus, we modify the tiered memory initialization process > to introduce a delay specifically for CPUless NUMA nodes. This delay > ensures that the memory tier initialization for these nodes is deferred > until HMAT information is obtained during the boot process. Finally, > demotion tables are recalculated at the end. > > * Abstract common functions into `find_alloc_memory_type()` We should move kmem_put_memory_types() (renamed to mt_put_memory_types()?) too. This can be put in a separate patch. > Since different memory devices require finding or allocating a memory type, > these common steps are abstracted into a single function, > `find_alloc_memory_type()`, enhancing code scalability and conciseness. > > * Handle cases where there is no HMAT when creating memory tiers > There is a scenario where a CPUless node does not provide HMAT information. > If no HMAT is specified, it falls back to using the default DRAM tier. > > * Change adist calculation code to use another new lock, mt_perf_lock. > In the current implementation, iterating through CPUlist nodes requires > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > trying to acquire the same lock, leading to a potential deadlock. > Therefore, we propose introducing a standalone `mt_perf_lock` to protect > `default_dram_perf`. This approach not only avoids deadlock but also > prevents holding a large lock simultaneously. > > Signed-off-by: Ho-Ren (Jack) Chuang > Signed-off-by: Hao Xiang > --- > drivers/acpi/numa/hmat.c | 11 ++ > drivers/dax/kmem.c | 13 +-- > include/linux/acpi.h | 6 > include/linux/memory-tiers.h | 8 + > mm/memory-tiers.c| 70 +--- > 5 files changed, 92 insertions(+), 16 deletions(-) > > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c > index d6b85f0f6082..28812ec2c793 100644 > --- a/drivers/acpi/numa/hmat.c > +++ b/drivers/acpi/numa/hmat.c > @@ -38,6 +38,8 @@ static LIST_HEAD(targets); > static LIST_HEAD(initiators); > static LIST_HEAD(localities); > > +static LIST_HEAD(hmat_memory_types); > + HMAT isn't a device driver for some memory devices. So I don't think we should manage memory types in HMAT. Instead, if the memory_type of a node isn't set by the driver, we should manage it in memory-tier.c as fallback. > static DEFINE_MUTEX(target_lock); > > /* > @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid, > } > EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL); > > +struct memory_dev_type *hmat_find_alloc_memory_type(int adist) > +{ > + return find_alloc_memory_type(adist, &hmat_memory_types); > +} > +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type); > + > static __init void alloc_memory_initiator(unsigned int cpu_pxm) > { > struct memory_initiator *initiator; > @@ -1038,6 +1046,9 @@ static __init int hmat_init(void) > if (!hmat_set_default_dram_perf()) > register_mt_adistance_algorithm(&hmat_adist_nb); > > + /* Post-create CPUless memory tiers after getting HMAT info */ > + memory_tier_late_init(); > + This should be called in memory-tier.c via late_initcall(memory_tier_late_init); Then, we don't need hmat to call it. 
> return 0; > out_put: > hmat_free_structures(); > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 42ee360cf4e3..aee17ab59f4f 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types); > > static struct memory_dev_type *kmem_find_alloc_memory_type(int adist) > { > - bool found = false; > struct memory_dev_type *mtype; > > mutex_lock(&kmem_memory_type_lock); > - list_for_each_entry(mtype, &kmem_memory_types, list) { > - if (mtype->adistance == adist) { > - found = true; > - break; > - } > - } > - if (!found) { > - mtype = alloc_memory_type(adist); > - if (!IS_ERR(mtype)) > - list_add(&mtype->list, &kmem_memory_types); > - } > + mtype = find_alloc_memory_type(adist, &kmem_memory_types); > mutex_unlock(&kmem_memory_type_lock); > > return mtype; > diff --git a/include/linux/acpi.h b/include/linux/acpi.h > index b7165e52b3c6..3f927ff01f02 100644 > --- a/include/linux/acpi.h > +++ b/include/linux/acpi.h > @@ -434,12 +434,18 @@ int thermal_acpi_critical_trip_temp(struct acpi_device > *adev, int *ret_temp); > > #ifdef CONFIG_ACPI_HMAT > int acpi_get_genport_coordinates(u32 uid, struct access_coordinate *coord); > +struct memory_dev_type *hmat_fi
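
The open-coded loop removed from kmem.c above is the pattern being factored out. A minimal sketch of the shared helper, going by the names in the cover letter (illustrative only, not necessarily what the final patch looks like):

static struct memory_dev_type *find_alloc_memory_type(int adist,
                                                       struct list_head *memory_types)
{
        struct memory_dev_type *mtype;

        /* Reuse an existing type with the same abstract distance, if any */
        list_for_each_entry(mtype, memory_types, list)
                if (mtype->adistance == adist)
                        return mtype;

        /* Otherwise allocate a new one and track it on the caller's list */
        mtype = alloc_memory_type(adist);
        if (!IS_ERR(mtype))
                list_add(&mtype->list, memory_types);

        return mtype;
}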
Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
On 3/11/24 17:05, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
>> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
>>> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
>>>> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
>>>>> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
>>>>>> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
>>>>
>>>> Summary
>>>>
>>>> In my (non-vhost experience) opinion the way to go would be either
>>>> replacing the cond_resched with a hard schedule or setting the
>>>> need_resched flag within vhost if a data transfer was successfully
>>>> initiated. It will be necessary to check if this causes problems with
>>>> other workloads/benchmarks.
>>>
>>> Yes but conceptually I am still in the dark on whether the fact that
>>> periodically invoking cond_resched is no longer sufficient to be nice to
>>> others is a bug, or intentional. So you feel it is intentional?
>>
>> I would assume that cond_resched is still a valid concept.
>> But, in this particular scenario we have the following problem:
>>
>> So far (with CFS) we had:
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> 3. CFS gives priority to woken up task and schedules it
>> 4. kworker runs
>>
>> Now (with EEVDF) we have:
>> 0. In some cases, kworker has accumulated negative lag
>> 1. vhost initiates data transfer
>> 2. kworker is woken up
>> -3a. EEVDF does not schedule kworker if it has negative lag
>> -4a. vhost continues running, kworker on same CPU starves
>> --
>> -3b. EEVDF schedules kworker if it has positive or no lag
>> -4b. kworker runs
>>
>> In the 3a/4a case, the kworker is given no chance to set the
>> necessary flag. The flag can only be set by another CPU now.
>> The schedule of the kworker was not caused by cond_resched, but
>> rather by the wakeup path of the scheduler.
>>
>> cond_resched works successfully once the load balancer (I suppose)
>> decides to migrate the vhost off to another CPU. In that case, the
>> load balancer on another CPU sets that flag and we are good.
>> That then eventually allows the scheduler to pick kworker, but very
>> late.
>
> Are we going anywhere with this btw?
>

I think Tobias had a couple other threads related to this, with other
potential fixes:

https://lore.kernel.org/lkml/20240228161018.14253-1-husc...@linux.ibm.com/
https://lore.kernel.org/lkml/20240228161023.14310-1-husc...@linux.ibm.com/
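
To make the second suggestion above a bit more concrete, here is a rough, untested sketch of "setting the need_resched flag within vhost"; "made_progress" is a stand-in for "a data transfer was successfully initiated" and the real check would live somewhere in vhost's worker loop:

static void vhost_maybe_resched(bool made_progress)
{
        if (!made_progress)
                return;

        /*
         * Either give up the CPU outright here (a hard schedule), or, as
         * below, only flag that a reschedule is wanted so the woken
         * kworker on this CPU gets picked up at the next preemption
         * point even if EEVDF left it with negative lag.
         */
        set_tsk_need_resched(current);
}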
Re: [PATCHv2 1/2] dt-bindings: usb: typec: anx7688: start a binding document
On 11/03/2024 22:22, Pavel Machek wrote:
> Hi!
>
>> Add binding for anx7688 usb type-c bridge. I don't have a datasheet,
>> but I did best I could.
>>
>> Signed-off-by: Pavel Machek
>
> Any more comments here? Automatic system told me I need to replace one
> character...

Well, I often do not review patches that have a build failure, so were
never compiled. In terms of bindings, `dt_binding_check` is like a test
compilation of C: just as no one reviews unbuildable C code, I don't
usually look at "unbuildable" bindings.

Plus, the original v2 patch went into some other email thread because of
its "References" field, so it somewhat disappeared from my inbox.

Best regards,
Krzysztof
[PATCH 1/2] ring-buffer: Fix full_waiters_pending in poll
From: "Steven Rostedt (Google)" If a reader of the ring buffer is doing a poll, and waiting for the ring buffer to hit a specific watermark, there could be a case where it gets into an infinite ping-pong loop. The poll code has: rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; The writer will see full_waiters_pending and check if the ring buffer is filled over the percentage of the shortest_full value. If it is, it calls an irq_work to wake up all the waiters. But the code could get into a circular loop: CPU 0 CPU 1 - - [ Poll ] [ shortest_full = 0 ] rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] if ([ buffer percent ] > full) break; rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] [ Wash, rinse, repeat! ] In the poll, the shortest_full needs to be set before the full_pending_waiters, as once that is set, the writer will compare the current shortest_full (which is incorrect) to decide to call the irq_work, which will reset the shortest_full (expecting the readers to update it). Also move the setting of full_waiters_pending after the check if the ring buffer has the required percentage filled. There's no reason to tell the writer to wake up waiters if there are no waiters. Cc: sta...@vger.kernel.org Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 27 --- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index aa332ace108b..adfe603a769b 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, poll_wait(filp, &rbwork->full_waiters, poll_table); raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); - rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - } else { - poll_wait(filp, &rbwork->waiters, poll_table); - rbwork->waiters_pending = true; + if (full_hit(buffer, cpu, full)) + return EPOLLIN | EPOLLRDNORM; + /* +* Only allow full_waiters_pending update to be seen after +* the shortest_full is set. If the writer sees the +* full_waiters_pending flag set, it will compare the +* amount in the ring buffer to shortest_full. If the amount +* in the ring buffer is greater than the shortest_full +* percent, it will call the irq_work handler to wake up +* this list. The irq_handler will reset shortest_full +* back to zero. That's done under the reader_lock, but +* the below smp_mb() makes sure that the update to +* full_waiters_pending doesn't leak up into the above. 
+*/ + smp_mb(); + rbwork->full_waiters_pending = true; + return 0; } + poll_wait(filp, &rbwork->waiters, poll_table); + rbwork->waiters_pending = true; + /* * There's a tight race between setting the waiters_pending and * checking if the ring buffer is empty. Once the waiters_pending bit @@ -989,9 +1005,
[PATCH 0/2] ring-buffer: Fix poll wakeup logic
After making a slight change to wakeups in ring_buffer_wait() the system would hang. Spending several hours going on a wild goose chase I found that the change only triggered the bug because it changed the timings. The bug was there before the update but never was triggered. The poll code has: rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; The writer will see full_waiters_pending and check if the ring buffer is filled over the percentage of the shortest_full value. If it is, it calls an irq_work to wake up all the waiters. But the code could get into a circular loop: CPU 0 CPU 1 - - [ Poll ] [ shortest_full = 0 ] rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] if ([ buffer percent ] > full) break; rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] [ Wash, rinse, repeat! ] The race was triggered when running: trace-cmd record -p function -m 5000 Which enables function tracing and then creates two files it is writing into where each is 2500K in size. The -m is a "max file size". When trace-cmd writes 2500K to one file it then switches to the other, erasing the old data. To do this, trace-cmd switches between both poll and the reader using both methods of wake up. The change to the reader wakeup was able to change the way the poll was woken to trigger this bug. The second patch is a clean up and also a way to consolidate the logic of the shortest_full. The read wakeup uses rb_watermark_hit for both full wakeups and !full wakeups. But since poll uses the same logic for full wakeups it can just call that function with full set. Steven Rostedt (Google) (2): ring-buffer: Fix full_waiters_pending in poll ring-buffer: Reuse rb_watermark_hit() for the poll logic kernel/trace/ring_buffer.c | 30 +++--- 1 file changed, 19 insertions(+), 11 deletions(-)
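
For context, the writer-side check referred to above (the "[ buffer percent ] > shortest_full" step) lives in the writer's wakeup path in ring_buffer.c. A paraphrased, simplified sketch of that behavior (not the verbatim kernel code):

/* Called by the writer after committing an event */
static void rb_writer_wakeup_sketch(struct trace_buffer *buffer,
                                    struct ring_buffer_per_cpu *cpu_buffer)
{
        struct rb_irq_work *rbwork = &cpu_buffer->irq_work;

        if (rbwork->full_waiters_pending &&
            full_hit(buffer, cpu_buffer->cpu, cpu_buffer->shortest_full)) {
                /* Wake all "full" waiters from irq_work context */
                rbwork->wakeup_full = true;
                rbwork->full_waiters_pending = false;
                irq_work_queue(&rbwork->work);
        }
}

The irq_work handler then resets shortest_full to zero, which is exactly what lets the ping-pong in the diagram keep going when the reader advertises full_waiters_pending before updating shortest_full.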
[PATCH 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic
From: "Steven Rostedt (Google)" The check for knowing if the poll should wait or not is basically the exact same logic as rb_watermark_hit(). The only difference is that rb_watermark_hit() also handles the !full case. But for the full case, the logic is the same. Just call that instead of duplicating the code in ring_buffer_poll_wait(). Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index adfe603a769b..6ef763f57c66 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -963,21 +963,16 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, poll_wait(filp, &rbwork->full_waiters, poll_table); - raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); - if (!cpu_buffer->shortest_full || - cpu_buffer->shortest_full > full) - cpu_buffer->shortest_full = full; - raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - if (full_hit(buffer, cpu, full)) + if (rb_watermark_hit(buffer, cpu, full)) return EPOLLIN | EPOLLRDNORM; /* * Only allow full_waiters_pending update to be seen after -* the shortest_full is set. If the writer sees the -* full_waiters_pending flag set, it will compare the -* amount in the ring buffer to shortest_full. If the amount -* in the ring buffer is greater than the shortest_full -* percent, it will call the irq_work handler to wake up -* this list. The irq_handler will reset shortest_full +* the shortest_full is set (in rb_watermark_hit). If the +* writer sees the full_waiters_pending flag set, it will +* compare the amount in the ring buffer to shortest_full. +* If the amount in the ring buffer is greater than the +* shortest_full percent, it will call the irq_work handler +* to wake up this list. The irq_handler will reset shortest_full * back to zero. That's done under the reader_lock, but * the below smp_mb() makes sure that the update to * full_waiters_pending doesn't leak up into the above. -- 2.43.0
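
For readers following along, rb_watermark_hit() was introduced by a recently merged ring-buffer fix and is not shown in this thread. Roughly, it behaves as below (paraphrased for context; details may differ from the merged version), which is why the poll path can drop its own shortest_full handling:

static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full)
{
        struct ring_buffer_per_cpu *cpu_buffer;
        bool ret = false;

        /* Reads of all CPUs always wait for any data */
        if (cpu == RING_BUFFER_ALL_CPUS)
                return !ring_buffer_empty(buffer);

        cpu_buffer = buffer->buffers[cpu];

        if (!ring_buffer_empty_cpu(buffer, cpu)) {
                unsigned long flags;
                bool pagebusy;

                if (!full)
                        return true;

                raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
                pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page;
                ret = !pagebusy && full_hit(buffer, cpu, full);

                /* Record the smallest watermark any waiter asked for */
                if (!cpu_buffer->shortest_full ||
                    cpu_buffer->shortest_full > full)
                        cpu_buffer->shortest_full = full;
                raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
        }
        return ret;
}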
[PATCH v3 1/2] ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
From: "Steven Rostedt (Google)" Convert ring_buffer_wait() over to wait_event_interruptible(). The default condition is to execute the wait loop inside __wait_event() just once. This does not change the ring_buffer_wait() prototype yet, but restructures the code so that it can take a "cond" and "data" parameter and will call wait_event_interruptible() with a helper function as the condition. The helper function (rb_wait_cond) takes the cond function and data parameters. It will first check if the buffer hit the watermark defined by the "full" parameter and then call the passed in condition parameter. If either are true, it returns true. If rb_wait_cond() does not return true, it will set the appropriate "waiters_pending" flag and returns false. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 1 + kernel/trace/ring_buffer.c | 116 +--- 2 files changed, 69 insertions(+), 48 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index fa802db216f9..338a33db1577 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -98,6 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k __ring_buffer_alloc((size), (flags), &__key); \ }) +typedef bool (*ring_buffer_cond_fn)(void *data); int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full); __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, struct file *filp, poll_table *poll_table, int full); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6ef763f57c66..c198ba466853 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -842,43 +842,15 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full) return ret; } -/** - * ring_buffer_wait - wait for input to the ring buffer - * @buffer: buffer to wait on - * @cpu: the cpu buffer to wait on - * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS - * - * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon - * as data is added to any of the @buffer's cpu buffers. Otherwise - * it will wait for data to be added to a specific cpu buffer. - */ -int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) +static inline bool +rb_wait_cond(struct rb_irq_work *rbwork, struct trace_buffer *buffer, +int cpu, int full, ring_buffer_cond_fn cond, void *data) { - struct ring_buffer_per_cpu *cpu_buffer; - DEFINE_WAIT(wait); - struct rb_irq_work *work; - int ret = 0; - - /* -* Depending on what the caller is waiting for, either any -* data in any cpu buffer, or a specific buffer, put the -* caller on the appropriate wait queue. 
-*/ - if (cpu == RING_BUFFER_ALL_CPUS) { - work = &buffer->irq_work; - /* Full only makes sense on per cpu reads */ - full = 0; - } else { - if (!cpumask_test_cpu(cpu, buffer->cpumask)) - return -ENODEV; - cpu_buffer = buffer->buffers[cpu]; - work = &cpu_buffer->irq_work; - } + if (rb_watermark_hit(buffer, cpu, full)) + return true; - if (full) - prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE); - else - prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE); + if (cond(data)) + return true; /* * The events can happen in critical sections where @@ -901,27 +873,75 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) * a task has been queued. It's OK for spurious wake ups. */ if (full) - work->full_waiters_pending = true; + rbwork->full_waiters_pending = true; else - work->waiters_pending = true; + rbwork->waiters_pending = true; - if (rb_watermark_hit(buffer, cpu, full)) - goto out; + return false; +} - if (signal_pending(current)) { - ret = -EINTR; - goto out; +/* + * The default wait condition for ring_buffer_wait() is to just to exit the + * wait loop the first time it is woken up. + */ +static bool rb_wait_once(void *data) +{ + long *once = data; + + /* wait_event() actually calls this twice before scheduling*/ + if (*once > 1) + return true; + + (*once)++; + return false; +} + +/** + * ring_buffer_wait - wait for input to the ring buffer + * @buffer: buffer to wait on + * @cpu: the cpu buffer to
[PATCH v3 0/2] tracing/ring-buffer: Fix wakeup of ring buffer waiters
A patch was sent to "fix" the wait_index variable that is used to help with waking of waiters on the ring buffer. The patch was rejected, but I started looking at associated code. Discussing it on IRC with Mathieu Desnoyers we discovered a design flaw. The waiter reads "wait_index" then enters a "wait loop". After adding itself to the wait queue, if the buffer is filled or a signal is pending it will exit out. If wait_index is different, it will also exit out. (Note, I noticed that the check for wait_index was after the schedule() call and should have been before it, but that's besides the point). The race is what happens if the waker updates the wait_index, a new waiter comes in and reads the updated wait_index before it adds itself to the queue, it will miss the update! These patches fix the design by converting the ring_buffer_wait() to use the wait_event_interruptible() interface. Then the wait_to_pipe() caller will pass its own condition into the ring_buffer_wait() to be checked by the wait_event_interruptible(). - The first patch restructures the ring_buffer_wait() to use the wait_event_interruptible() logic. It does not change the interface, but adds a local "cond" condition callback function that will be what it defaults to in the second patch if NULL is passed in. The default cond function is just a "wait once" function. That is, the first time it is called (before the wait_event_interruptible() calls schedule) will set the "once" variable to one and return false. The next time it is called (after wait_event_interruptible() wakes up) it will return true to break out of the loop. - The second patch changes the ring_buffer_wait() interface to allow the caller to define its own "cond" callback. That will be checked in wait_event_interruptible() along with checking if the proper amount of data in the ring buffer has been hit. Changes since v2: https://lore.kernel.org/all/20240308202402.234176...@goodmis.org/ - Patches 1-3 of v2 have been accepted. - Instead of breaking ring_buffer_wait() into three functions that do a prepare_to_wait(), wait() and finish_wait(), have ring_buffer_wait() take a condition callback function and a data parameter. The ring_buffer_wait() now uses a wait_event_interruptible() code, and added a helper function to do check its own conditions, and also to call the passed in condition function and allow the caller to specify other conditions to break out of the event loop. Changes since v1: https://lore.kernel.org/lkml/20240308183816.676883...@goodmis.org/ - My tests triggered a warning about calling a mutex_lock() after a prepare_to_wait() that changed the task's state. Convert the affected mutex over to a spinlock. Steven Rostedt (Google) (2): ring-buffer: Use wait_event_interruptible() in ring_buffer_wait() tracing/ring-buffer: Fix wait_on_pipe() race include/linux/ring_buffer.h | 4 +- include/linux/trace_events.h | 5 +- kernel/trace/ring_buffer.c | 117 +-- kernel/trace/trace.c | 43 +++- 4 files changed, 107 insertions(+), 62 deletions(-)
[PATCH v3 2/2] tracing/ring-buffer: Fix wait_on_pipe() race
From: "Steven Rostedt (Google)" When the trace_pipe_raw file is closed, there should be no new readers on the file descriptor. This is mostly handled with the waking and wait_index fields of the iterator. But there's still a slight race. CPU 0 CPU 1 - - wait_index++; index = wait_index; ring_buffer_wake_waiters(); wait_on_pipe() ring_buffer_wait(); The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is that the ring_buffer_wait() needs the logic of: prepare_to_wait(); if (!condition) schedule(); Where the missing condition check is the iter->wait_index update. Have the ring_buffer_wait() take a conditional callback function and a data parameter that can be used within the wait_event_interruptible() of the ring_buffer_wait() function. In wait_on_pipe(), pass a condition function that will check if the wait_index has been updated, if it has, it will return true to break out of the wait_event_interruptible() loop. Create a new field "closed" in the trace_iterator and set it in the .flush() callback before calling ring_buffer_wake_waiters(). This will keep any new readers from waiting on a closed file descriptor. Have the wait_on_pipe() condition callback also check the closed field. Change the wait_index field of the trace_iterator to atomic_t. There's no reason it needs to be 'long' and making it atomic and using atomic_read_acquire() and atomic_fetch_inc_release() will provide the necessary memory barriers. Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after one more try to fetch data. That is, if it waited for data and something woke it up, it should try to collect any new data and then exit back to user space. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 3 ++- include/linux/trace_events.h | 5 - kernel/trace/ring_buffer.c | 11 - kernel/trace/trace.c | 43 ++-- 4 files changed, 43 insertions(+), 19 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 338a33db1577..dc5ae4e96aee 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k }) typedef bool (*ring_buffer_cond_fn)(void *data); -int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full); +int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, +ring_buffer_cond_fn cond, void *data); __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, struct file *filp, poll_table *poll_table, int full); void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu); diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index d68ff9b1247f..fc6d0af56bb1 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -103,13 +103,16 @@ struct trace_iterator { unsigned inttemp_size; char*fmt; /* modified format holder */ unsigned intfmt_size; - longwait_index; + atomic_twait_index; /* trace_seq for __print_flags() and __print_symbolic() etc. 
*/ struct trace_seqtmp_seq; cpumask_var_t started; + /* Set when the file is closed to prevent new waiters */ + boolclosed; + /* it's true when current open file is snapshot */ boolsnapshot; diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index c198ba466853..81a5303bdc09 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -906,18 +906,19 @@ static bool rb_wait_once(void *data) * as data is added to any of the @buffer's cpu buffers. Otherwise * it will wait for data to be added to a specific cpu buffer. */ -int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) +int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, +ring_buffer_cond_fn cond, void *data) { struct ring_buffer_per_cpu *cpu_buffer; struct wait_queue_head *waitq; - ring_buffer_cond_fn cond; struct rb_irq_work *rbwork; - void *data; long once = 0; int ret = 0; - cond = rb_wait_once; - data = &once; + if (!cond) { + cond = rb_wait_once; + data = &once; + } /* * Depending on what the calle
Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 11/03/2024 17:27, Tanmay Shah wrote:
>>> +then:
>>> +  patternProperties:
>>> +    "^r5f@[0-9a-f]+$":
>>> +      type: object
>>> +
>>> +      properties:
>>> +        reg:
>>> +          minItems: 1
>>> +          items:
>>> +            - description: ATCM internal memory
>>> +            - description: BTCM internal memory
>>> +            - description: extra ATCM memory in lockstep mode
>>> +            - description: extra BTCM memory in lockstep mode
>>> +
>>> +        reg-names:
>>> +          minItems: 1
>>> +          items:
>>> +            - const: atcm0
>>> +            - const: btcm0
>>> +            - const: atcm1
>>> +            - const: btcm1
>>
>> Why power domains are flexible?
>
> A user may not want to use all the TCMs. For example, if a user wants
> to turn on only TCM-A and keep the rest off, they can omit the
> power-domains of the other TCMs from the device tree. This helps
> reduce power consumption when needed.
>
> Hence the flexible power-domains list.
>

Isn't turning on/off driver's job? Sorry, but what is "user" here? DTS
describes bindings, not OS policy.

Also, please wrap your replies to match email style.

> I can certainly mention "items:" under power-domains property.
>
>
>> Best regards, Krzysztof
Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 11/03/2024 19:39, Tanmay Shah wrote:
>>> +
>>> +else:
>>> +  patternProperties:
>>> +    "^r5f@[0-9a-f]+$":
>>> +      type: object
>>> +
>>> +      properties:
>>> +        reg:
>>> +          minItems: 1
>>> +          items:
>>> +            - description: ATCM internal memory
>>> +            - description: BTCM internal memory
>>> +
>>> +        reg-names:
>>> +          minItems: 1
>>> +          items:
>>> +            - const: atcm0
>>> +            - const: btcm0
>>> +
>>> +        power-domains:
>>> +          maxItems: 3
>>
>> Please list power domains.
>
> Hello,
>
> Sent v13 addressing both comments.
>

And gave me exactly two hours to disagree?

Best regards,
Krzysztof
Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 11/03/2024 18:59, Tanmay Shah wrote:
> From: Radhey Shyam Pandey
>
> Introduce bindings for the TCM memory address space on the AMD-Xilinx Zynq
> UltraScale+ platform. It will help in defining TCM in the device tree
> and make its access platform agnostic and data-driven.
>
> Tightly-coupled memories (TCMs) are low-latency memories that provide
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
>
> The TCM resources (reg, reg-names and power-domain) are documented for
> each TCM in the R5 node. The reg and reg-names are made required
> properties as we don't want to hardcode TCM addresses for future
> platforms, and for the zu+ legacy implementation this will ensure that
> old dts without reg/reg-names keeps working and a stable ABI is
> maintained.
>
> It also extends the examples for TCM split and lockstep modes.
>
> Signed-off-by: Radhey Shyam Pandey
> Signed-off-by: Tanmay Shah
> ---
>
> Changes in v13:
>   - Have power-domains property for lockstep case instead of
>     keeping it flexible.
>   - Add "items:" list in power-domains property

Reviewed-by: Krzysztof Kozlowski

Best regards,
Krzysztof
[PATCH v4 1/2] ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()
From: "Steven Rostedt (Google)" Convert ring_buffer_wait() over to wait_event_interruptible(). The default condition is to execute the wait loop inside __wait_event() just once. This does not change the ring_buffer_wait() prototype yet, but restructures the code so that it can take a "cond" and "data" parameter and will call wait_event_interruptible() with a helper function as the condition. The helper function (rb_wait_cond) takes the cond function and data parameters. It will first check if the buffer hit the watermark defined by the "full" parameter and then call the passed in condition parameter. If either are true, it returns true. If rb_wait_cond() does not return true, it will set the appropriate "waiters_pending" flag and returns false. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 1 + kernel/trace/ring_buffer.c | 116 +--- 2 files changed, 69 insertions(+), 48 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index fa802db216f9..338a33db1577 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -98,6 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k __ring_buffer_alloc((size), (flags), &__key); \ }) +typedef bool (*ring_buffer_cond_fn)(void *data); int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full); __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, struct file *filp, poll_table *poll_table, int full); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6ef763f57c66..c198ba466853 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -842,43 +842,15 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full) return ret; } -/** - * ring_buffer_wait - wait for input to the ring buffer - * @buffer: buffer to wait on - * @cpu: the cpu buffer to wait on - * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS - * - * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon - * as data is added to any of the @buffer's cpu buffers. Otherwise - * it will wait for data to be added to a specific cpu buffer. - */ -int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) +static inline bool +rb_wait_cond(struct rb_irq_work *rbwork, struct trace_buffer *buffer, +int cpu, int full, ring_buffer_cond_fn cond, void *data) { - struct ring_buffer_per_cpu *cpu_buffer; - DEFINE_WAIT(wait); - struct rb_irq_work *work; - int ret = 0; - - /* -* Depending on what the caller is waiting for, either any -* data in any cpu buffer, or a specific buffer, put the -* caller on the appropriate wait queue. 
-*/ - if (cpu == RING_BUFFER_ALL_CPUS) { - work = &buffer->irq_work; - /* Full only makes sense on per cpu reads */ - full = 0; - } else { - if (!cpumask_test_cpu(cpu, buffer->cpumask)) - return -ENODEV; - cpu_buffer = buffer->buffers[cpu]; - work = &cpu_buffer->irq_work; - } + if (rb_watermark_hit(buffer, cpu, full)) + return true; - if (full) - prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE); - else - prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE); + if (cond(data)) + return true; /* * The events can happen in critical sections where @@ -901,27 +873,75 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) * a task has been queued. It's OK for spurious wake ups. */ if (full) - work->full_waiters_pending = true; + rbwork->full_waiters_pending = true; else - work->waiters_pending = true; + rbwork->waiters_pending = true; - if (rb_watermark_hit(buffer, cpu, full)) - goto out; + return false; +} - if (signal_pending(current)) { - ret = -EINTR; - goto out; +/* + * The default wait condition for ring_buffer_wait() is to just to exit the + * wait loop the first time it is woken up. + */ +static bool rb_wait_once(void *data) +{ + long *once = data; + + /* wait_event() actually calls this twice before scheduling*/ + if (*once > 1) + return true; + + (*once)++; + return false; +} + +/** + * ring_buffer_wait - wait for input to the ring buffer + * @buffer: buffer to wait on + * @cpu: the cpu buffer to
[PATCH v4 2/2] tracing/ring-buffer: Fix wait_on_pipe() race
From: "Steven Rostedt (Google)" When the trace_pipe_raw file is closed, there should be no new readers on the file descriptor. This is mostly handled with the waking and wait_index fields of the iterator. But there's still a slight race. CPU 0 CPU 1 - - wait_index++; index = wait_index; ring_buffer_wake_waiters(); wait_on_pipe() ring_buffer_wait(); The ring_buffer_wait() will miss the wakeup from CPU 1. The problem is that the ring_buffer_wait() needs the logic of: prepare_to_wait(); if (!condition) schedule(); Where the missing condition check is the iter->wait_index update. Have the ring_buffer_wait() take a conditional callback function and a data parameter that can be used within the wait_event_interruptible() of the ring_buffer_wait() function. In wait_on_pipe(), pass a condition function that will check if the wait_index has been updated, if it has, it will return true to break out of the wait_event_interruptible() loop. Create a new field "closed" in the trace_iterator and set it in the .flush() callback before calling ring_buffer_wake_waiters(). This will keep any new readers from waiting on a closed file descriptor. Have the wait_on_pipe() condition callback also check the closed field. Change the wait_index field of the trace_iterator to atomic_t. There's no reason it needs to be 'long' and making it atomic and using atomic_read_acquire() and atomic_fetch_inc_release() will provide the necessary memory barriers. Add a "woken" flag to tracing_buffers_splice_read() to exit the loop after one more try to fetch data. That is, if it waited for data and something woke it up, it should try to collect any new data and then exit back to user space. Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsngewhfxzajiaqznwpmqetqmi1waes2o1v6l4c_u...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file") Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 3 ++- include/linux/trace_events.h | 5 - kernel/trace/ring_buffer.c | 13 ++- kernel/trace/trace.c | 43 ++-- 4 files changed, 45 insertions(+), 19 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 338a33db1577..dc5ae4e96aee 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -99,7 +99,8 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k }) typedef bool (*ring_buffer_cond_fn)(void *data); -int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full); +int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, +ring_buffer_cond_fn cond, void *data); __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, struct file *filp, poll_table *poll_table, int full); void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu); diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index d68ff9b1247f..fc6d0af56bb1 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -103,13 +103,16 @@ struct trace_iterator { unsigned inttemp_size; char*fmt; /* modified format holder */ unsigned intfmt_size; - longwait_index; + atomic_twait_index; /* trace_seq for __print_flags() and __print_symbolic() etc. 
*/ struct trace_seqtmp_seq; cpumask_var_t started; + /* Set when the file is closed to prevent new waiters */ + boolclosed; + /* it's true when current open file is snapshot */ boolsnapshot; diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index c198ba466853..67d8405f4451 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -901,23 +901,26 @@ static bool rb_wait_once(void *data) * @buffer: buffer to wait on * @cpu: the cpu buffer to wait on * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS + * @cond: condition function to break out of wait (NULL to run once) + * @data: the data to pass to @cond. * * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon * as data is added to any of the @buffer's cpu buffers. Otherwise * it will wait for data to be added to a specific cpu buffer. */ -int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full) +int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, +ring_buffer_cond_fn cond, void *data) { struct ring_buffer_per_cpu *cpu_buffer; struct wait_queue_hea
[PATCH v4 0/2] tracing/ring-buffer: Fix wakeup of ring buffer waiters
[ Sorry for the spam, but I just noticed kernel-test-robot complained about KernelDoc format, which this fixes ] A patch was sent to "fix" the wait_index variable that is used to help with waking of waiters on the ring buffer. The patch was rejected, but I started looking at associated code. Discussing it on IRC with Mathieu Desnoyers we discovered a design flaw. The waiter reads "wait_index" then enters a "wait loop". After adding itself to the wait queue, if the buffer is filled or a signal is pending it will exit out. If wait_index is different, it will also exit out. (Note, I noticed that the check for wait_index was after the schedule() call and should have been before it, but that's besides the point). The race is what happens if the waker updates the wait_index, a new waiter comes in and reads the updated wait_index before it adds itself to the queue, it will miss the update! These patches fix the design by converting the ring_buffer_wait() to use the wait_event_interruptible() interface. Then the wait_to_pipe() caller will pass its own condition into the ring_buffer_wait() to be checked by the wait_event_interruptible(). - The first patch restructures the ring_buffer_wait() to use the wait_event_interruptible() logic. It does not change the interface, but adds a local "cond" condition callback function that will be what it defaults to in the second patch if NULL is passed in. The default cond function is just a "wait once" function. That is, the first time it is called (before the wait_event_interruptible() calls schedule) will set the "once" variable to one and return false. The next time it is called (after wait_event_interruptible() wakes up) it will return true to break out of the loop. - The second patch changes the ring_buffer_wait() interface to allow the caller to define its own "cond" callback. That will be checked in wait_event_interruptible() along with checking if the proper amount of data in the ring buffer has been hit. Changes since v3: https://lore.kernel.org/all/20240312120252.998983...@goodmis.org/ - I should have checked before sending v2, but after I did, I noticed that kernel-test-robot reported that the cond and data parameters added to ring_buffer_wait() were not added to the kerneldoc above it. Changes since v2: https://lore.kernel.org/all/20240308202402.234176...@goodmis.org/ - Patches 1-3 of v2 have been accepted. - Instead of breaking ring_buffer_wait() into three functions that do a prepare_to_wait(), wait() and finish_wait(), have ring_buffer_wait() take a condition callback function and a data parameter. The ring_buffer_wait() now uses a wait_event_interruptible() code, and added a helper function to do check its own conditions, and also to call the passed in condition function and allow the caller to specify other conditions to break out of the event loop. Changes since v1: https://lore.kernel.org/lkml/20240308183816.676883...@goodmis.org/ - My tests triggered a warning about calling a mutex_lock() after a prepare_to_wait() that changed the task's state. Convert the affected mutex over to a spinlock. Steven Rostedt (Google) (2): ring-buffer: Use wait_event_interruptible() in ring_buffer_wait() tracing/ring-buffer: Fix wait_on_pipe() race include/linux/ring_buffer.h | 4 +- include/linux/trace_events.h | 5 +- kernel/trace/ring_buffer.c | 119 ++- kernel/trace/trace.c | 43 +++- 4 files changed, 109 insertions(+), 62 deletions(-)
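
The trace.c hunks of patch 2/2 are cut off above; going only by the changelog, the condition that wait_on_pipe() passes into ring_buffer_wait() would look roughly like the following sketch (struct and field names are assumptions where the hunk is not visible):

struct pipe_wait {
        struct trace_iterator   *iter;
        int                     wait_index;
};

static bool wait_pipe_cond(void *data)
{
        struct pipe_wait *pwait = data;
        struct trace_iterator *iter = pwait->iter;

        /*
         * Another task bumped wait_index (the ring_buffer_wake_waiters()
         * path), so break out of the wait loop.
         */
        if (atomic_read_acquire(&iter->wait_index) != pwait->wait_index)
                return true;

        /* The file was closed; don't keep waiting on it */
        return iter->closed;
}

/* wait_on_pipe() would then do something along these lines: */
static int wait_on_pipe_sketch(struct trace_iterator *iter, int full)
{
        struct pipe_wait pwait = {
                .wait_index     = atomic_read_acquire(&iter->wait_index),
                .iter           = iter,
        };

        return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file,
                                full, wait_pipe_cond, &pwait);
}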
[PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic
From: "Steven Rostedt (Google)" The check for knowing if the poll should wait or not is basically the exact same logic as rb_watermark_hit(). The only difference is that rb_watermark_hit() also handles the !full case. But for the full case, the logic is the same. Just call that instead of duplicating the code in ring_buffer_poll_wait(). Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 21 +++-- 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index adfe603a769b..857803e8cf07 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -959,25 +959,18 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, } if (full) { - unsigned long flags; - poll_wait(filp, &rbwork->full_waiters, poll_table); - raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); - if (!cpu_buffer->shortest_full || - cpu_buffer->shortest_full > full) - cpu_buffer->shortest_full = full; - raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - if (full_hit(buffer, cpu, full)) + if (rb_watermark_hit(buffer, cpu, full)) return EPOLLIN | EPOLLRDNORM; /* * Only allow full_waiters_pending update to be seen after -* the shortest_full is set. If the writer sees the -* full_waiters_pending flag set, it will compare the -* amount in the ring buffer to shortest_full. If the amount -* in the ring buffer is greater than the shortest_full -* percent, it will call the irq_work handler to wake up -* this list. The irq_handler will reset shortest_full +* the shortest_full is set (in rb_watermark_hit). If the +* writer sees the full_waiters_pending flag set, it will +* compare the amount in the ring buffer to shortest_full. +* If the amount in the ring buffer is greater than the +* shortest_full percent, it will call the irq_work handler +* to wake up this list. The irq_handler will reset shortest_full * back to zero. That's done under the reader_lock, but * the below smp_mb() makes sure that the update to * full_waiters_pending doesn't leak up into the above. -- 2.43.0
[PATCH v2 0/2] ring-buffer: Fix poll wakeup logic
After making a slight change to wakeups in ring_buffer_wait() the system would hang. Spending several hours going on a wild goose chase I found that the change only triggered the bug because it changed the timings. The bug was there before the update but never was triggered. The poll code has: rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; The writer will see full_waiters_pending and check if the ring buffer is filled over the percentage of the shortest_full value. If it is, it calls an irq_work to wake up all the waiters. But the code could get into a circular loop: CPU 0 CPU 1 - - [ Poll ] [ shortest_full = 0 ] rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] if ([ buffer percent ] > full) break; rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] [ Wash, rinse, repeat! ] The race was triggered when running: trace-cmd record -p function -m 5000 Which enables function tracing and then creates two files it is writing into where each is 2500K in size. The -m is a "max file size". When trace-cmd writes 2500K to one file it then switches to the other, erasing the old data. To do this, trace-cmd switches between both poll and the reader using both methods of wake up. The change to the reader wakeup was able to change the way the poll was woken to trigger this bug. The second patch is a clean up and also a way to consolidate the logic of the shortest_full. The read wakeup uses rb_watermark_hit for both full wakeups and !full wakeups. But since poll uses the same logic for full wakeups it can just call that function with full set. Changes since v1: https://lore.kernel.org/all/20240312115455.666920...@goodmis.org/ - Removed unused 'flags' in ring_buffer_poll_wait() as the spin_lock is now in rb_watermark_hit(). Steven Rostedt (Google) (2): ring-buffer: Fix full_waiters_pending in poll ring-buffer: Reuse rb_watermark_hit() for the poll logic kernel/trace/ring_buffer.c | 32 +++- 1 file changed, 19 insertions(+), 13 deletions(-)
[PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll
From: "Steven Rostedt (Google)" If a reader of the ring buffer is doing a poll, and waiting for the ring buffer to hit a specific watermark, there could be a case where it gets into an infinite ping-pong loop. The poll code has: rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; The writer will see full_waiters_pending and check if the ring buffer is filled over the percentage of the shortest_full value. If it is, it calls an irq_work to wake up all the waiters. But the code could get into a circular loop: CPU 0 CPU 1 - - [ Poll ] [ shortest_full = 0 ] rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] if ([ buffer percent ] > full) break; rbwork->full_waiters_pending = true; if (rbwork->full_waiters_pending && [ buffer percent ] > shortest_full) { rbwork->wakeup_full = true; [ queue_irqwork ] cpu_buffer->shortest_full = full; [ IRQ work ] if (rbwork->wakeup_full) { cpu_buffer->shortest_full = 0; wakeup poll waiters; [woken] [ Wash, rinse, repeat! ] In the poll, the shortest_full needs to be set before the full_pending_waiters, as once that is set, the writer will compare the current shortest_full (which is incorrect) to decide to call the irq_work, which will reset the shortest_full (expecting the readers to update it). Also move the setting of full_waiters_pending after the check if the ring buffer has the required percentage filled. There's no reason to tell the writer to wake up waiters if there are no waiters. Cc: sta...@vger.kernel.org Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 27 --- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index aa332ace108b..adfe603a769b 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu, poll_wait(filp, &rbwork->full_waiters, poll_table); raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); - rbwork->full_waiters_pending = true; if (!cpu_buffer->shortest_full || cpu_buffer->shortest_full > full) cpu_buffer->shortest_full = full; raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - } else { - poll_wait(filp, &rbwork->waiters, poll_table); - rbwork->waiters_pending = true; + if (full_hit(buffer, cpu, full)) + return EPOLLIN | EPOLLRDNORM; + /* +* Only allow full_waiters_pending update to be seen after +* the shortest_full is set. If the writer sees the +* full_waiters_pending flag set, it will compare the +* amount in the ring buffer to shortest_full. If the amount +* in the ring buffer is greater than the shortest_full +* percent, it will call the irq_work handler to wake up +* this list. The irq_handler will reset shortest_full +* back to zero. That's done under the reader_lock, but +* the below smp_mb() makes sure that the update to +* full_waiters_pending doesn't leak up into the above. 
+*/ + smp_mb(); + rbwork->full_waiters_pending = true; + return 0; } + poll_wait(filp, &rbwork->waiters, poll_table); + rbwork->waiters_pending = true; + /* * There's a tight race between setting the waiters_pending and * checking if the ring buffer is empty. Once the waiters_pending bit @@ -989,9 +1005,
Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
Hi Bjorn (apologies, my corporate mail server has butchered your name here). There's a big info dump below; I realise this sounds like a sales pitch for CALL_OPS, but my intent is more to say "here are some dragons you may not have spotted". On Thu, Mar 07, 2024 at 08:27:40PM +0100, Bj"orn T"opel wrote: > Puranjay! > > Puranjay Mohan writes: > > > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V. > > This allows each ftrace callsite to provide an ftrace_ops to the common > > ftrace trampoline, allowing each callsite to invoke distinct tracer > > functions without the need to fall back to list processing or to > > allocate custom trampolines for each callsite. This significantly speeds > > up cases where multiple distinct trace functions are used and callsites > > are mostly traced by a single tracer. > > > > The idea and most of the implementation is taken from the ARM64's > > implementation of the same feature. The idea is to place a pointer to > > the ftrace_ops as a literal at a fixed offset from the function entry > > point, which can be recovered by the common ftrace trampoline. > > Not really a review, but some more background; Another rationale (on-top > of the improved per-call performance!) for CALL_OPS was to use it to > build ftrace direct call support (which BPF uses a lot!). Mark, please > correct me if I'm lying here! Yep; it gives us the ability to do a number of per-callsite things, including direct calls. > On Arm64, CALL_OPS makes it possible to implement direct calls, while > only patching one BL instruction -- nice! The key thing here isn't that we patch a single instruction (since we have ot patch the ops pointer too!); it's that we can safely patch either of the ops pointer or BL/NOP at any time while threads are concurrently executing. If you have a multi-instruction sequence, then threads can be preempted mid-sequence, and it's very painful/complex to handle all of the races that entails. For example, if your callsites use a sequence: AUIPC , JALR , () Using stop_machine() won't allow you to patch that safely as some threads could be stuck mid-sequence, e.g. AUIPC , [ preempted here ] JALR , () ... and you can't update the JALR to use a new funcptr immediate until those have completed the sequence. There are ways around that, but they're complicated and/or expensive, e.g. * Use a sequence of multiple patches, starting with replacing the JALR with an exception-generating instruction with a fixup handler, which is sort-of what x86 does with UD2. This may require multiple passes with synchronize_rcu_tasks() to make sure all threads have seen the latest instructions, and that cannot be done under stop_machine(), so if you need stop_machine() for CMODx reasons, you may need to use that several times with intervening calls to synchronize_rcu_tasks(). * Have the patching logic manually go over each thread and fix up the pt_regs for the interrupted thread. This is pretty horrid since you could have nested exceptions and a task could have several pt_regs which might require updating. The CALL_OPS approach is a bit easier to deal with as we can patch the per-callsite pointer atomically, then we can (possibly) enable/disable the callsite's branch, then wait for threads to drain once. As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE generally in this area (CALL_OPs happens to side-step those, but trampoline usage is currently affected): https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/ ... 
I'm looking into fixing that at the moment, and it looks like that's likely to require some per-architecture changes. > On RISC-V we cannot use use the same ideas as Arm64 straight off, > because the range of jal (compare to BL) is simply too short (+/-1M). > So, on RISC-V we need to use a full auipc/jal pair (the text patching > story is another chapter, but let's leave that aside for now). Since we > have to patch multiple instructions, the cmodx situation doesn't really > improve with CALL_OPS. The branch range thing is annoying, but I think this boils down to the same problem as arm64 has with needing a "MOV , LR" instruction that we have to patch in once at boot time. You could do the same and patch in the AUIPC once, e.g. have | NOP | NOP | func: | AUIPC , | JALR , () // patched with NOP ... which'd look very similar to arm64's sequence: | NOP | NOP | func: | MOV X9, LR | BL ftrace_caller // patched with NOP ... which I think means it *might* be better from a cmodx perspective? > Let's say that we continue building on your patch and implement direct > calls on CALL_OPS for RISC-V as well. > > From Florent's commit message for direct calls: > > |There are a few cases to distinguish: > |- If a direct call ops is the only one tracing a function: > | - If the direct called trampoline is within
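
To make the CALL_OPS idea above slightly more concrete: each patched function keeps a pointer-sized literal just before its patchable entry, and the common trampoline loads the per-callsite ftrace_ops through it. A heavily simplified C rendering of that lookup follows (offsets and the helper name are invented for illustration; the real arm64/riscv code does this in the assembly trampoline and accounts for the exact prologue layout):

/*
 * Conceptual layout:
 *
 *      .literal:       pointer to this callsite's ftrace_ops
 *      func:
 *              patchable call to ftrace_caller (NOP'd when disabled)
 *              ...
 */
static struct ftrace_ops *callsite_ops(unsigned long func_entry)
{
        /* Assumed: the literal sits immediately before the entry point */
        unsigned long literal = func_entry - sizeof(struct ftrace_ops *);

        return *(struct ftrace_ops **)literal;
}

Because only that single pointer (and the call/NOP) ever needs to change at runtime, the per-callsite behavior can be switched atomically without the multi-instruction patching races described above.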
Re: [PATCHv2 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge
Hi Pavel, I'm sorry to keep you waiting. On Fri, Feb 23, 2024 at 10:28:49PM +0100, Pavel Machek wrote: > From: Ondrej Jirman > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging > features removed. ANX7688 is rather criticial piece on PinePhone, > there's no display and no battery charging without it. > > There's likely more work to be done here, but having basic support > in mainline is needed to be able to work on the other stuff > (networking, cameras, power management). > > Signed-off-by: Ondrej Jirman > Co-developed-by: Martijn Braam > Co-developed-by: Samuel Holland > Signed-off-by: Pavel Machek > > --- > > v2: Fix checkpatch stuff. Some cleanups, adapt to dts format in 1/2. > > diff --git a/drivers/usb/typec/Kconfig b/drivers/usb/typec/Kconfig > index 2f80c2792dbd..c9043ae61546 100644 > --- a/drivers/usb/typec/Kconfig > +++ b/drivers/usb/typec/Kconfig > @@ -64,6 +64,17 @@ config TYPEC_ANX7411 > If you choose to build this driver as a dynamically linked module, the > module will be called anx7411.ko. > > +config TYPEC_ANX7688 > + tristate "Analogix ANX7688 Type-C DRP Port controller and mux driver" > + depends on I2C > + depends on USB_ROLE_SWITCH > + help > + Say Y or M here if your system has Analogix ANX7688 Type-C Bridge > + controller driver. > + > + If you choose to build this driver as a dynamically linked module, the > + module will be called anx7688.ko. > + > config TYPEC_RT1719 > tristate "Richtek RT1719 Sink Only Type-C controller driver" > depends on USB_ROLE_SWITCH || !USB_ROLE_SWITCH > diff --git a/drivers/usb/typec/Makefile b/drivers/usb/typec/Makefile > index 7a368fea61bc..3f8ff94ad294 100644 > --- a/drivers/usb/typec/Makefile > +++ b/drivers/usb/typec/Makefile > @@ -7,6 +7,7 @@ obj-$(CONFIG_TYPEC_TCPM) += tcpm/ > obj-$(CONFIG_TYPEC_UCSI) += ucsi/ > obj-$(CONFIG_TYPEC_TPS6598X) += tipd/ > obj-$(CONFIG_TYPEC_ANX7411) += anx7411.o > +obj-$(CONFIG_TYPEC_ANX7688) += anx7688.o > obj-$(CONFIG_TYPEC_HD3SS3220)+= hd3ss3220.o > obj-$(CONFIG_TYPEC_STUSB160X)+= stusb160x.o > obj-$(CONFIG_TYPEC_RT1719) += rt1719.o > diff --git a/drivers/usb/typec/anx7688.c b/drivers/usb/typec/anx7688.c > new file mode 100644 > index ..14d033bbc0c7 > --- /dev/null > +++ b/drivers/usb/typec/anx7688.c > @@ -0,0 +1,1927 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * ANX7688 USB-C HDMI bridge/PD driver > + * > + * Warning, this driver is somewhat PinePhone specific. 
> + * > + * How this works: > + * - this driver allows to program firmware into ANX7688 EEPROM, and > + * initialize it > + * - it then communicates with the firmware running on the OCM (on-chip > + * microcontroller) > + * - it detects whether there is cable plugged in or not and powers > + * up or down the ANX7688 based on that > + * - when the cable is connected the firmware on the OCM will handle > + * the detection of the nature of the device on the other end > + * of the USB-C cable > + * - this driver then communicates with the USB phy to let it swap > + * data roles accordingly > + * - it also enables VBUS and VCONN regulators as appropriate > + * - USB phy driver (Allwinner) needs to know whether to switch to > + * device or host mode, or whether to turn off > + * - when the firmware detects SRC.1.5A or SRC.3.0A via CC pins > + * or something else via PD, it notifies this driver via software > + * interrupt and this driver will determine how to update the TypeC > + * port status and what input current limit is appropriate > + * - input current limit determination happens 500ms after cable > + * insertion or hard reset (delay is necessary to determine whether > + * the remote end is PD capable or not) > + * - this driver tells to the PMIC driver that the input current limit > + * needs to be changed > + * - this driver also monitors PMIC status and re-sets the input current > + * limit if it changes for some reason (due to PMIC internal decision > + * making) (this is disabled for now) > + * > + * ANX7688 FW behavior as observed: > + * > + * - DO NOT SET MORE THAN 1 SINK CAPABILITY! Firmware will ignore what > + * you set and send hardcoded PDO_BATT 5-21V 30W message! > + * > + * Product brief: > + * > https://www.analogix.com/en/system/files/AA-002281-PB-6-ANX7688_Product_Brief_0.pdf > + * Development notes: > + * https://xnux.eu/devices/feature/anx7688.html > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +/* firmware regs */ > + > +#define ANX7688_REG_VBUS_OFF_DELAY_TIME 0x22 > +#define ANX7688_REG_FEATURE_CTRL0x27 > +#define ANX7688_REG_EEPROM_LOAD_STATUS1 0x11 > +#define ANX7688_REG_EEPROM_LOAD_STATUS0 0x12 > +#define ANX7688_REG_FW_VERSION1 0x15 > +#define ANX7688_REG_
Re: [PATCH 3/3] ARM: dts: qcom: Add Sony Xperia Z3 smartphone
On 3/10/24 12:41, Luca Weiss wrote: Add the dts for the Xperia Z3 smartphone which is based on Sony's shinano platform, so at the moment there's little device-specific dts to add on top of the common parts. Signed-off-by: Luca Weiss --- modulo missing Makefile, this looks good
Re: [PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll
On Tue, 12 Mar 2024 09:19:20 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > If a reader of the ring buffer is doing a poll, and waiting for the ring > buffer to hit a specific watermark, there could be a case where it gets > into an infinite ping-pong loop. > > The poll code has: > > rbwork->full_waiters_pending = true; > if (!cpu_buffer->shortest_full || > cpu_buffer->shortest_full > full) > cpu_buffer->shortest_full = full; > > The writer will see full_waiters_pending and check if the ring buffer is > filled over the percentage of the shortest_full value. If it is, it calls > an irq_work to wake up all the waiters. > > But the code could get into a circular loop: > > CPU 0 CPU 1 > - - > [ Poll ] >[ shortest_full = 0 ] >rbwork->full_waiters_pending = true; > if (rbwork->full_waiters_pending && > [ buffer percent ] > > shortest_full) { >rbwork->wakeup_full = true; >[ queue_irqwork ] Oh, so `[ buffer percent ] > shortest_full` does not work because if this happens in this order, shortest_full may be 0. > >cpu_buffer->shortest_full = full; > > [ IRQ work ] > if (rbwork->wakeup_full) { > cpu_buffer->shortest_full = 0; > wakeup poll waiters; > [woken] >if ([ buffer percent ] > full) > break; >rbwork->full_waiters_pending = true; > if (rbwork->full_waiters_pending && > [ buffer percent ] > > shortest_full) { >rbwork->wakeup_full = true; >[ queue_irqwork ] > >cpu_buffer->shortest_full = full; > > [ IRQ work ] > if (rbwork->wakeup_full) { > cpu_buffer->shortest_full = 0; > wakeup poll waiters; > [woken] > > [ Wash, rinse, repeat! ] > > In the poll, the shortest_full needs to be set before the > full_pending_waiters, as once that is set, the writer will compare the > current shortest_full (which is incorrect) to decide to call the irq_work, > which will reset the shortest_full (expecting the readers to update it). > > Also move the setting of full_waiters_pending after the check if the ring > buffer has the required percentage filled. There's no reason to tell the > writer to wake up waiters if there are no waiters. > Looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thank you, > Cc: sta...@vger.kernel.org > Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ring_buffer.c | 27 --- > 1 file changed, 20 insertions(+), 7 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index aa332ace108b..adfe603a769b 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -964,16 +964,32 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer > *buffer, int cpu, > poll_wait(filp, &rbwork->full_waiters, poll_table); > > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > - rbwork->full_waiters_pending = true; > if (!cpu_buffer->shortest_full || > cpu_buffer->shortest_full > full) > cpu_buffer->shortest_full = full; > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > - } else { > - poll_wait(filp, &rbwork->waiters, poll_table); > - rbwork->waiters_pending = true; > + if (full_hit(buffer, cpu, full)) > + return EPOLLIN | EPOLLRDNORM; > + /* > + * Only allow full_waiters_pending update to be seen after > + * the shortest_full is set. If the writer sees the > + * full_waiters_pending flag set, it will compare the > + * amount in the ring buffer to shortest_full. If the amount > + * in the ring buffer is greater than the shortest_full > + * percent, it will call the irq_work handler to wake up > + * this list. 
The irq_handler will reset shortest_full > + * back to zero. That's done under the reader_lock, but > + * the below smp_mb() makes sure that the update to > + * full_waiters_pending doesn't leak up into the above. > + */ > +
[PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check
From: "Steven Rostedt (Google)" The WARN_ON() check in __assign_str() to catch where the source variable to the macro doesn't match the source variable to __string() gives an error in clang: >> include/trace/events/sunrpc.h:703:4: warning: result of comparison against a >> string literal is unspecified (use an explicit string comparison function >> instead) [-Wstring-compare] 670 | __assign_str(progname, "unknown"); That's because the __assign_str() macro has: WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_); Where "src" is a string literal. Clang warns when comparing a string literal directly as it is undefined to what the value of the literal is. Since this is still to make sure the same string that goes to __string() is the same as __assign_str(), for string literals do a test for that and then use strcmp() in those cases Note that this depends on commit 51270d573a8d ("tracing/net_sched: Fix tracepoints that save qdisc_dev() as a string") being applied, as this was what found that bug. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402292111.kidexylu-...@intel.com/ Fixes: 433e1d88a3be ("tracing: Add warning if string in __assign_str() does not match __string()") Signed-off-by: Steven Rostedt (Google) --- include/trace/stages/stage6_event_callback.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h index a0c15f67eabe..83da83a0c14f 100644 --- a/include/trace/stages/stage6_event_callback.h +++ b/include/trace/stages/stage6_event_callback.h @@ -35,7 +35,9 @@ do {\ char *__str__ = __get_str(dst); \ int __len__ = __get_dynamic_array_len(dst) - 1; \ - WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_); \ + WARN_ON_ONCE(__builtin_constant_p(src) ?\ +strcmp((src), __data_offsets.dst##_ptr_) : \ +(src) != __data_offsets.dst##_ptr_); \ memcpy(__str__, __data_offsets.dst##_ptr_ ? : \ EVENT_NULL_STR, __len__);\ __str__[__len__] = '\0';\ -- 2.43.0
Re: [PATCH v2 1/2] ring-buffer: Fix full_waiters_pending in poll
On Wed, 13 Mar 2024 00:22:10 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 12 Mar 2024 09:19:20 -0400 > Steven Rostedt wrote: > > > From: "Steven Rostedt (Google)" > > > > If a reader of the ring buffer is doing a poll, and waiting for the ring > > buffer to hit a specific watermark, there could be a case where it gets > > into an infinite ping-pong loop. > > > > The poll code has: > > > > rbwork->full_waiters_pending = true; > > if (!cpu_buffer->shortest_full || > > cpu_buffer->shortest_full > full) > > cpu_buffer->shortest_full = full; > > > > The writer will see full_waiters_pending and check if the ring buffer is > > filled over the percentage of the shortest_full value. If it is, it calls > > an irq_work to wake up all the waiters. > > > > But the code could get into a circular loop: > > > > CPU 0 CPU 1 > > - - > > [ Poll ] > >[ shortest_full = 0 ] > >rbwork->full_waiters_pending = true; > > if (rbwork->full_waiters_pending && > > [ buffer percent ] > > > shortest_full) { > > rbwork->wakeup_full = true; > > [ queue_irqwork ] > > Oh, so `[ buffer percent ] > shortest_full` does not work because > if this happens in this order, shortest_full may be 0. Exactly! > > > > >cpu_buffer->shortest_full = full; > > > > [ IRQ work ] > > if (rbwork->wakeup_full) { > > cpu_buffer->shortest_full = 0; And here shortest_full gets set back to zero! (But that's not the bug). > > wakeup poll waiters; > > [woken] > >if ([ buffer percent ] > full) > > break; > >rbwork->full_waiters_pending = true; The bug is setting full_waiters_pending before updating the shortest_full. > > if (rbwork->full_waiters_pending && > > [ buffer percent ] > > > shortest_full) { > > rbwork->wakeup_full = true; > > [ queue_irqwork ] > > > >cpu_buffer->shortest_full = full; > > > > [ IRQ work ] > > if (rbwork->wakeup_full) { > > cpu_buffer->shortest_full = 0; > > wakeup poll waiters; > > [woken] > > > > [ Wash, rinse, repeat! ] > > > > In the poll, the shortest_full needs to be set before the > > full_pending_waiters, as once that is set, the writer will compare the > > current shortest_full (which is incorrect) to decide to call the irq_work, > > which will reset the shortest_full (expecting the readers to update it). > > > > Also move the setting of full_waiters_pending after the check if the ring > > buffer has the required percentage filled. There's no reason to tell the > > writer to wake up waiters if there are no waiters. > > > > Looks good to me. > > Reviewed-by: Masami Hiramatsu (Google) Thanks! I'm running it through my tests and when they finish, I'll be posting the for-linus patches. -- Steve
Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 3/12/24 7:10 AM, Krzysztof Kozlowski wrote: > On 11/03/2024 19:39, Tanmay Shah wrote: > >>> + > >>> +else: > >>> + patternProperties: > >>> +"^r5f@[0-9a-f]+$": > >>> + type: object > >>> + > >>> + properties: > >>> +reg: > >>> + minItems: 1 > >>> + items: > >>> +- description: ATCM internal memory > >>> +- description: BTCM internal memory > >>> + > >>> +reg-names: > >>> + minItems: 1 > >>> + items: > >>> +- const: atcm0 > >>> +- const: btcm0 > >>> + > >>> +power-domains: > >>> + maxItems: 3 > >> > >> Please list power domains. > > > > Hello, > > > > Sent v13 addressing both comments. > > > > And gave me exactly two hours to disagree? I am sorry, I thought I was addressing the right comments. It was a minor change that tempted me to push a new revision; I will wait longer next time. > > Best regards, > Krzysztof >
Re: [PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic
On Tue, 12 Mar 2024 09:19:21 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The check for knowing if the poll should wait or not is basically the > exact same logic as rb_watermark_hit(). The only difference is that > rb_watermark_hit() also handles the !full case. But for the full case, the > logic is the same. Just call that instead of duplicating the code in > ring_buffer_poll_wait(). > This changes a bit (e.g. adding pagebusy check) but basically that should be there. And new version appears to be consistent between ring_buffer_wait() and ring_buffer_poll_wait(). So looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thank you, > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ring_buffer.c | 21 +++-- > 1 file changed, 7 insertions(+), 14 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index adfe603a769b..857803e8cf07 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -959,25 +959,18 @@ __poll_t ring_buffer_poll_wait(struct trace_buffer > *buffer, int cpu, > } > > if (full) { > - unsigned long flags; > - > poll_wait(filp, &rbwork->full_waiters, poll_table); > > - raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); > - if (!cpu_buffer->shortest_full || > - cpu_buffer->shortest_full > full) > - cpu_buffer->shortest_full = full; > - raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > - if (full_hit(buffer, cpu, full)) > + if (rb_watermark_hit(buffer, cpu, full)) > return EPOLLIN | EPOLLRDNORM; > /* >* Only allow full_waiters_pending update to be seen after > - * the shortest_full is set. If the writer sees the > - * full_waiters_pending flag set, it will compare the > - * amount in the ring buffer to shortest_full. If the amount > - * in the ring buffer is greater than the shortest_full > - * percent, it will call the irq_work handler to wake up > - * this list. The irq_handler will reset shortest_full > + * the shortest_full is set (in rb_watermark_hit). If the > + * writer sees the full_waiters_pending flag set, it will > + * compare the amount in the ring buffer to shortest_full. > + * If the amount in the ring buffer is greater than the > + * shortest_full percent, it will call the irq_work handler > + * to wake up this list. The irq_handler will reset > shortest_full >* back to zero. That's done under the reader_lock, but >* the below smp_mb() makes sure that the update to >* full_waiters_pending doesn't leak up into the above. > -- > 2.43.0 > > > -- Masami Hiramatsu (Google)
Re: [PATCH v2 2/2] ring-buffer: Reuse rb_watermark_hit() for the poll logic
On Wed, 13 Mar 2024 00:38:42 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 12 Mar 2024 09:19:21 -0400 > Steven Rostedt wrote: > > > From: "Steven Rostedt (Google)" > > > > The check for knowing if the poll should wait or not is basically the > > exact same logic as rb_watermark_hit(). The only difference is that > > rb_watermark_hit() also handles the !full case. But for the full case, the > > logic is the same. Just call that instead of duplicating the code in > > ring_buffer_poll_wait(). > > > > This changes a bit (e.g. adding pagebusy check) but basically that should > be there. And new version appears to be consistent between ring_buffer_wait() > and ring_buffer_poll_wait(). So looks good to me. The pagebusy check is an optimization. As if it is true, it means the writer is still on the reader_page and there's no sub-buffers available. It just prevents having to do the calculation of the buffer-percentage filled (what's done by the full_hit() logic). > > Reviewed-by: Masami Hiramatsu (Google) > Thanks! -- Steve
Re: [PATCH v12 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 3/12/24 7:10 AM, Krzysztof Kozlowski wrote: > On 11/03/2024 17:27, Tanmay Shah wrote: > >>> +then: > >>> + patternProperties: > >>> +"^r5f@[0-9a-f]+$": > >>> + type: object > >>> + > >>> + properties: > >>> +reg: > >>> + minItems: 1 > >>> + items: > >>> +- description: ATCM internal memory > >>> +- description: BTCM internal memory > >>> +- description: extra ATCM memory in lockstep mode > >>> +- description: extra BTCM memory in lockstep mode > >>> + > >>> +reg-names: > >>> + minItems: 1 > >>> + items: > >>> +- const: atcm0 > >>> +- const: btcm0 > >>> +- const: atcm1 > >>> +- const: btcm1 > >> > >> Why power domains are flexible? > > > > User may not want to use all the TCMs. For example, if users want to > > turn-on only TCM-A and rest of them want to keep off, then > > > > they can avoid having power-domains of other TCMs in the device-tree. This > > helps with less power-consumption when needed. > > > > Hence flexible list of power-domains list. > > > > Isn't turning on/off driver's job? Sorry, but what is "user" here? DTS > describes bindings, not OS policy. Thanks for the reviews. Correct, the driver turns the TCMs on and off. However, system designers (users) have the option to not include a TCM that is not needed in the device tree. So power-domains is flexible, the same as reg and reg-names. ATCM is always needed, as the vector table is in ATCM. The R5 core power domain and the ATCM power domain for each core are always required, hence minItems 2. > Also, please wrap your replies to match email style. > > > I can certainly mention "items:" under power-domains property. > > > > > >> > > > Best regards, > Krzysztof >
[PATCH] ring-buffer: Do not set shortest_full when full target is hit
From: "Steven Rostedt (Google)" The rb_watermark_hit() checks if the amount of data in the ring buffer is above the percentage level passed in by the "full" variable. If it is, it returns true. But it also sets the "shortest_full" field of the cpu_buffer that informs writers that it needs to call the irq_work if the amount of data on the ring buffer is above the requested amount. The rb_watermark_hit() always sets the shortest_full even if the amount in the ring buffer is what it wants. As it is not going to wait, because it has what it wants, there's no reason to set shortest_full. Cc: sta...@vger.kernel.org Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 9b887d44b8d9..350607cce869 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer *buffer, int cpu, int full) pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; ret = !pagebusy && full_hit(buffer, cpu, full); - if (!cpu_buffer->shortest_full || - cpu_buffer->shortest_full > full) - cpu_buffer->shortest_full = full; + if (!ret && (!cpu_buffer->shortest_full || +cpu_buffer->shortest_full > full)) { + cpu_buffer->shortest_full = full; + } raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); } return ret; -- 2.43.0
Re: [PATCH] ring-buffer: Do not set shortest_full when full target is hit
On Tue, 12 Mar 2024 11:56:41 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The rb_watermark_hit() checks if the amount of data in the ring buffer is > above the percentage level passed in by the "full" variable. If it is, it > returns true. > > But it also sets the "shortest_full" field of the cpu_buffer that informs > writers that it needs to call the irq_work if the amount of data on the > ring buffer is above the requested amount. > > The rb_watermark_hit() always sets the shortest_full even if the amount in > the ring buffer is what it wants. As it is not going to wait, because it > has what it wants, there's no reason to set shortest_full. > Yeah, it should avoid setting if !ret. Looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thank you, > Cc: sta...@vger.kernel.org > Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark") > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ring_buffer.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 9b887d44b8d9..350607cce869 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -834,9 +834,10 @@ static bool rb_watermark_hit(struct trace_buffer > *buffer, int cpu, int full) > pagebusy = cpu_buffer->reader_page == cpu_buffer->commit_page; > ret = !pagebusy && full_hit(buffer, cpu, full); > > - if (!cpu_buffer->shortest_full || > - cpu_buffer->shortest_full > full) > - cpu_buffer->shortest_full = full; > + if (!ret && (!cpu_buffer->shortest_full || > + cpu_buffer->shortest_full > full)) { > + cpu_buffer->shortest_full = full; > + } > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); > } > return ret; > -- > 2.43.0 > -- Masami Hiramatsu (Google)
Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver
On Mon, 11 Mar 2024 at 05:44, Abdellatif El Khlifi wrote: > > Hi Mathieu, > > On Fri, Mar 08, 2024 at 09:44:26AM -0700, Mathieu Poirier wrote: > > On Thu, 7 Mar 2024 at 12:40, Abdellatif El Khlifi > > wrote: > > > > > > Hi Mathieu, > > > > > > > > + do { > > > > > + state_reg = readl(priv->reset_cfg.state_reg); > > > > > + *rst_ack = EXTSYS_RST_ST_RST_ACK(state_reg); > > > > > + > > > > > + if (*rst_ack == EXTSYS_RST_ACK_RESERVED) { > > > > > + dev_err(dev, "unexpected RST_ACK value: 0x%x\n", > > > > > + *rst_ack); > > > > > + return -EINVAL; > > > > > + } > > > > > + > > > > > + /* expected ACK value read */ > > > > > + if ((*rst_ack & exp_ack) || (*rst_ack == exp_ack)) > > > > > > > > I'm not sure why the second condition in this if() statement is needed. > > > > As far > > > > as I can tell the first condition will trigger and the second one won't > > > > be > > > > reached. > > > > > > The second condition takes care of the following: exp_ack and *rst_ack > > > are both 0. > > > This case happens when RST_REQ bit is cleared (meaning: No reset > > > requested) and > > > we expect the RST_ACK to be 00 afterwards. > > > > > > > This is the kind of conditions that definitely deserve documentation. > > Please split the conditions in two different if() statements and add a > > comment to explain what is going on. > > Thanks, I'll address that. > > > > > > > > +/** > > > > > + * arm_rproc_load() - Load firmware to memory function for rproc_ops > > > > > + * @rproc: pointer to the remote processor object > > > > > + * @fw: pointer to the firmware > > > > > + * > > > > > + * Does nothing currently. > > > > > + * > > > > > + * Return: > > > > > + * > > > > > + * 0 for success. > > > > > + */ > > > > > +static int arm_rproc_load(struct rproc *rproc, const struct firmware > > > > > *fw) > > > > > +{ > > > > > > > > What is the point of doing rproc_of_parse_firmware() if the firmware > > > > image is > > > > not loaded to memory? Does the remote processor have some kind of > > > > default ROM > > > > image to run if it doesn't find anything in memory? > > > > > > Yes, the remote processor has a default FW image already loaded by > > > default. > > > > > > > That too would have mandated a comment - otherwise people looking at > > the code are left wondering, as I did. > > > > > rproc_boot() [1] and _request_firmware() [2] fail if there is no FW file > > > in the filesystem or a filename > > > provided. > > > > > > Please correct me if I'm wrong. > > > > You are correct, the remoteproc subsystem expects a firmware image to > > be provided _and_ loaded into memory. Providing a dummy image just to > > get the remote processor booted is a hack, but simply because the > > subsystem isn't tailored to handle this use case. So I am left > > wondering what the plans are for this driver, i.e is this a real > > scenario that needs to be addressed or just an initial patchset to get > > a foundation for the driver. > > > > In the former case we need to start talking about refactoring the > > subsystem so that it properly handles remote processors that don't > > need a firmware image. In the latter case I'd rather see a patchset > > where the firmware image is loaded into RAM. > > This is an initial patchset for allowing to turn on and off the remote > processor. > The FW is already loaded before the Corstone-1000 SoC is powered on and this > is done through the FPGA board bootloader in case of the FPGA target. Or by > the Corstone-1000 FVP model > (emulator). 
> >From the above I take it that booting with a preloaded firmware is a scenario that needs to be supported and not just a temporary stage. > The plan for the driver is as follows: > > Step 1: provide a foundation driver capable of turning the core on/off > > Step 2: provide mailbox support for comms > > Step 3: provide FW reload capability > What happens when a user wants to boot the remote processor with the firmware provided on the file system rather than the one preloaded into memory? Furthermore, how do we account for scenarios where the remote processor goes from running a firmware image on the file system to the firmware image loaded by an external entity? Is this a valid scenario? > Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) can > share memory with > the remote core. > > I'm happy to provide more explanation in the commit log to reflect this > status. > > Is it OK that we go with step 1 as a foundation please ? > First let's clarify all the scenarios that need to be supported. From there I will advise on how to proceed and what modifications to the subsystem's core should be made, if need be. > Cheers > Abdellatif
Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes
On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote: > On 08.03.24 13:33, David Woodhouse wrote: > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote: > > > On 07.03.24 15:02, David Woodhouse wrote: > > > > Hm, should we allow UTC? If you tell me the time in UTC, then > > > > (sometimes) I still don't actually know what the time is, because some > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI > > > > offset, surely? Should the virtio_rtc specification make it mandatory > > > > to provide such? > > > > > > > > Otherwise you're just designing it to allow crappy hypervisors to > > > > expose incomplete information. > > > > > > > > > > Hi David, > > > > > > (adding virtio-comm...@lists.oasis-open.org for spec discussion), > > > > > > thank you for your insightful comments. I think I take a broadly similar > > > view. The reason why the current spec and driver is like this is that I > > > took a pragmatic approach at first and only included features which work > > > out-of-the-box for the current Linux ecosystem. > > > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore > > > can work out-of-the-box with time sync daemons such as chrony. > > > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock > > > as well, I am afraid that > > > > > > - in some (embedded) scenarios, the TAI clock may not be available > > > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock. > > > > > > For the same reasons, I am also not sure about adding a *mandatory* TAI > > > offset to each readout. I don't know user-space software which would > > > leverage this already (at least not through the PTP clock interface). > > > And why would such software not go straight for the TAI clock instead? > > > > > > How about adding a requirement to the spec that the virtio-rtc device > > > SHOULD expose the TAI clock whenever it is available - would this > > > address your concerns? > > > > I think that would be too easy for implementors to miss, or decide not > > to obey. Or to get *wrong*, by exposing a TAI clock but actually > > putting UTC in it. > > > > I think I prefer to mandate the tai_offset field with the UTC clock. > > Crappy implementations will just set it to zero, but at least that > > gives a clear signal to the guests that it's *their* problem to > > resolve. > > To me there are some open questions regarding how this would work. Is there > a use case for this with the v3 clock reading methods, or would it be > enough to address this with the Virtio timekeeper? > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably > best alongside some additional information about leap seconds. I am not > aware about any user-space user. In addition, leap second smearing should > also be addressed. > Is there even a standard yet for leap-smearing? Will it be linear over 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I think is what Google does? Meta does something different again, don't they? Exposing UTC as the only clock reference is bad enough; when leap seconds happen there's a whole second during which you don't *know* which second it is. It seems odd to me, for a precision clock to be deliberately ambiguous about what the time is! But if the virtio-rtc clock is defined as UTC and then expose something *different* in it, that's even worse. You potentially end up providing inaccurate time for a whole *day* leading up to the leap second. 
I think you're right that leap second smearing should be addressed. At the very least, by making it clear that the virtio-rtc clock which advertises UTC shall be used *only* for UTC, never UTC-SLS or any other yet-to-be-defined variant. Please make it explicit that any hypervisor which wants to advertise a smeared clock shall define a new type which specifies the precise smearing algorithm and cannot be conflated with the one you're defining here. > > One other thing to note is I think we're being very naïve about the TSC > > on x86 hosts. Theoretically, the TSC for every vCPU might run at a > > different frequency, and even if they run at the same frequency they > > might be offset from each other. I'm happy to be naïve but I think we > > should be *explicitly* so, and just say for example that it's defined > > against vCPU0 so if other vCPUs are different then all bets are off. > > ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you > have an opinion on how to represent this in a platform-independent way. Well, it doesn't have a notion of TSCs either; you include that by implicit reference don't you?
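For readers unfamiliar with smearing, a toy sketch of a linear smear in the spirit of UTC-SLS follows; the 1000-second window, the sign convention (a single inserted leap second), and the helper name are assumptions for illustration only, not anything the draft spec defines.

	#include <stdint.h>

	#define NSEC_PER_SEC	1000000000LL
	#define SMEAR_WINDOW_NS	(1000LL * NSEC_PER_SEC)	/* last 1000 s of the UTC day */

	/*
	 * How far (in ns) a linearly smeared clock lags behind unsmeared UTC,
	 * given how much of the UTC day has elapsed. One inserted second is
	 * slewed out at 1 ns per 1000 ns across the window, so the lag ramps
	 * from 0 to a full second by the end of the day.
	 */
	static int64_t smear_lag_ns(int64_t day_elapsed_ns, int64_t day_len_ns)
	{
		int64_t start = day_len_ns - SMEAR_WINDOW_NS;

		if (day_elapsed_ns <= start)
			return 0;
		return (day_elapsed_ns - start) / 1000;
	}

Whatever the exact curve, the point above stands: a guest cannot distinguish a smeared clock from true UTC unless the device explicitly says which one it exposes.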
Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver
Hi Mathieu, On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote: > > This is an initial patchset for allowing to turn on and off the remote > > processor. > > The FW is already loaded before the Corstone-1000 SoC is powered on and this > > is done through the FPGA board bootloader in case of the FPGA target. Or by > > the Corstone-1000 FVP model > > (emulator). > > > From the above I take it that booting with a preloaded firmware is a > scenario that needs to be supported and not just a temporary stage. The current status of the Corstone-1000 SoC requires that there is a preloaded firmware for the external core. Preloading is done externally either through the FPGA bootloader or the emulator (FVP) before powering on the SoC. Corstone-1000 will be upgraded so that the A core running Linux is able to share memory with the remote core and is also able to access the remote core's memory, so that Linux can copy the firmware to it. These HW changes are still under development. This is why this patchset is relying on a preloaded firmware, and it is step 1 of adding remoteproc support for Corstone. When the HW is ready, we will be able to avoid preloading the firmware and the user can do the following: 1) Use a default firmware filename stated in the DT (firmware-name property); that's the one the remoteproc subsystem will use initially to load the firmware file and start the remote core. 2) Then, the user can choose to use another firmware file: echo stop >/sys/class/remoteproc/remoteproc0/state echo -n new_firmware.elf > /sys/class/remoteproc/remoteproc0/firmware echo start >/sys/class/remoteproc/remoteproc0/state > > The plan for the driver is as follows: > > > > Step 1: provide a foundation driver capable of turning the core on/off > > > > Step 2: provide mailbox support for comms > > > > Step 3: provide FW reload capability > > > What happens when a user wants to boot the remote processor with the > firmware provided on the file system rather than the one preloaded > into memory? We will support this scenario when the HW is upgraded and copying the firmware to the remote core memory becomes possible. > Furthermore, how do we account for scenarios where the > remote processor goes from running a firmware image on the file system > to the firmware image loaded by an external entity? Is this a valid > scenario? No, this scenario won't apply when we get the HW upgrade. No need for an external entity anymore. The firmware(s) will all be files in the Linux filesystem. > > Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) > > can share memory with > > the remote core. > > > > I'm happy to provide more explanation in the commit log to reflect this > > status. > > > > Is it OK that we go with step 1 as a foundation please ? > > First let's clarify all the scenarios that need to be supported. From > there I will advise on how to proceed and what modifications to the > subsystem's core should be made, if need be. Thanks, I hope the answers above provide the information needed. Cheers Abdellatif
Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 3/12/24 7:13 AM, Krzysztof Kozlowski wrote: > On 11/03/2024 18:59, Tanmay Shah wrote: >> From: Radhey Shyam Pandey >> >> Introduce bindings for TCM memory address space on AMD-xilinx Zynq >> UltraScale+ platform. It will help in defining TCM in device-tree >> and make it's access platform agnostic and data-driven. >> >> Tightly-coupled memories(TCMs) are low-latency memory that provides >> predictable instruction execution and predictable data load/store >> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory >> banks on the ATCM and BTCM ports, for a total of 128 KB of memory. >> >> The TCM resources(reg, reg-names and power-domain) are documented for >> each TCM in the R5 node. The reg and reg-names are made as required >> properties as we don't want to hardcode TCM addresses for future >> platforms and for zu+ legacy implementation will ensure that the >> old dts w/o reg/reg-names works and stable ABI is maintained. >> >> It also extends the examples for TCM split and lockstep modes. >> >> Signed-off-by: Radhey Shyam Pandey >> Signed-off-by: Tanmay Shah >> --- >> >> Changes in v13: >> - Have power-domains property for lockstep case instead of >> keeping it flexible. >> - Add "items:" list in power-domains property > > > Reviewed-by: Krzysztof Kozlowski Hi Krzysztof, Thanks for RB. I provided explanation of flexible power-domains in previous patchset. I am happy to send new revision removing minItems if you dis-agree. Thanks. > > Best regards, > Krzysztof >
Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 12/03/2024 18:42, Tanmay Shah wrote: > > > On 3/12/24 7:13 AM, Krzysztof Kozlowski wrote: >> On 11/03/2024 18:59, Tanmay Shah wrote: >>> From: Radhey Shyam Pandey >>> >>> Introduce bindings for TCM memory address space on AMD-xilinx Zynq >>> UltraScale+ platform. It will help in defining TCM in device-tree >>> and make it's access platform agnostic and data-driven. >>> >>> Tightly-coupled memories(TCMs) are low-latency memory that provides >>> predictable instruction execution and predictable data load/store >>> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory >>> banks on the ATCM and BTCM ports, for a total of 128 KB of memory. >>> >>> The TCM resources(reg, reg-names and power-domain) are documented for >>> each TCM in the R5 node. The reg and reg-names are made as required >>> properties as we don't want to hardcode TCM addresses for future >>> platforms and for zu+ legacy implementation will ensure that the >>> old dts w/o reg/reg-names works and stable ABI is maintained. >>> >>> It also extends the examples for TCM split and lockstep modes. >>> >>> Signed-off-by: Radhey Shyam Pandey >>> Signed-off-by: Tanmay Shah >>> --- >>> >>> Changes in v13: >>> - Have power-domains property for lockstep case instead of >>> keeping it flexible. >>> - Add "items:" list in power-domains property >> >> >> Reviewed-by: Krzysztof Kozlowski > > Hi Krzysztof, > > Thanks for RB. I provided explanation of flexible power-domains in > previous patchset. I am happy to send new revision removing > minItems if you dis-agree. Thanks for the explanation, it sounds fine, thus patch LGTM. Best regards, Krzysztof
Re: [PATCH 1/3] ARM: dts: qcom: msm8974-sony-castor: Split into shinano-common
On 3/10/24 12:41, Luca Weiss wrote: In preparation for adding the Sony Xperia Z3 smartphone, split the common parts into shinano-common.dtsi. No functional change intended. Signed-off-by: Luca Weiss --- I give you the green light to also add newlines between subsequent subnodes (e.g. under blsp2_i2c5) while at it ;) Reviewed-by: Konrad Dybcio Konrad
Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem
Vishal Verma wrote: > The dax driver incorrectly used driver-core device locks to protect > internal dax region and dax device configuration structures. Replace the > device lock usage with a local rwsem, one each for dax region > configuration and dax device configuration. As a result of this > conversion, no device_lock() usage remains in dax/bus.c. > > Cc: Dan Williams > Reported-by: Greg Kroah-Hartman > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 220 > ++ > 1 file changed, 157 insertions(+), 63 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 1ff1ab5fa105..cb148f74ceda 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -12,6 +12,18 @@ > > static DEFINE_MUTEX(dax_bus_lock); > > +/* > + * All changes to the dax region configuration occur with this lock held > + * for write. > + */ > +DECLARE_RWSEM(dax_region_rwsem); > + > +/* > + * All changes to the dax device configuration occur with this lock held > + * for write. > + */ > +DECLARE_RWSEM(dax_dev_rwsem); > + > #define DAX_NAME_LEN 30 > struct dax_id { > struct list_head list; > @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax) > u64 size = 0; > int i; > > - device_lock_assert(&dev_dax->dev); > + WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem)); Apologies for the late review, but... All of these WARN_ON_ONCE() usages should be replaced with lockdep_assert_held() and lockdep_assert_held_write() where appropriate.
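For reference, the shape of the suggested assertions, assuming the dax_dev_rwsem/dax_region_rwsem names from the quoted hunk; this is a fragment, not a complete patch:

	/* Instead of: WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem)); */
	lockdep_assert_held(&dax_dev_rwsem);		/* held for read or write */

	/* On paths that modify the region configuration: */
	lockdep_assert_held_write(&dax_region_rwsem);

Both compile away when lockdep is disabled and WARN when the locking rule is violated, which is what makes them preferable to an open-coded rwsem_is_locked() check.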
Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem
On Tue, 2024-03-12 at 13:07 -0700, Dan Williams wrote: > Vishal Verma wrote: > > The dax driver incorrectly used driver-core device locks to protect > > internal dax region and dax device configuration structures. Replace the > > device lock usage with a local rwsem, one each for dax region > > configuration and dax device configuration. As a result of this > > conversion, no device_lock() usage remains in dax/bus.c. > > > > Cc: Dan Williams > > Reported-by: Greg Kroah-Hartman > > Signed-off-by: Vishal Verma > > --- > > drivers/dax/bus.c | 220 > > ++ > > 1 file changed, 157 insertions(+), 63 deletions(-) > > > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > > index 1ff1ab5fa105..cb148f74ceda 100644 > > --- a/drivers/dax/bus.c > > +++ b/drivers/dax/bus.c > > @@ -12,6 +12,18 @@ > > > > static DEFINE_MUTEX(dax_bus_lock); > > > > +/* > > + * All changes to the dax region configuration occur with this lock held > > + * for write. > > + */ > > +DECLARE_RWSEM(dax_region_rwsem); > > + > > +/* > > + * All changes to the dax device configuration occur with this lock held > > + * for write. > > + */ > > +DECLARE_RWSEM(dax_dev_rwsem); > > + > > #define DAX_NAME_LEN 30 > > struct dax_id { > > struct list_head list; > > @@ -180,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax) > > u64 size = 0; > > int i; > > > > - device_lock_assert(&dev_dax->dev); > > + WARN_ON_ONCE(!rwsem_is_locked(&dax_dev_rwsem)); > > Apologies for the late review, but... > > All of these WARN_ON_ONCE() usages should be replaced with > lockdep_assert_held() and lockdep_assert_held_write() where appropriate. Makes sense - I can send a patch post -rc1 to change these if that's okay Andrew?
Re: [PATCH v7 1/5] dax/bus.c: replace driver-core lock usage by a local rwsem
On Tue, 12 Mar 2024 20:20:17 + "Verma, Vishal L" wrote: > > All of these WARN_ON_ONCE() usages should be replaced with > > lockdep_assert_held() and lockdep_assert_held_write() where appropriate. > > Makes sense - I can send a patch post -rc1 to change these if that's okay > Andrew? Please do.
[PATCH bpf-next 0/3] uprobes: two common case speed ups
This patch set implements two speed ups for uprobe/uretprobe runtime execution path for some common scenarios: BPF-only uprobes (patches #1 and #2) and system-wide (non-PID-specific) uprobes (patch #3). Please see individual patches for details. Given I haven't worked with uprobe code before, I'm unfamiliar with conventions in this subsystem, including which kernel tree patches should be sent to. For now I based all the changes on top of bpf-next/master, which is where I tested and benchmarked everything anyways. Please advise what should I use as a base for subsequent revision. Thanks. Andrii Nakryiko (3): uprobes: encapsulate preparation of uprobe args buffer uprobes: prepare uprobe args buffer lazily uprobes: add speculative lockless system-wide uprobe filter check kernel/trace/trace_uprobe.c | 103 ++-- 1 file changed, 63 insertions(+), 40 deletions(-) -- 2.43.0
[PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer
Move the logic of fetching temporary per-CPU uprobe buffer and storing uprobes args into it to a new helper function. Store data size as part of this buffer, simplifying interfaces a bit, as now we only pass single uprobe_cpu_buffer reference around, instead of pointer + dsize. This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher, and now will be centralized. All this is also in preparation to make this uprobe_cpu_buffer handling logic optional in the next patch. Signed-off-by: Andrii Nakryiko --- kernel/trace/trace_uprobe.c | 75 - 1 file changed, 41 insertions(+), 34 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index a84b85d8aac1..a0f60bb10158 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = { struct uprobe_cpu_buffer { struct mutex mutex; void *buf; + int dsize; }; static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer; static int uprobe_buffer_refcnt; @@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer *ucb) mutex_unlock(&ucb->mutex); } +static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu, + struct pt_regs *regs) +{ + struct uprobe_cpu_buffer *ucb; + int dsize, esize; + + esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu)); + dsize = __get_data_size(&tu->tp, regs); + + ucb = uprobe_buffer_get(); + ucb->dsize = dsize; + + store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize); + + return ucb; +} + static void __uprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, - struct uprobe_cpu_buffer *ucb, int dsize, + struct uprobe_cpu_buffer *ucb, struct trace_event_file *trace_file) { struct uprobe_trace_entry_head *entry; @@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu, WARN_ON(call != trace_file->event_call); - if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE)) + if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE)) return; if (trace_trigger_soft_disabled(trace_file)) return; esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu)); - size = esize + tu->tp.size + dsize; + size = esize + tu->tp.size + ucb->dsize; entry = trace_event_buffer_reserve(&fbuffer, trace_file, size); if (!entry) return; @@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu, data = DATAOF_TRACE_ENTRY(entry, false); } - memcpy(data, ucb->buf, tu->tp.size + dsize); + memcpy(data, ucb->buf, tu->tp.size + ucb->dsize); trace_event_buffer_commit(&fbuffer); } /* uprobe handler */ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, -struct uprobe_cpu_buffer *ucb, int dsize) +struct uprobe_cpu_buffer *ucb) { struct event_file_link *link; @@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, rcu_read_lock(); trace_probe_for_each_link_rcu(link, &tu->tp) - __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file); + __uprobe_trace_func(tu, 0, regs, ucb, link->file); rcu_read_unlock(); return 0; @@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, -struct uprobe_cpu_buffer *ucb, int dsize) +struct uprobe_cpu_buffer *ucb) { struct event_file_link *link; rcu_read_lock(); trace_probe_for_each_link_rcu(link, &tu->tp) - __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file); + __uprobe_trace_func(tu, func, regs, ucb, link->file); 
rcu_read_unlock(); } @@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, static void __uprobe_perf_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, - struct uprobe_cpu_buffer *ucb, int dsize) + struct uprobe_cpu_buffer *ucb) { struct trace_event_call *call = trace_probe_event_call(&tu->tp); struct uprobe_trace_entry_head *entry; @@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu, esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu)); - size = esize + tu->tp.size + d
[PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily
uprobe_cpu_buffer and corresponding logic to store uprobe args into it are used for uprobes/uretprobes that are created through tracefs or perf events. BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't need uprobe_cpu_buffer and associated data. For BPF-only use cases this buffer handling and preparation is a pure overhead. At the same time, BPF-only uprobe/uretprobe usage is very common in practice. Also, for a lot of cases applications are very sensitive to performance overheads, as they might be tracing very high frequency functions like malloc()/free(), so every bit of performance improvement matters. All that is to say that this uprobe_cpu_buffer preparation is an unnecessary overhead that each BPF user of uprobes/uretprobes has to pay. This patch is changing this by making uprobe_cpu_buffer preparation optional. It will happen only if either a tracefs-based or perf event-based uprobe/uretprobe consumer is registered for a given uprobe/uretprobe. For BPF-only use cases this step will be skipped. We used the uprobe/uretprobe benchmark which is part of BPF selftests (see [0]) to estimate the improvements. We have 3 uprobe and 3 uretprobe scenarios, which vary the instruction that is replaced by the uprobe: nop (fastest uprobe case), `push rbp` (typical case), and non-simulated `ret` instruction (slowest case). The benchmark thread constantly calls a user space function in a tight loop. The user space function has an attached BPF uprobe or uretprobe program doing nothing but atomic counter increments to count the number of triggering calls. The benchmark emits throughput in millions of executions per second. BEFORE these changes uprobe-nop :2.657 ± 0.024M/s uprobe-push:2.499 ± 0.018M/s uprobe-ret :1.100 ± 0.006M/s uretprobe-nop :1.356 ± 0.004M/s uretprobe-push :1.317 ± 0.019M/s uretprobe-ret :0.785 ± 0.007M/s AFTER these changes === uprobe-nop :2.732 ± 0.022M/s (+2.8%) uprobe-push:2.621 ± 0.016M/s (+4.9%) uprobe-ret :1.105 ± 0.007M/s (+0.5%) uretprobe-nop :1.396 ± 0.007M/s (+2.9%) uretprobe-push :1.347 ± 0.008M/s (+2.3%) uretprobe-ret :0.800 ± 0.006M/s (+1.9%) So the improvements on this particular machine seem to be between 2% and 5%.
[0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c Signed-off-by: Andrii Nakryiko --- kernel/trace/trace_uprobe.c | 56 ++--- 1 file changed, 34 insertions(+), 22 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index a0f60bb10158..f2875349d124 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -963,15 +963,22 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu, static void __uprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, - struct uprobe_cpu_buffer *ucb, + struct uprobe_cpu_buffer **ucbp, struct trace_event_file *trace_file) { struct uprobe_trace_entry_head *entry; struct trace_event_buffer fbuffer; + struct uprobe_cpu_buffer *ucb; void *data; int size, esize; struct trace_event_call *call = trace_probe_event_call(&tu->tp); + ucb = *ucbp; + if (!ucb) { + ucb = prepare_uprobe_buffer(tu, regs); + *ucbp = ucb; + } + WARN_ON(call != trace_file->event_call); if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE)) @@ -1002,7 +1009,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu, /* uprobe handler */ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, -struct uprobe_cpu_buffer *ucb) +struct uprobe_cpu_buffer **ucbp) { struct event_file_link *link; @@ -1011,7 +1018,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, rcu_read_lock(); trace_probe_for_each_link_rcu(link, &tu->tp) - __uprobe_trace_func(tu, 0, regs, ucb, link->file); + __uprobe_trace_func(tu, 0, regs, ucbp, link->file); rcu_read_unlock(); return 0; @@ -1019,13 +1026,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs, static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func, struct pt_regs *regs, -struct uprobe_cpu_buffer *ucb) +struct uprobe_cpu_buffer **ucbp) { struct event_file_link *link; rcu_read_lock(); trace_probe_for_each_link_rcu(link, &tu->tp) - __uprobe_trace_func(tu, func, regs, ucb, link->file); + __uprobe_trace_func(tu, func, regs, ucbp, link-
[PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check
It's very common with BPF-based uprobe/uretprobe use cases to have a system-wide (not PID specific) probes used. In this case uprobe's trace_uprobe_filter->nr_systemwide counter is bumped at registration time, and actual filtering is short circuited at the time when uprobe/uretprobe is triggered. This is a great optimization, and the only issue with it is that to even get to checking this counter uprobe subsystem is taking read-side trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is just another point of contention when uprobe is triggered on multiple CPUs simultaneously. This patch adds a speculative check before grabbing that rwlock. If nr_systemwide is non-zero, lock is skipped and event is passed through. >From examining existing logic it looks correct and safe to do. If nr_systemwide is being modified under rwlock in parallel, we have to consider basically just one important race condition: the case when nr_systemwide is dropped from one to zero (from trace_uprobe_filter_remove()) under filter->rwlock, but uprobe_perf_filter() raced and saw it as >0. In this case, we'll proceed with uprobe/uretprobe execution, while uprobe_perf_close() and uprobe_apply() will be blocked on trying to grab uprobe->register_rwsem as a writer. It will be blocked because uprobe_dispatcher() (and, similarly, uretprobe_dispatcher()) runs with uprobe->register_rwsem taken as a reader. So there is no real race besides uprobe/uretprobe might execute one last time before it's removed, which is fine because from user space perspective uprobe/uretprobe hasn't been yet deactivated. In case we speculatively read nr_systemwide as zero, while it was incremented in parallel, we'll proceed to grabbing filter->rwlock and re-doing the check, this time in lock-protected and non-racy way. As such, it looks safe to do a quick short circuiting check and save some performance in a very common system-wide case, not sacrificing hot path performance due to much rarer possibility of registration or unregistration of uprobes. Again, confirming with BPF selftests's based benchmarks. BEFORE (based on changes in previous patch) === uprobe-nop :2.732 ± 0.022M/s uprobe-push:2.621 ± 0.016M/s uprobe-ret :1.105 ± 0.007M/s uretprobe-nop :1.396 ± 0.007M/s uretprobe-push :1.347 ± 0.008M/s uretprobe-ret :0.800 ± 0.006M/s AFTER = uprobe-nop :2.878 ± 0.017M/s (+5.5%, total +8.3%) uprobe-push:2.753 ± 0.013M/s (+5.3%, total +10.2%) uprobe-ret :1.142 ± 0.010M/s (+3.8%, total +3.8%) uretprobe-nop :1.444 ± 0.008M/s (+3.5%, total +6.5%) uretprobe-push :1.410 ± 0.010M/s (+4.8%, total +7.1%) uretprobe-ret :0.816 ± 0.002M/s (+2.0%, total +3.9%) In the above, first percentage value is based on top of previous patch (lazy uprobe buffer optimization), while the "total" percentage is based on kernel without any of the changes in this patch set. As can be seen, we get about 4% - 10% speed up, in total, with both lazy uprobe buffer and speculative filter check optimizations. 
Signed-off-by: Andrii Nakryiko --- kernel/trace/trace_uprobe.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index f2875349d124..be28e6d0578e 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, tu = container_of(uc, struct trace_uprobe, consumer); filter = tu->tp.event->filter; + /* speculative check */ + if (READ_ONCE(filter->nr_systemwide)) + return true; + read_lock(&filter->rwlock); ret = __uprobe_perf_filter(filter, mm); read_unlock(&filter->rwlock); -- 2.43.0
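The hunk above is an instance of a more general "speculative read, then confirm under the lock" shape. A userspace sketch of that shape follows; the struct, names, and the pthread/stdatomic flavor are illustrative, not the tracing code (which uses READ_ONCE() and a kernel rwlock).

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>

	struct filter {
		pthread_rwlock_t lock;
		atomic_int nr_systemwide;	/* plays the role of filter->nr_systemwide */
	};

	static bool filter_match(struct filter *f)
	{
		bool ret;

		/*
		 * Speculative check: a racy "yes" is benign (at worst one extra
		 * event fires, as the commit message argues), so no lock is
		 * taken in the common system-wide case.
		 */
		if (atomic_load_explicit(&f->nr_systemwide, memory_order_relaxed))
			return true;

		/* A racy "no" must be re-checked under the lock before being trusted. */
		pthread_rwlock_rdlock(&f->lock);
		ret = atomic_load_explicit(&f->nr_systemwide, memory_order_relaxed) > 0;
		/* ... per-task/mm list lookup would go here ... */
		pthread_rwlock_unlock(&f->lock);

		return ret;
	}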
[PATCH v4] ring-buffer: Have mmapped ring buffer keep track of missed events
From: "Steven Rostedt (Google)" While testing libtracefs on the mmapped ring buffer, the test that checks if missed events are accounted for failed when using the mapped buffer. This is because the mapped page does not update the missed events that were dropped because the writer filled up the ring buffer before the reader could catch it. Add the missed events to the reader page/sub-buffer when the IOCTL is done and a new reader page is acquired. Note that all accesses to the reader_page via rb_page_commit() had to be switched to rb_page_size(), and rb_page_size() which was just a copy of rb_page_commit() but now it masks out the RB_MISSED bits. This is needed as the mapped reader page is still active in the ring buffer code and where it reads the commit field of the bpage for the size, it now must mask it otherwise the missed bits that are now set will corrupt the size returned. Signed-off-by: Steven Rostedt (Google) --- Changes since v3: https://lore.kernel.org/linux-trace-kernel/20240109212702.5b8e6...@gandalf.local.home - Almost forgot to add this, but one of my libtracefs tests failed due to it. - Rebased against the latest code. - Removed the zeroing out at the end of the page as the ring buffer is now zeroed out when allocated. Not too worried about stale tracing data, but having non tracing data would be a problem. kernel/trace/ring_buffer.c | 53 +- 1 file changed, 47 insertions(+), 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 13250856a3a8..abe21c47fb49 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -313,6 +313,8 @@ static u64 rb_event_time_stamp(struct ring_buffer_event *event) /* Missed count stored at end */ #define RB_MISSED_STORED (1 << 30) +#define RB_MISSED_MASK (3 << 30) + struct buffer_data_page { u64 time_stamp;/* page time stamp */ local_t commit;/* write committed index */ @@ -2274,7 +2276,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter) /* Size is determined by what has been committed */ static __always_inline unsigned rb_page_size(struct buffer_page *bpage) { - return rb_page_commit(bpage); + return rb_page_commit(bpage) & ~RB_MISSED_MASK; } static __always_inline unsigned @@ -3901,7 +3903,7 @@ static bool rb_per_cpu_empty(struct ring_buffer_per_cpu *cpu_buffer) return true; /* Reader should exhaust content in reader page */ - if (reader->read != rb_page_commit(reader)) + if (reader->read != rb_page_size(reader)) return false; /* @@ -4372,7 +4374,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter) return ((iter->head_page == commit_page && iter->head >= commit) || (iter->head_page == reader && commit_page == head_page && head_page->read == commit && -iter->head == rb_page_commit(cpu_buffer->reader_page))); +iter->head == rb_page_size(cpu_buffer->reader_page))); } EXPORT_SYMBOL_GPL(ring_buffer_iter_empty); @@ -5701,7 +5703,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, event = rb_reader_event(cpu_buffer); read = reader->read; - commit = rb_page_commit(reader); + commit = rb_page_size(reader); /* Check if any events were dropped */ missed_events = cpu_buffer->lost_events; @@ -5778,7 +5780,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, } else { /* update the entry counter */ cpu_buffer->read += rb_page_entries(reader); - cpu_buffer->read_bytes += rb_page_commit(reader); + cpu_buffer->read_bytes += rb_page_size(reader); /* swap the pages */ rb_init_page(bpage); @@ -6331,6 +6333,8 @@ struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, 
int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) { struct ring_buffer_per_cpu *cpu_buffer; + struct buffer_page *reader; + unsigned long missed_events; unsigned long reader_size; unsigned long flags; @@ -6356,9 +6360,46 @@ int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) goto out; } - if (WARN_ON(!rb_get_reader_page(cpu_buffer))) + reader = rb_get_reader_page(cpu_buffer); + if (WARN_ON(!reader)) goto out; + /* Check if any events were dropped */ + missed_events = cpu_buffer->lost_events; + + if (cpu_buffer->reader_page != cpu_buffer->commit_page) { + if (missed_events) { + struct buffer_data_page *bpage = reader->page; + unsigned int commit; + /* +* Use the real_end for the data size, +* This gives us a chance to
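As a side note on the RB_MISSED bits handled above: the commit word doubles
as "bytes committed" plus a couple of flag bits in its top bits, so any
consumer of that word as a byte count must mask the flags off, which is
exactly what the rb_page_size() change does. A tiny standalone sketch of
that encoding (the constants and names are illustrative, not the ring-buffer
structures themselves):

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative flag bits in the top of a 32-bit "commit" word, mirroring
 * the RB_MISSED_STORED / RB_MISSED_MASK idea above.
 */
#define MISSED_EVENTS	(1u << 31)	/* events were dropped */
#define MISSED_STORED	(1u << 30)	/* dropped-event count stored at page end */
#define MISSED_MASK	(3u << 30)

static uint32_t page_data_size(uint32_t commit)
{
	/* the usable size is only valid once the flag bits are masked off */
	return commit & ~MISSED_MASK;
}

int main(void)
{
	uint32_t commit = 4000;		/* bytes committed to the sub-buffer */

	commit |= MISSED_EVENTS | MISSED_STORED;

	printf("raw commit word : 0x%08x\n", commit);
	printf("data size       : %u bytes\n", page_data_size(commit));
	printf("missed events   : %s\n",
	       (commit & MISSED_EVENTS) ? "yes" : "no");
	return 0;
}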
Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()
On Tue, Feb 13, 2024 at 02:10:45PM +0300, Andrew Kanner wrote:
> On Thu, Feb 01, 2024 at 10:13:54AM -0800, Luis Chamberlain wrote:
> > 
> > While you're at it, if you want to try it, you could see if you can
> > improve the situation more by looking at symbol_get() users that remain
> > and seeing if you can instead fix it with proper Kconfig dependency and
> > at build time. Then we can just remove it as well.
> > 
> > Luis
> 
> Sorry for the late reply.
> 
> Luis, can you give more details of your idea? I re-read it once, then
> came back and still don't understand.
> 
> I see that there are ~10 users for symbol_get() currently. Do you want
> to stringify symbol names at build time to completely remove
> symbol_get() from module.h? Correct me if I'm wrong, since using a
> function which is not declared anywhere sounds confusing.

As an example, look at the code and see if there's a sensible way to make
some calls built-in instead of part of the module; then the module can have
a kconfig builtin option that adds to the built-in code, which means you
don't need the symbol_get().

For some other pieces of code it may require other strategies.

Luis
Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()
On Tue, Mar 12, 2024 at 03:25:27PM -0700, Luis Chamberlain wrote:
> On Tue, Feb 13, 2024 at 02:10:45PM +0300, Andrew Kanner wrote:
> > On Thu, Feb 01, 2024 at 10:13:54AM -0800, Luis Chamberlain wrote:
> > > 
> > > While you're at it, if you want to try it, you could see if you can
> > > improve the situation more by looking at symbol_get() users that remain
> > > and seeing if you can instead fix it with proper Kconfig dependency and
> > > at build time. Then we can just remove it as well.
> > > 
> > > Luis
> > 
> > Sorry for the late reply.
> > 
> > Luis, can you give more details of your idea? I re-read it once, then
> > came back and still don't understand.
> > 
> > I see that there are ~10 users for symbol_get() currently. Do you want
> > to stringify symbol names at build time to completely remove
> > symbol_get() from module.h? Correct me if I'm wrong, since using a
> > function which is not declared anywhere sounds confusing.
> 
> As an example, look at the code and see if there's a sensible way to make
> some calls built-in instead of part of the module; then the module can have
> a kconfig builtin option that adds to the built-in code, which means you
> don't need the symbol_get().
> 
> For some other pieces of code it may require other strategies.

An example is FW_LOADER_USER_HELPER, which is bool only and is selected by
its users. It didn't use symbol_get() before; however, it's an example of
how, through Kconfig, you can align requirements and define built-in
components, even if they do come from a module.

Luis
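To make the suggestion concrete, here is a hypothetical sketch of the two
approaches side by side: a module resolving another module's export at
runtime with symbol_get(), versus the same dependency expressed at build
time via a bool Kconfig symbol and a direct call. CONFIG_FOO_HELPER and
foo_do_helper() are made-up names for illustration, not an existing in-tree
user.

/*
 * Hypothetical sketch only: CONFIG_FOO_HELPER and foo_do_helper() are
 * invented names used to illustrate the Kconfig-based alternative.
 */
#include <linux/kconfig.h>
#include <linux/module.h>
#include <linux/errno.h>

/* Provided by the (hypothetical) foo core; built-in when CONFIG_FOO_HELPER=y. */
extern int foo_do_helper(void);

/* Before: resolve the symbol at runtime; works only if the provider is loaded. */
static int use_helper_via_symbol_get(void)
{
	int (*helper)(void);
	int ret = -ENOENT;

	helper = symbol_get(foo_do_helper);
	if (helper) {
		ret = helper();
		symbol_put(foo_do_helper);
	}
	return ret;
}

/*
 * After: the dependency is expressed in Kconfig (e.g. "select FOO_HELPER",
 * with FOO_HELPER being a bool), so the helper is always built-in and can be
 * called directly; the branch compiles away when the option is off.
 */
static int use_helper_builtin(void)
{
	if (!IS_ENABLED(CONFIG_FOO_HELPER))
		return -EOPNOTSUPP;

	return foo_do_helper();
}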
[GIT PULL] Modules changes for v6.9-rc1
The following changes since commit 41bccc98fb7931d63d03f326a746ac4d429c1dd3:

  Linux 6.8-rc2 (2024-01-28 17:01:12 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/modules-6.9-rc1

for you to fetch changes up to d1909c0221739356f31c721de4743e7d219a56cc:

  module: Don't ignore errors from set_memory_XX() (2024-02-16 11:30:43 -0800)

Modules changes for v6.9-rc1

Christophe Leroy did most of the work on this release, first with a few
cleanups on CONFIG_STRICT_KERNEL_RWX and ending with error handling for
when set_memory_XX() can fail. This is part of a larger effort to clean up
all these callers which can fail; modules is just part of it. This has
been sitting on linux-next for about a month without issues.

Christophe Leroy (6):
      module: Use set_memory_rox()
      module: Change module_enable_{nx/x/ro}() to more explicit names
      init: Declare rodata_enabled and mark_rodata_ro() at all time
      modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled
      powerpc: Simplify strict_kernel_rwx_enabled()
      module: Don't ignore errors from set_memory_XX()

Randy Dunlap (1):
      lib/test_kmod: fix kernel-doc warnings

 arch/powerpc/include/asm/mmu.h |  9 +-
 include/linux/init.h           |  4 ---
 init/main.c                    | 21 +-
 kernel/module/internal.h       |  6 ++--
 kernel/module/main.c           | 20 +++---
 kernel/module/strict_rwx.c     | 63 +++---
 lib/test_kmod.c                |  6 +++-
 7 files changed, 73 insertions(+), 56 deletions(-)
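For context on the last item in the series, "module: Don't ignore errors
from set_memory_XX()" is about propagating the return value of the
set_memory_*() helpers instead of discarding it. A hedged sketch of that
pattern is below; protect_module_text() is an illustrative function, not
the actual kernel/module/strict_rwx.c code.

/*
 * Illustrative only: shows the error-propagation pattern for the
 * set_memory_*() helpers, not the actual strict_rwx.c implementation.
 */
#include <linux/set_memory.h>
#include <linux/printk.h>

static int protect_module_text(unsigned long addr, int nr_pages)
{
	int ret;

	/*
	 * set_memory_rox() (like the other set_memory_*() helpers) can fail,
	 * for example if changing permissions needs an allocation that does
	 * not succeed, so the result must be checked and passed up instead
	 * of being silently dropped.
	 */
	ret = set_memory_rox(addr, nr_pages);
	if (ret)
		pr_warn("failed to protect module text: %d\n", ret);

	return ret;
}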