[PATCH] lib: spinlock_debug: Avoid livelock in do_raw_spin_lock
The logic in do_raw_spin_lock attempts to acquire a spinlock by invoking
arch_spin_trylock in a loop with a delay between each attempt. Now consider
the following situation in a 2 CPU system:

1. CPU-0 continually acquires and releases a spinlock in a tight loop; it
   stays in this loop until some condition X is satisfied. X can only be
   satisfied by another CPU.

2. CPU-1 tries to acquire the same spinlock, in an attempt to satisfy the
   aforementioned condition X. However, it never sees the unlocked value of
   the lock because the debug spinlock code uses trylock instead of just
   lock; it checks at all the wrong moments - whenever CPU-0 has locked the
   lock.

Now in the absence of debug spinlocks, the architecture specific spinlock
code can correctly allow CPU-1 to wait in a "queue" (e.g., ticket spinlocks),
ensuring that it acquires the lock at some point. However, with the debug
spinlock code, livelock can easily occur due to the use of try_lock, which
obviously cannot put the CPU in that "queue". This queueing mechanism is
implemented in both x86 and ARM spinlock code.

Note that the situation mentioned above is not hypothetical. A real problem
was encountered where CPU-0 was running hrtimer_cancel with interrupts
disabled, and CPU-1 was attempting to run the hrtimer that CPU-0 was trying
to cancel.

Address this by actually attempting arch_spin_lock once it is suspected that
there is a spinlock lockup. If we're in a situation that is described above,
the arch_spin_lock should succeed; otherwise other timeout mechanisms (e.g.,
watchdog) should alert the system of a lockup. Therefore, if there is a
genuine system problem and the spinlock can't be acquired, the end result
(irrespective of this change being present) is the same. If there is a
livelock caused by the debug code, this change will allow the lock to be
acquired, depending on the implementation of the lower level arch specific
spinlock code.

Signed-off-by: Vikram Mulukutla
---
 lib/spinlock_debug.c | 32 ++++++++++++++++++--------------
 1 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/lib/spinlock_debug.c b/lib/spinlock_debug.c
index eb10578..94624e1 100644
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -107,23 +107,27 @@ static void __spin_lock_debug(raw_spinlock_t *lock)
 {
 	u64 i;
 	u64 loops = loops_per_jiffy * HZ;
-	int print_once = 1;
 
-	for (;;) {
-		for (i = 0; i < loops; i++) {
-			if (arch_spin_trylock(&lock->raw_lock))
-				return;
-			__delay(1);
-		}
-		/* lockup suspected: */
-		if (print_once) {
-			print_once = 0;
-			spin_dump(lock, "lockup suspected");
+	for (i = 0; i < loops; i++) {
+		if (arch_spin_trylock(&lock->raw_lock))
+			return;
+		__delay(1);
+	}
+	/* lockup suspected: */
+	spin_dump(lock, "lockup suspected");
 #ifdef CONFIG_SMP
-			trigger_all_cpu_backtrace();
+	trigger_all_cpu_backtrace();
 #endif
-		}
-	}
+
+	/*
+	 * In case the trylock above was causing a livelock, give the lower
+	 * level arch specific lock code a chance to acquire the lock. We have
+	 * already printed a warning/backtrace at this point. The non-debug arch
+	 * specific code might actually succeed in acquiring the lock. If it is
+	 * not successful, the end-result is the same - there is no forward
+	 * progress.
+	 */
+	arch_spin_lock(&lock->raw_lock);
 }
 
 void do_raw_spin_lock(raw_spinlock_t *lock)
--
1.7.8.3

--
The Qualcomm Innovation Center, Inc.
is a member of Code Aurora Forum, hosted by The Linux Foundation
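
For illustration, the interaction the commit message describes can be
sketched in user space with POSIX spinlocks (build with -pthread). This is
only a rough analogue of the kernel situation -- the dummy critical-section
work and the 1us poll delay are arbitrary stand-ins, and whether it actually
wedges on a given machine depends on timing -- but it shows the structure:
one thread hammers lock/unlock until "condition X" is satisfied, while the
other side can only poll with trylock:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t s;
static volatile int condition_x;	/* "condition X" from the description */

/* Mimics CPU-0: acquire/release in a tight loop until condition X is set. */
static void *cpu0_loop(void *arg)
{
	while (!condition_x) {
		pthread_spin_lock(&s);
		for (volatile int i = 0; i < 1000; i++)
			;	/* stand-in for the real critical section */
		pthread_spin_unlock(&s);
	}
	return NULL;
}

/* Mimics the debug spinlock code on CPU-1: poll with trylock plus a delay. */
static void *cpu1_loop(void *arg)
{
	while (pthread_spin_trylock(&s) != 0)
		usleep(1);		/* stand-in for __delay(1) */
	condition_x = 1;		/* satisfy condition X */
	pthread_spin_unlock(&s);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_spin_init(&s, PTHREAD_PROCESS_PRIVATE);
	pthread_create(&t0, NULL, cpu0_loop, NULL);
	pthread_create(&t1, NULL, cpu1_loop, NULL);
	pthread_join(t1, NULL);
	pthread_join(t0, NULL);
	puts("both threads made forward progress");
	return 0;
}

The point of the change above is that when the polling side instead falls
back to arch_spin_lock (the ticket implementation in the kernel), it queues
up and eventually gets the lock, instead of having to hit the narrow
unlocked window by chance.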
Re: [PATCH v9 1/5] firmware: add extensible driver data params
Hi Greg, Luis, On 2017-06-17 12:38, Greg KH wrote: On Tue, Jun 13, 2017 at 09:40:11PM +0200, Luis R. Rodriguez wrote: On Tue, Jun 13, 2017 at 11:05:48AM +0200, Greg KH wrote: > On Mon, Jun 05, 2017 at 02:39:33PM -0700, Luis R. Rodriguez wrote: > > As the firmware API evolves we keep extending functions with more arguments. > > Stop this nonsense by proving an extensible data structure which can be used > > to represent both user parameters and private internal parameters. > > Let's take a simple C function interface and make it a more complex > data-driven interface that is impossible to understand and obviously > understand how it is to be used and works! The firmware codebase was already complex! Heh, I'm not arguing with you there :) What you have to ask yourself really is if this makes it *less complex* and helps *clean things up* in a much better way than it was before. Also does it allow us to *pave the way for new functionality easily*, without creating further mess? I agree, that's what I'm saying here. I just do not see that happening with your patch set at all. It's adding more code, a more complex way to interact with the subsystem, and not making driver writer lives any easier at all that I can see. Again, the code is now bigger, does more, with not even any real benefit for existing users. If not, what concrete alternatives do you suggest? It's working, so leave it alone? :) So I wanted to note here that I had a similar internal discussion with Stephen Boyd when he first upstreamed the request_firmware_into_buf API. I actually did change things a bit to pass around a 'desc' structure with all the flags and parameters in our internal vendor tree [1]. This is a sort of an ultra-lite version of what Luis is trying to do - extensibility - but does not change the API for driver owners. Existing APIs become wrappers around the new API. My primary reason then was the number of things being passed around between layers of functions inside firmware_class seemed a bit untenable. Just like Greg, Stephen also brought up the argument about why the unnecessary complication to the API without any measurable benefit - this seemed a fair argument to me at that time. But here's my point - if Luis agrees that _this_ patchset is going slightly Mjolnir, perhaps a light internal rework inside firmware_class might be more suitable towards the extensibility goal while keeping driver API usage as simple as it is today (even if you hate my patch for various reasons)? Thanks, Vikram [1] - https://source.codeaurora.org/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=msm-3.18&id=7aa7efd3c150840369739893a84bd1d9f9774319
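
To sketch what that internal rework looks like (illustrative only -- the
field and function names below are not the actual firmware_class internals,
just the shape of the msm-3.18 change referenced above): the private entry
point takes a single descriptor, and the existing driver-facing calls remain
unchanged, becoming thin wrappers:

#include <linux/firmware.h>
#include <linux/device.h>

/* Internal-only descriptor; never exposed to drivers. */
struct fw_desc {
	const char *name;
	struct device *device;
	bool uevent;		/* allow the uevent/fallback path */
	bool nowait;		/* request_firmware_nowait() semantics */
	void *dest_addr;	/* pre-allocated buffer, if any */
	size_t dest_size;
};

static int _request_firmware(const struct firmware **firmware_p,
			     struct fw_desc *desc);

/* Existing API is unchanged for driver owners. */
int request_firmware(const struct firmware **firmware_p, const char *name,
		     struct device *device)
{
	struct fw_desc desc = {
		.name	= name,
		.device	= device,
		.uevent	= true,
	};

	return _request_firmware(firmware_p, &desc);
}

int request_firmware_into_buf(const struct firmware **firmware_p,
			      const char *name, struct device *device,
			      void *buf, size_t size)
{
	struct fw_desc desc = {
		.name		= name,
		.device		= device,
		.uevent		= true,
		.dest_addr	= buf,
		.dest_size	= size,
	};

	return _request_firmware(firmware_p, &desc);
}

New parameters then become new descriptor fields plus, at most, one new
wrapper, instead of another argument threaded through every internal layer.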
Additional compiler barrier required in sched_preempt_enable_no_resched?
Hi,

I came across a piece of engineering code that looked like:

	preempt_disable();
	/* --cut, lots of code-- */
	preempt_enable_no_resched();
	put_user()
	preempt_disable();

(If you wish to seriously question the usage of the preempt API in this
manner, I unfortunately have no comment since I didn't write the code.)

This particular block of code was causing lockups and crashes on a certain
ARM64 device. The generated assembly revealed that the compiler was simply
optimizing out the increment and decrement of the preempt count, allowing
put_user to run without preemption enabled, causing all sorts of badness.
Since put_user doesn't actually access the preempt count and translates to
just a few instructions without any branching, I suppose that the compiler
figured it was OK to optimize.

The immediate solution is to add a compiler barrier to the code above, but
should sched_preempt_enable_no_resched have an additional compiler barrier
after (has one before already) the preempt-count decrement to prevent this
sort of thing?

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
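
To spell out the sequence: with the generic definitions (a simplified sketch
of include/linux/preempt.h and include/asm-generic/preempt.h, not a verbatim
expansion, and with placeholder put_user arguments), the code above expands
to roughly:

	/* preempt_enable_no_resched(); put_user(); preempt_disable(); */
	barrier();
	preempt_count_dec();	/* no compiler barrier after the decrement... */

	put_user(x, ptr);	/* ...so nothing orders this against the count ops */

	preempt_count_inc();	/* ...and no compiler barrier before the increment */
	barrier();

i.e. there is no barrier() anywhere between the decrement and the increment,
which appears to be what lets the compiler collapse the pair.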
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
> On Thu, May 12, 2016 at 11:39:47PM -0700, Vikram Mulukutla wrote:
>> Hi,
>>
>> I came across a piece of engineering code that looked like:
>>
>> 	preempt_disable();
>> 	/* --cut, lots of code-- */
>> 	preempt_enable_no_resched();
>> 	put_user()
>> 	preempt_disable();
>>
>> (If you wish to seriously question the usage of the preempt API in this
>> manner, I unfortunately have no comment since I didn't write the code.)
>
> I'm with Thomas here, that's broken and should not be done.

Ok. I did in fact zero in on this code by replacing each instance of
preempt_enable_no_resched with preempt_enable one by one (there were several
uses in the driver). I will ask the original developer to consider using
preempt_enable.

>> This particular block of code was causing lockups and crashes on a certain
>> ARM64 device. The generated assembly revealed that the compiler was simply
>> optimizing out the increment and decrement of the preempt count, allowing
>> put_user to run without preemption enabled, causing all sorts of badness.
>> Since put_user doesn't actually access the preempt count and translates to
>> just a few instructions without any branching, I suppose that the compiler
>> figured it was OK to optimize.
>>
>> The immediate solution is to add a compiler barrier to the code above, but
>> should sched_preempt_enable_no_resched have an additional compiler barrier
>> after (has one before already) the preempt-count decrement to prevent this
>> sort of thing?
>
> I think the below would be sufficient; IIRC the compiler may not combine
> or elide volatile operations.
>
> ---
>  include/asm-generic/preempt.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
> index 5d8ffa3e6f8c..c1cde3577551 100644
> --- a/include/asm-generic/preempt.h
> +++ b/include/asm-generic/preempt.h
> @@ -7,10 +7,10 @@
>
>  static __always_inline int preempt_count(void)
>  {
> -	return current_thread_info()->preempt_count;
> +	return READ_ONCE(current_thread_info()->preempt_count);
>  }
>
> -static __always_inline int *preempt_count_ptr(void)
> +static __always_inline volatile int *preempt_count_ptr(void)
>  {
>  	return &current_thread_info()->preempt_count;
>  }

Thanks Peter, this patch worked for me. The compiler no longer optimizes out
the increment/decrement of the preempt_count.

Thanks,
Vikram
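
As a follow-up note: the reason this helps is that READ_ONCE() and the
volatile-qualified pointer turn every preempt_count access into a volatile
access, which the compiler has to emit exactly as written and may not merge
with a neighbouring one. Roughly speaking (a simplified sketch -- the real
macro in include/linux/compiler.h is size-aware):

	#define READ_ONCE_SKETCH(x)	(*(const volatile typeof(x) *)&(x))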
Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
On 5/14/2016 8:39 AM, Thomas Gleixner wrote:
> On Fri, 13 May 2016, Vikram Mulukutla wrote:
>> On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
>>> diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
>>> index 5d8ffa3e6f8c..c1cde3577551 100644
>>> --- a/include/asm-generic/preempt.h
>>> +++ b/include/asm-generic/preempt.h
>>> @@ -7,10 +7,10 @@
>>>
>>>  static __always_inline int preempt_count(void)
>>>  {
>>> -	return current_thread_info()->preempt_count;
>>> +	return READ_ONCE(current_thread_info()->preempt_count);
>>>  }
>>>
>>> -static __always_inline int *preempt_count_ptr(void)
>>> +static __always_inline volatile int *preempt_count_ptr(void)
>>>  {
>>>  	return &current_thread_info()->preempt_count;
>>>  }
>>
>> Thanks Peter, this patch worked for me. The compiler no longer optimizes
>> out the increment/decrement of the preempt_count.
>
> I have a hard time to understand why the compiler optimizes out stuff w/o
> that patch.
>
> We already have:
>
> #define preempt_disable() \
> do { \
> 	preempt_count_inc(); \
> 	barrier(); \
> } while (0)
>
> #define sched_preempt_enable_no_resched() \
> do { \
> 	barrier(); \
> 	preempt_count_dec(); \
> } while (0)
>
> #define preempt_enable() \
> do { \
> 	barrier(); \
> 	if (unlikely(preempt_count_dec_and_test())) \
> 		__preempt_schedule(); \
> } while (0)
>
> So the barriers already forbid that the compiler reorders code. How on
> earth is the compiler supposed to optimize the dec/inc out?

While I cannot claim more than a very rudimentary knowledge of compilers,
this was the initial reaction that I had as well. But then the barriers are
in the wrong spots for the way the code was used in the driver in question.
preempt_disable has the barrier() _after_ the increment, and
sched_preempt_enable_no_resched has it _before_ the decrement. Therefore, if
one were to use preempt_enable_no_resched followed by preempt_disable, there
is no compiler barrier between the decrement and the increment statements.
Whether this should translate to such a seemingly aggressive optimization is
something I'm not completely sure of.

> There is more code than the preempt stuff depending on barrier() and
> expecting that the compiler does not optimize out things at will. Are we
> supposed to hunt all occurences and amend them with READ_ONCE or whatever
> one by one?

I think the barrier is working as intended for the sequence of
preempt_disable followed by preempt_enable_no_resched.

> Thanks,
> 	tglx

As a simple experiment I used gcc on x86 on the following C program. This was
really to convince myself rather than you and Peter!
#include <stdio.h>

#define barrier() __asm__ __volatile__("": : :"memory")

struct thread_info {
	int pc;
};

#define preempt_count() \
	ti_ptr->pc

#define preempt_disable() \
do \
{ \
	preempt_count() += 1; \
	barrier(); \
} \
while (0)

#define preempt_enable() \
do \
{ \
	barrier(); \
	preempt_count() -= 1; \
} \
while (0)

struct thread_info *ti_ptr;

int main(void)
{
	struct thread_info ti;
	ti_ptr = &ti;
	int b;

	preempt_enable();
	b = b + 500;
	preempt_disable();

	printf("b = %d\n", b);

	return 0;
}

With gcc -O2 I get this (the ARM compiler behaves similarly):

00400470 <main>:
  400470:	48 83 ec 18          	sub    $0x18,%rsp
  400474:	48 89 25 cd 0b 20 00 	mov    %rsp,0x200bcd(%rip)
  40047b:	ba f4 01 00 00       	mov    $0x1f4,%edx
  400480:	be 14 06 40 00       	mov    $0x400614,%esi
  400485:	bf 01 00 00 00       	mov    $0x1,%edi
  40048a:	31 c0                	xor    %eax,%eax
  40048c:	e8 cf ff ff ff       	callq  400460 <__printf_chk@plt>
  400491:	31 c0                	xor    %eax,%eax
  400493:	48 83 c4 18          	add    $0x18,%rsp
  400497:	c3                   	retq

If I swap preempt_enable and preempt_disable I get this:

00400470 <main>:
  400470:	48 83 ec 18          	sub    $0x18,%rsp
  400474:	48 89 25 cd 0b 20 00 	mov    %rsp,0x200bcd(%rip)
  40047b:	83 04 24 01          	addl   $0x1,(%rsp)
  40047f:	48 8b 05 c2 0b 20 00 	mov    0x200bc2(%rip),%rax
  400486:	ba f4 01 00 00       	mov    $0x1f4,%edx
  40048b:	be 14 06 40 00       	mov    $0x400614,%esi
  400490:	bf 01 00 00 00       	mov    $0x1,%edi
  400495:	83 28 01             	subl   $0x1,(%rax)
  400498:	31 c0                	xor    %eax,%eax
  40049a:	e8 c1 ff ff ff       	callq  400460 <__printf_chk@plt>
  40049f:	31 c0                	xor    %eax,%eax
  4004a1:	48 83 c4 18          	add    $0x18,%rsp
  4004a5:	c3                   	retq

Note: If I place ti_ptr on the stack, the ordering/barriers no longer matter,
the inc/dec is always optimized out. I suppose the compiler does treat
current_thread_info as coming from a dif
[PATCH] tracing: Fix unmapping loop in tracing_mark_write
Commit 6edb2a8a385f0cdef51dae37ff23e74d76d8a6ce introduced an array map_pages
that contains the addresses returned by kmap_atomic. However, when unmapping
those pages, map_pages[0] is unmapped before map_pages[1], breaking the
nesting requirement as specified in the documentation for
kmap_atomic/kunmap_atomic. This was caught by the highmem debug code present
in kunmap_atomic. Fix the loop to do the unmapping properly.

Reviewed-by: Stephen Boyd
Reported-by: Lime Yang
Signed-off-by: Vikram Mulukutla
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ab76b7b..bceed34 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4931,7 +4931,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 	*fpos += written;
 
  out_unlock:
-	for (i = 0; i < nr_pages; i++){
+	for (i = nr_pages - 1; i >= 0; i--) {
 		kunmap_atomic(map_page[i]);
 		put_page(pages[i]);
 	}
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
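
For reference, the nesting rule being violated: kmap_atomic() mappings are
stacked per-CPU, so they must be released in the reverse order they were
taken. Something like this (illustrative fragment only; page0/page1 stand
for the two mapped highmem pages):

	void *vaddr0 = kmap_atomic(page0);
	void *vaddr1 = kmap_atomic(page1);

	/* ... copy the user data into the two mapped pages ... */

	kunmap_atomic(vaddr1);	/* most recently mapped page first */
	kunmap_atomic(vaddr0);

which is exactly the order the loop now preserves by walking the array
backwards.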
[RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
We propose Window-Assisted Load Tracking (WALT) as an alternative or
additional load tracking scheme in lieu of or along with PELT, one that in
our estimation better tracks task demand and CPU utilization especially for
use cases on mobile devices. WALT was conceived by Srivatsa Vaddagiri to
provide better perf-per-watt numbers on asymmetric CPU (frequency and/or IPC)
implementations (specifically on Qualcomm Snapdragon chipsets running
Android), and its metrics have been used to guide task placement and p-state
selection (load balancing in CFS still uses PELT statistics). WALT is now
present in the android-common kernel as well.

This RFC is not an attempt to supplant PELT statistics; this is more about
starting a discussion, perhaps one that may result in changes to PELT to
address the requirements listed below.

Background
__________

1) Classifying a task as "heavy" faster helps in mobile world use-cases such
as scrolling a UI or browsing a web page where tasks exhibit sporadically
heavy load, allowing, for instance, faster migration to a more capable CPU
(assuming the scheduler uses task util and CPU capacity as inputs during task
placement decision making, e.g. EAS). Reclassification of a light task as
heavy is also important - for example, a rendering thread may change its
demand depending on what is being shown on-screen. We believe that quickly
associating a task with its necessary resources requires a more dynamic
demand or utilization signal than provided by PELT, which is subject to the
geometric series and does not differentiate between task ramp-up and
ramp-down. E.g., with a 32ms halflife, PELT takes ~74ms to reach 80%
nice-0-load_avg/util_avg for a continuously executing 100% task, and ~139ms
to reach 95%.

2) Historical task demand (even if a task sleeps for say 300ms due to network
delay) is required to restore both CPU and frequency resources to meet
performance throughput as well. PELT with a 32ms 'half-life' takes just
~213ms to decay utilization/load to (effectively) zero, forgetting that
task's history, requiring the task to re-execute its workload to gain that
load_avg/util_avg again.

3) There is a need to quickly ramp up CPU frequency as a result of heavy CPU
utilization, and to reduce frequency whenever possible to save power.
Consider as an example a single 100% thread running continuously on a single
core using schedutil with a 10ms transition latency. On an i7 8-core SMP
machine with SMT, PELT takes 60-70ms to ramp to FMAX from FMIN.

4) PELT's blocked util decay implies that the underlying cpufreq governor
would have to delay dropping frequency for longer than necessary (especially
for mobile world usecases). PELT with a 32ms 'half-life' takes ~108ms to
reduce the runqueue utilization/nice-0-load from max to 10%. Blocked
load/util tracking is of course a conservative approach recognizing that load
might return to the runqueue, but one that we feel is inflexible when it
comes to certain real world use cases.

WALT vs PELT
____________

WALT retains the per-entity properties of PELT; demand and utilization are
accounted for at the task level, and a CPU rq's utilization contribution is
the sum of its children's contributions. Wallclock time, however, is divided
into a series of windows - to be clear about what WALT tracks exactly, we
introduce the terms task demand and CPU busy-time here and show the benefits
versus PELT that apply to the four points mentioned above:

1) Task Demand - The contribution of a task's running and runnable time to a
window, averaged over N sliding windows.
Note that a window has to be *completed* before its contribution is
considered. A task's demand is the maximum of its contribution to the most
recently completed window and its average demand over the past N windows.
This metric can be roughly thought of as an unweighted load_avg without
blocked load. WALT can take as little as one window of execution to report
max task demand, although with freq and capacity invariance, this tends to be
about 3 windows with a good frequency governor (so that's about 30ms with a
10ms window). This allows faster and more accurate task placement decisions.

2) WALT "forgets" blocked time entirely, i.e. the sliding windows exist only
when the task is on the runqueue or running. This allows rapid
reclassification of the same task as heavy after a short sleep. Thus a task
does not need to re-execute to gain its demand once more and can be migrated
up to a big CPU immediately.

3) CPU busy time - The sum of execution times of all tasks in the most
recently completed window. For the same machine described above and with
schedutil, WALT with a 10ms window takes around 30-50ms to ramp to FMAX, and
with a 20ms window takes between 60-100ms (note that it is assumed that
schedutil has a transition latency equal to or greater than the window size).

4) WALT "forgets" cpu utilization as soon as tasks are taken off of the
runqueue, and thus the cpufreq governor can choose to drop frequ
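
To make the task demand definition in point 1) concrete, here is a
deliberately simplified, illustrative sketch of the windowed bookkeeping in
plain C. This is not the actual walt.c code; the history depth, the 10ms
window and the helper name are assumptions, and splitting a busy period that
crosses a window boundary is omitted for brevity:

#include <stdint.h>
#include <string.h>

#define NR_WINDOWS	5
#define WINDOW_NS	10000000ULL	/* 10ms window, tunable */

struct demand_sketch {
	uint64_t window_start;		/* start time of the current window */
	uint64_t curr_window;		/* busy time in the (incomplete) current window */
	uint64_t hist[NR_WINDOWS];	/* most recent completed windows, hist[0] newest */
	uint64_t demand;		/* reported task demand */
};

/* Account 'busy_ns' of runnable+running time ending at 'now'. */
static void account_busy(struct demand_sketch *d, uint64_t now, uint64_t busy_ns)
{
	d->curr_window += busy_ns;

	/* Roll over any windows that have completed since the last event. */
	while (now >= d->window_start + WINDOW_NS) {
		uint64_t sum = 0;
		int i;

		memmove(&d->hist[1], &d->hist[0],
			(NR_WINDOWS - 1) * sizeof(d->hist[0]));
		d->hist[0] = d->curr_window;
		d->curr_window = 0;
		d->window_start += WINDOW_NS;

		for (i = 0; i < NR_WINDOWS; i++)
			sum += d->hist[i];
		/*
		 * Demand: max of the most recently completed window and the
		 * average over the last NR_WINDOWS completed windows.
		 */
		d->demand = d->hist[0] > sum / NR_WINDOWS ?
			    d->hist[0] : sum / NR_WINDOWS;
	}
}

The properties described above fall out directly: only completed windows
matter, there is no geometric decay, and because busy time is only fed in
while a task is runnable or running, blocked time never dilutes the history.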
[RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes
From: Srivatsa Vaddagiri Add the necessary hooks to core and the various scheduling classes that will allow WALT to track CPU utilization and handle task migration between CPUs as well. With CONFIG_SCHED_WALT enabled, schedutil will use WALT's cpu utilization metric by default. This can be switched to PELT's util_avg at runtime by the following command: echo 0 > /proc/sys/kernel/sched_use_walt_metrics Signed-off-by: Srivatsa Vaddagiri Signed-off-by: Vikram Mulukutla --- kernel/sched/core.c | 29 - kernel/sched/deadline.c | 7 +++ kernel/sched/debug.c| 9 + kernel/sched/fair.c | 9 +++-- kernel/sched/rt.c | 6 ++ 5 files changed, 57 insertions(+), 3 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 44817c6..3b7f67d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -91,6 +91,8 @@ #define CREATE_TRACE_POINTS #include +#include "walt.h" + DEFINE_MUTEX(sched_domains_mutex); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); @@ -991,6 +993,7 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new p->on_rq = TASK_ON_RQ_MIGRATING; dequeue_task(rq, p, 0); + walt_prepare_migrate(p, rq, true); set_task_cpu(p, new_cpu); raw_spin_unlock(&rq->lock); @@ -998,6 +1001,7 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new raw_spin_lock(&rq->lock); BUG_ON(task_cpu(p) != new_cpu); + walt_finish_migrate(p, rq, true); enqueue_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; check_preempt_curr(rq, p, 0); @@ -1257,7 +1261,9 @@ static void __migrate_swap_task(struct task_struct *p, int cpu) p->on_rq = TASK_ON_RQ_MIGRATING; deactivate_task(src_rq, p, 0); + walt_prepare_migrate(p, src_rq, true); set_task_cpu(p, cpu); + walt_finish_migrate(p, dst_rq, true); activate_task(dst_rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; check_preempt_curr(dst_rq, p, 0); @@ -2072,13 +2078,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) */ smp_cond_load_acquire(&p->on_cpu, !VAL); + raw_spin_lock(&task_rq(p)->lock); + walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0); + raw_spin_unlock(&task_rq(p)->lock); + p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); if (task_cpu(p) != cpu) { wake_flags |= WF_MIGRATED; + walt_prepare_migrate(p, task_rq(p), false); set_task_cpu(p, cpu); + walt_finish_migrate(p, cpu_rq(cpu), false); } #endif /* CONFIG_SMP */ @@ -2129,8 +2141,10 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie trace_sched_waking(p); - if (!task_on_rq_queued(p)) + if (!task_on_rq_queued(p)) { + walt_update_task_ravg(p, rq, TASK_WAKE, walt_ktime_clock(), 0); ttwu_activate(rq, p, ENQUEUE_WAKEUP); + } ttwu_do_wakeup(rq, p, 0, cookie); if (schedstat_enabled()) @@ -2196,6 +2210,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p) p->se.nr_migrations = 0; p->se.vruntime = 0; INIT_LIST_HEAD(&p->se.group_node); + walt_init_new_task_load(p); #ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq= NULL; @@ -2570,6 +2585,8 @@ void wake_up_new_task(struct task_struct *p) rq = __task_rq_lock(p, &rf); post_init_entity_util_avg(&p->se); + walt_mark_task_starting(p); + activate_task(rq, p, 0); p->on_rq = TASK_ON_RQ_QUEUED; trace_sched_wakeup_new(p); @@ -3071,6 +3088,7 @@ void scheduler_tick(void) update_rq_clock(rq); curr->sched_class->task_tick(rq, curr, 0); cpu_load_update_active(rq); + walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, walt_ktime_clock(), 0); 
calc_global_load_tick(rq); raw_spin_unlock(&rq->lock); @@ -3322,6 +3340,7 @@ static void __sched notrace __schedule(bool preempt) struct pin_cookie cookie; struct rq *rq; int cpu; + u64 wallclock; cpu = smp_processor_id(); rq = cpu_rq(cpu); @@ -3385,6 +3404,9 @@ static void __sched notrace __schedule(bool preempt) update_rq_clock(rq); next = pick_next_task(rq, prev, cookie); + wallclock = walt_ktime_clock(); + walt_update_task_ravg(prev, rq, PUT_PREV_TASK, wallclock, 0); + walt_update_task_ravg(ne
[RFC PATCH 1/3] sched: Introduce structures necessary for WALT
From: Srivatsa Vaddagiri Add the per-task and per-runqueue data structures that will later be used by Window Assisted Load Tracking (WALT) to estimate task demand and CPU utilization. Move cap_scale into sched.h as that will be needed by WALT as well to implement frequency and capacity invariance. Signed-off-by: Srivatsa Vaddagiri Signed-off-by: Vikram Mulukutla --- include/linux/sched.h | 39 +++ kernel/sched/fair.c | 2 -- kernel/sched/sched.h | 8 3 files changed, 47 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 62c68e5..64f8bec 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -315,6 +315,21 @@ extern char ___assert_task_state[1 - 2*!!( /* Task command name length */ #define TASK_COMM_LEN 16 +/* + * These events may be replaced with a combination of existing scheduler flags + * provided that that doesn't make the implementation too fragile. + */ +enum task_event { + PUT_PREV_TASK = 0, + PICK_NEXT_TASK = 1, + TASK_WAKE = 2, + TASK_MIGRATE= 3, + TASK_UPDATE = 4, + IRQ_UPDATE = 5, +}; + +extern char *task_event_names[]; + #include /* @@ -1320,6 +1335,25 @@ struct sched_statistics { }; #endif +#ifdef CONFIG_SCHED_WALT + +/* ravg represents capacity scaled cpu-usage of tasks */ +struct ravg { + /* +* 'mark_start' marks the most recent event for a task +* +* 'curr_window' represents task's cpu usage in its most recent +* window +* +* 'prev_window' represents task's cpu usage in the window prior +* to the one represented by 'curr_window' + */ + u64 mark_start; + u32 curr_window, prev_window; +}; +#endif + + struct sched_entity { struct load_weight load; /* for load-balancing */ struct rb_node run_node; @@ -1480,6 +1514,11 @@ struct task_struct { const struct sched_class *sched_class; struct sched_entity se; struct sched_rt_entity rt; + +#ifdef CONFIG_SCHED_WALT + struct ravg ravg; +#endif + #ifdef CONFIG_CGROUP_SCHED struct task_group *sched_task_group; #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ce8b244..39c826d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2674,8 +2674,6 @@ static u32 __compute_runnable_contrib(u64 n) return contrib + runnable_avg_yN_sum[n]; } -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) - /* * We can represent the historical contribution to runnable average as the * coefficients of a geometric series. To do this we sub-divide our runnable diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c64fc51..9bf6925 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -65,6 +65,8 @@ static inline void cpu_load_update_active(struct rq *this_rq) { } # define scale_load_down(w)(w) #endif +#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) + /* * Task weight (visible to users) and its load (invisible to users) have * independent resolution, but they should be well calibrated. We use @@ -664,6 +666,12 @@ struct rq { u64 max_idle_balance_cost; #endif +#ifdef CONFIG_SCHED_WALT + u64 window_start; + u64 curr_runnable_sum; + u64 prev_runnable_sum; +#endif /* CONFIG_SCHED_WALT */ + #ifdef CONFIG_IRQ_TIME_ACCOUNTING u64 prev_irq_time; #endif -- TheMan
[RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
From: Srivatsa Vaddagiri This patch implements an alternative window-based CPU utilization tracking mechanism in the scheduler. Per task and per CPU counters are updated with utilization statistics using a synchronized (across CPUs) time source and a single statistic (prev_runnable_sum) is fed to schedutil in an upcoming patch. A windowed view of time (window size determined by walt_ravg_window) is used to determine CPU utilization. There are two per-CPU-rq quantities maintained by Window Assisted Load Tracking (WALT), both normalized to the max possible frequency and the max efficiency (IPC) of that CPU (provided the arch supports the invariance APIs): curr_runnable_sum: aggregate utilization of all tasks that executed during the current (not yet completed) window prev_runnable_sum: aggregate utilization of all tasks that executed during the most recent completed window prev_runnable_sum is the primary statistic used to guide CPU frequency in lieu of PELT's cfs_rq->avg.util_avg. No additional policy is imposed on this statistic, the assumption being that the consumer (e.g., schedutil) will perform appropriate policy decisions (e.g., headroom) before deciding the next P-state. Corresponding to the aggregate statistics, WALT also mantains the following stats per task: curr_window - represents the cpu utilization of the task in its most recently tracked window prev_window - represents cpu utilization of task in the window prior to the one being tracked by curr_window WALT statistic updates are event driven, with updates occurring in scheduler_tick, pick_next_task and put_prev_task (i.e, in context_switch), task wakeup and during task migration. Migration simply involves removing a tasks's curr_window and prev_window from the source CPU's curr_runnable sum and prev_runnable_sum, and adding the per-task counters to the destination CPU's aggregate CPU counters. Execution time in an IRQ handler is accounted in a CPU's curr_runnable_sum statistic, provided that the CPU was also executing the idle task for the duration of the interrupt handler. Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU rq has tasks blocked on IO, idle-task execution is accounted in per-task and per-CPU counters. Setting walt_io_is_busy will also cause interrupt handlers in the idle task to update counters as if the idle task was executing (instead of just the interrupt handler execution time). The major tunable provided by WALT is walt_ravg_window, which represents window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy (described above) is set to 0 by default. Potential upcoming changes/improvements include: the use of sched_clock instead of ktime_get as a time source and support for an unsynchronized (across CPUs) time source. The event codes (PUT_PREV_TASK etc.) may also potentially be replaced by combinations of existing flags, provided that doesn't make the implementation too fragile. 
Signed-off-by: Srivatsa Vaddagiri Signed-off-by: Vikram Mulukutla --- include/linux/sched/sysctl.h | 1 + init/Kconfig | 9 + kernel/sched/Makefile| 1 + kernel/sched/cputime.c | 10 +- kernel/sched/walt.c | 428 +++ kernel/sched/walt.h | 69 +++ kernel/sysctl.c | 11 ++ 7 files changed, 528 insertions(+), 1 deletion(-) create mode 100644 kernel/sched/walt.c create mode 100644 kernel/sched/walt.h diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 22db1e6..7007815 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -31,6 +31,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay; extern unsigned int sysctl_numa_balancing_scan_period_min; extern unsigned int sysctl_numa_balancing_scan_period_max; extern unsigned int sysctl_numa_balancing_scan_size; +extern unsigned int sysctl_sched_use_walt_metrics; #ifdef CONFIG_SCHED_DEBUG extern unsigned int sysctl_sched_migration_cost; diff --git a/init/Kconfig b/init/Kconfig index cac3f09..1e629ac 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -390,6 +390,15 @@ config IRQ_TIME_ACCOUNTING If in doubt, say N here. +config SCHED_WALT + bool "Support window based load tracking" + depends on SMP + help + This feature will allow the scheduler to maintain a tunable window + based set of metrics for tasks and runqueues. These metrics can be + used to guide task placement as well as task frequency requirements + for cpufreq governors. + config BSD_PROCESS_ACCT bool "BSD Process Accounting" depends on MULTIUSER diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 5e59b83..41ada04 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,6 +19,7 @@ obj-y += core.o loadavg.o clock.o cputime.o obj-y +=
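
As a side note on the migration handling described in the commit text: the
inter-CPU fixup amounts to moving the task's two window counters from the
source runqueue's aggregates to the destination's. An illustrative-only
sketch (not the actual walt.c code; locking and window-boundary
synchronization between the two CPUs are omitted here):

static void walt_fixup_busy_time(struct task_struct *p,
				 struct rq *src_rq, struct rq *dest_rq)
{
	src_rq->curr_runnable_sum  -= p->ravg.curr_window;
	src_rq->prev_runnable_sum  -= p->ravg.prev_window;

	dest_rq->curr_runnable_sum += p->ravg.curr_window;
	dest_rq->prev_runnable_sum += p->ravg.prev_window;
}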
Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
On 2016-10-28 00:29, Peter Zijlstra wrote: On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote: We propose Window-Assisted Load Tracking (WALT) as an alternative or additional load tracking scheme in lieu of or along with PELT, one that in our estimation better tracks task demand and CPU utilization especially for use cases on mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide better perf-per-watt numbers on asymmetric CPU (frequency and/or IPC) implementations, (specifically on Qualcomm Snapgdragon chipsets running Android) and its metrics have been used to guide task placement and p-state selection (load balancing in CFS still uses PELT statistics). WALT is now present in the android-common kernel as well. And how come we only learn of it after its already shipping? Isn't that arse backwards? Yes, but also we were not confident that it would be close to being acceptable upstream since it was intricately tied to our HMP scheduler. However now that more parties including the folks at ARM are interested, and given that EAS exists and schedutil was merged into mainline, we felt it the right time to try and introduce the concept. In general we are seriously trying to get more things upstream and converge.
Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
On 2016-10-28 00:49, Peter Zijlstra wrote: On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote: This RFC patch has been tested on live X86 machines with the following sanity and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick Bellasi for initial code reviews): (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics, with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2 baseline. WALT window size was 10ms. Only deltas above 3% are considered non-noise.Power measured with Intel RAPL counters) Was this comparison done using the use_walt_metric sysctl knob? Yes, it was. You will want to see numbers against a pure 4.8.2 without any of the WALT code, correct?
Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
On 2016-10-28 00:43, Peter Zijlstra wrote: On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote: +u64 walt_ktime_clock(void) +{ + if (unlikely(walt_ktime_suspended)) + return ktime_to_ns(ktime_last); + return ktime_get_ns(); +} +static int walt_suspend(void) +{ + ktime_last = ktime_get(); + walt_ktime_suspended = true; + return 0; +} No, ktime_get() will not be used in the scheduler. Imagine the joy if that thing ends up being the HPET. Agreed, this is an artifact from the full implementation that feeds into the interactive governor, and thus both needed to use the same time source. It shall go away.
Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
On 2016-10-28 00:46, Peter Zijlstra wrote: On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote: +void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked) +{ + u64 wallclock; + unsigned long flags; + + if (!p->on_rq && p->state != TASK_WAKING) + return; + + if (locked == false) + raw_spin_lock_irqsave(&dest_rq->lock, flags); + + + if (locked == false) + raw_spin_unlock_irqrestore(&dest_rq->lock, flags); +} + +void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked) +{ + u64 wallclock; + unsigned long flags; + + if (!p->on_rq && p->state != TASK_WAKING) + return; + + if (locked == false) + raw_spin_lock_irqsave(&src_rq->lock, flags); + + + if (locked == false) + raw_spin_unlock_irqrestore(&src_rq->lock, flags); +} Seriously bad style that. Please, less bonghits before writing code. This was my bad personal attempt at eliminating double-locking from the original code. This was pointed out earlier and shall go away once I can come up with a way to merge this into enqeue/dequeue sans bonghits :-)
spin_lock behavior with ARM64 big.Little/HMP
Hello,

This isn't really a bug report, but just a description of a frequency/IPC
dependent behavior that I'm curious if we should worry about. The behavior is
exposed by questionable design so I'm leaning towards don't-care.

Consider these threads running in parallel on two ARM64 CPUs running mainline
Linux. (Ordering of lines between the two columns does not indicate a
sequence of execution. Assume flag=0 initially.)

LittleARM64_CPU @ 300MHz (e.g. A53)  | BigARM64_CPU @ 1.5GHz (e.g. A57)
-------------------------------------+----------------------------------
spin_lock_irqsave(s)                 | local_irq_save()
/* critical section */
flag = 1                             | spin_lock(s)
spin_unlock_irqrestore(s)            | while (!flag) {
                                     |         spin_unlock(s)
                                     |         cpu_relax();
                                     |         spin_lock(s)
                                     | }
                                     | spin_unlock(s)
                                     | local_irq_restore()

I see a livelock occurring where the LittleCPU is never able to acquire the
lock, and the BigCPU is stuck forever waiting on 'flag' to be set. Even with
ticket spinlocks, this bit of code can cause a livelock (or very long delays)
if BigCPU runs fast enough. Afaics this can only happen if the LittleCPU is
unable to put its ticket in the queue (i.e, increment the next field) since
the store-exclusive keeps failing.

The problem is not present on SMP, and is mitigated by adding enough
additional clock cycles between the unlock and lock in the loop running on
the BigCPU. On big.Little, if both threads are scheduled on the same cluster
within the same clock domain, the problem is avoided.

Now the infinite loop may seem like questionable design but the problem isn't
entirely hypothetical; if BigCPU calls hrtimer_cancel with interrupts
disabled, this scenario can result if the hrtimer is about to run on a
littleCPU. It's of course possible that there's just enough intervening code
for the problem to not occur. At the very least it seems that loops like the
one running in the BigCPU above should come with a WARN_ON(irqs_disabled) or
with a sufficient udelay() instead of the cpu_relax.

Thoughts?

Thanks,
Vikram
Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
On 2016-10-28 01:49, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:57:05AM -0700, Vikram Mulukutla wrote:
>> On 2016-10-28 00:49, Peter Zijlstra wrote:
>>> On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>>>> This RFC patch has been tested on live X86 machines with the following
>>>> sanity and benchmark results (thanks to Juri Lelli, Dietmar Eggeman,
>>>> Patrick Bellasi for initial code reviews):
>>>>
>>>> (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
>>>> graphics, with the same frequency list as above. Running Ubuntu 16.04
>>>> on a v4.8.2 baseline. WALT window size was 10ms. Only deltas above 3%
>>>> are considered non-noise. Power measured with Intel RAPL counters)
>>>
>>> Was this comparison done using the use_walt_metric sysctl knob?
>>
>> Yes, it was. You will want to see numbers against a pure 4.8.2 without
>> any of the WALT code, correct?
>
> Yep, because with the sysctl we still run all the accounting code. So esp
> things like the hackbench run are meaningless (note that even the CONFIG
> thing doesn't take out everything).
>
> Also, I think it makes sense to always (also) compare against the
> "performance" governor. That way you can see the drop in absolute
> performance etc..

Ok, here are some numbers. I should be able to get the rest during the week.
The averages are pretty close, so I figured I would include some percentile
numbers. PELT and WALT numbers are with schedutil. On average it seems we're
introducing about 0.5% overhead with the current additional accounting.

Test: perf bench sched messaging -g 50 -l 5000

+------------+-------------+----------+--------+------------+
| 4.8.2-walt | Performance | Ondemand | PELT   | WALT(10ms) |
+------------+-------------+----------+--------+------------+
| 90th       | 17.077      | 17.088   | 17.088 | 17.159     |
| 96th       | 17.117      | 17.421   | 17.198 | 17.343     |
| Average    | 16.910      | 16.916   | 16.937 | 16.962     |
+------------+-------------+----------+--------+------------+

+------------+-------------+----------+--------+------------+
| 4.8.2-raw  | Performance | Ondemand | PELT   | WALT(10ms) |
+------------+-------------+----------+--------+------------+
| 90th       | 16.919      | 17.1694  | 16.986 | 0          |
| 96th       | 16.980      | 17.364   | 17.052 | 0          |
| Average    | 16.841      | 16.902   | 16.860 | 0          |
+------------+-------------+----------+--------+------------+
Re: spin_lock behavior with ARM64 big.Little/HMP
Hi Sudeep,

Thanks for taking a look!

On 2016-11-18 02:30, Sudeep Holla wrote:
> Hi Vikram,
>
> On 18/11/16 02:22, Vikram Mulukutla wrote:
>> Hello,
>>
>> This isn't really a bug report, but just a description of a frequency/IPC
>> dependent behavior that I'm curious if we should worry about. The behavior
>> is exposed by questionable design so I'm leaning towards don't-care.
>>
>> Consider these threads running in parallel on two ARM64 CPUs running
>> mainline Linux:
>
> Are you seeing this behavior with the mainline kernel on any platforms as
> we have a sort of workaround for this ?

If I understand that workaround correctly, the ARM timer event stream is used
to periodically wake up CPUs that are waiting in WFE, is that right? I think
my scenario below may be different because LittleCPU doesn't actually wait on
a WFE event in the loop that is trying to increment lock->next, i.e. it's
stuck in the following loop:

	ARM64_LSE_ATOMIC_INSN(
	/* LL/SC */
"	prfm	pstl1strm, %3\n"
"1:	ldaxr	%w0, %3\n"
"	add	%w1, %w0, %w5\n"
"	stxr	%w2, %w1, %3\n"
"	cbnz	%w2, 1b\n",

I have been testing internal platforms; I'll try to test on something
available publicly that's b.L. In any case, the timer event stream was
enabled when I tried this out.

>> (Ordering of lines between the two columns does not indicate a sequence
>> of execution. Assume flag=0 initially.)
>>
>> LittleARM64_CPU @ 300MHz (e.g. A53)  | BigARM64_CPU @ 1.5GHz (e.g. A57)
>> -------------------------------------+----------------------------------
>> spin_lock_irqsave(s)                 | local_irq_save()
>> /* critical section */
>> flag = 1                             | spin_lock(s)
>> spin_unlock_irqrestore(s)            | while (!flag) {
>>                                      |         spin_unlock(s)
>>                                      |         cpu_relax();
>>                                      |         spin_lock(s)
>>                                      | }
>>                                      | spin_unlock(s)
>>                                      | local_irq_restore()

[...]

Thanks,
Vikram
Re: spin_lock behavior with ARM64 big.Little/HMP
Hi Sudeep, Interesting. Just curious if this is r0p0/p1 A53 ? If so, is the errata 819472 enabled ? Sorry for bringing this up after the loo-ong delay, but I've been assured that the A53 involved is > r0p1. I've also confirmed this problem on multiple internal platforms, and I'm pretty sure that it occurs on any b.L out there today. Also, we found the same problematic lock design used in the workqueue code in the kernel, causing the same livelock. It's very very rare and requires a perfect set of circumstances. If it would help I can provide a unit test if you folks would be generous enough to test it on the latest Juno or something b.L that's also upstream. Thanks, Vikram
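
If it helps in the meantime, the unit test I have in mind is roughly the
following throwaway module -- a sketch only. The bound CPU numbers are
placeholders for one big and one LITTLE core on the target, there is
intentionally no cleanup path, and if the problem reproduces, expect the
'little' thread to spin indefinitely (and the usual watchdog/RCU complaints
from the wedged big CPU):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/err.h>

static DEFINE_SPINLOCK(test_lock);
static int flag;

/* Mirrors the LittleARM64_CPU column: take the lock once and set the flag. */
static int little_fn(void *unused)
{
	unsigned long irqflags;

	msleep(10);	/* let the big CPU enter its polling loop first */
	spin_lock_irqsave(&test_lock, irqflags);
	flag = 1;
	spin_unlock_irqrestore(&test_lock, irqflags);
	pr_info("livelock test: little CPU acquired the lock\n");
	return 0;
}

/* Mirrors the BigARM64_CPU column: poll 'flag' with interrupts disabled. */
static int big_fn(void *unused)
{
	local_irq_disable();
	spin_lock(&test_lock);
	while (!flag) {
		spin_unlock(&test_lock);
		cpu_relax();
		spin_lock(&test_lock);
	}
	spin_unlock(&test_lock);
	local_irq_enable();
	pr_info("livelock test: big CPU saw the flag\n");
	return 0;
}

static int __init livelock_test_init(void)
{
	struct task_struct *big, *little;

	/* Minimal error handling; this is a throwaway test. */
	big = kthread_create(big_fn, NULL, "livelock_big");
	little = kthread_create(little_fn, NULL, "livelock_little");
	if (IS_ERR(big) || IS_ERR(little))
		return -ENOMEM;

	kthread_bind(big, 4);		/* placeholder: a big-cluster CPU */
	kthread_bind(little, 0);	/* placeholder: a LITTLE-cluster CPU */
	wake_up_process(big);
	wake_up_process(little);
	return 0;
}
module_init(livelock_test_init);
MODULE_LICENSE("GPL");

On an unaffected system both messages show up almost immediately; on an
affected one the 'little' message may never appear while the big CPU spins
with interrupts disabled.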
Re: [RFD PATCH 3/5] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE
OK So there are two pieces here. One is that if we want *all* drivers to work with schedutil, we need to keep the kthread for the ones that will never be reworked (because nobody cares etc). But then perhaps the kthread implementation may be left alone (because nobody cares etc). The second one is that there are drivers operating in-context that work with schedutil already, so I don't see major obstacles to making more drivers work that way. That would be only a matter of reworking the drivers in question. Thanks, Rafael There are some MSM platforms that do need a kthread and would love to use schedutil. This is all mainly due to the point that Vincent raised; having to actually wait for voltage transitions before clock switches. I can't speak about the future, but that's the situation right now. Leaving the kthread alone for now would be appreciated! Thanks, Vikram
[PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme
kthread_park waits for the target kthread to park itself with
__kthread_parkme using a completion variable. __kthread_parkme - which is
invoked by the target kthread - sets the completion variable before calling
schedule() to voluntarily get itself off of the runqueue.

This causes an interesting race in the hotplug path. takedown_cpu() invoked
for CPU_X attempts to park the cpuhp/X hotplug kthread before running the
stopper thread on CPU_X. kthread_unpark doesn't guarantee that cpuhp/X is off
of X's runqueue, only that the thread has executed __kthread_parkme and set
the completion. cpuhp/X may have been preempted out before calling schedule()
to voluntarily sleep. takedown_cpu proceeds to run the stopper thread on
CPU_X which promptly migrates off the still-on-rq cpuhp/X thread to another
cpu CPU_Y, setting its affinity mask to something other than CPU_X alone.

This is OK - cpuhp/X may finally get itself off of CPU_Y's runqueue at some
later point. But if that doesn't happen (for example, if there's an RT thread
on CPU_Y), the kthread_unpark in a subsequent cpu_up call for CPU_X will race
with the still-on-rq condition. Even now we're functionally OK because there
is a wait_task_inactive in the kthread_unpark(), BUT the following happens:

[   12.472745] BUG: scheduling while atomic: swapper/7/0/0x0002
[   12.472749] Modules linked in:
[   12.472756] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
[   12.472758] Hardware name: X
[   12.472760] Call trace:
[   12.472773] [] dump_backtrace+0x0/0x198
[   12.472777] [] show_stack+0x14/0x1c
[   12.472781] [] dump_stack+0x8c/0xac
[   12.472786] [] __schedule_bug+0x54/0x70
[   12.472792] [] __schedule+0x6b4/0x928
[   12.472794] [] schedule+0x3c/0xa0
[   12.472797] [] schedule_hrtimeout_range_clock+0x80/0xec
[   12.472799] [] schedule_hrtimeout+0x18/0x20
[   12.472803] [] wait_task_inactive+0x1a0/0x1a4
[   12.472806] [] __kthread_bind_mask+0x20/0x7c
[   12.472809] [] __kthread_bind+0x28/0x30
[   12.472811] [] __kthread_unpark+0x5c/0x60
[   12.472814] [] kthread_unpark+0x24/0x2c
[   12.472818] [] cpuhp_online_idle+0x50/0x90
[   12.472822] [] cpu_startup_entry+0x3c/0x1d4
[   12.472824] [] secondary_start_kernel+0x164/0x1b4

Since the kthread_unpark is invoked from a preemption-disabled context,
wait_task_inactive's action of invoking schedule is invalid, causing the
splat. Note that kthread_bind_mask is correctly attempting to re-set the
affinity mask since cpuhp is a per-cpu smpboot thread.

Instead of adding an expensive wait_task_inactive inside kthread_park() or
trying to muck with the hotplug code, let's just ensure that the completion
variable and the schedule happen atomically inside __kthread_parkme. This
focuses the fix to the hotplug requirement alone, and removes the unnecessary
migration of cpuhp/X.

Signed-off-by: Vikram Mulukutla
---
 kernel/kthread.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..7ad3354 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -171,9 +171,20 @@ static void __kthread_parkme(struct kthread *self)
 {
 	__set_current_state(TASK_PARKED);
 	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+		/*
+		 * Why the preempt_disable?
+		 * Hotplug needs to ensure that 'self' is off of the runqueue
+		 * as well, before scheduling the stopper thread that will
+		 * migrate tasks off of the runqeue that 'self' was running on.
+		 * This avoids unnecessary migration work and also ensures that
+		 * kthread_unpark in the cpu_up path doesn't race with
+		 * __kthread_parkme.
+		 */
+		preempt_disable();
 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
 			complete(&self->parked);
-		schedule();
+		schedule_preempt_disabled();
+		preempt_enable();
 		__set_current_state(TASK_PARKED);
 	}
 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme
correcting Thomas Gleixner's email address. s/linuxtronix/linutronix On 6/26/2017 3:18 PM, Vikram Mulukutla wrote: kthread_park waits for the target kthread to park itself with __kthread_parkme using a completion variable. __kthread_parkme - which is invoked by the target kthread - sets the completion variable before calling schedule() to voluntarily get itself off of the runqueue. This causes an interesting race in the hotplug path. takedown_cpu() invoked for CPU_X attempts to park the cpuhp/X hotplug kthread before running the stopper thread on CPU_X. kthread_unpark doesn't guarantee that cpuhp/X is off of X's runqueue, only that the thread has executed __kthread_parkme and set the completion. cpuhp/X may have been preempted out before calling schedule() to voluntarily sleep. takedown_cpu proceeds to run the stopper thread on CPU_X which promptly migrates off the still-on-rq cpuhp/X thread to another cpu CPU_Y, setting its affinity mask to something other than CPU_X alone. This is OK - cpuhp/X may finally get itself off of CPU_Y's runqueue at some later point. But if that doesn't happen (for example, if there's an RT thread on CPU_Y), the kthread_unpark in a subsequent cpu_up call for CPU_X will race with the still-on-rq condition. Even now we're functionally OK because there is a wait_task_inactive in the kthread_unpark(), BUT the following happens: [ 12.472745] BUG: scheduling while atomic: swapper/7/0/0x0002 [ 12.472749] Modules linked in: [ 12.472756] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680 [ 12.472758] Hardware name: X [ 12.472760] Call trace: [ 12.472773] [] dump_backtrace+0x0/0x198 [ 12.472777] [] show_stack+0x14/0x1c [ 12.472781] [] dump_stack+0x8c/0xac [ 12.472786] [] __schedule_bug+0x54/0x70 [ 12.472792] [] __schedule+0x6b4/0x928 [ 12.472794] [] schedule+0x3c/0xa0 [ 12.472797] [] schedule_hrtimeout_range_clock+0x80/0xec [ 12.472799] [] schedule_hrtimeout+0x18/0x20 [ 12.472803] [] wait_task_inactive+0x1a0/0x1a4 [ 12.472806] [] __kthread_bind_mask+0x20/0x7c [ 12.472809] [] __kthread_bind+0x28/0x30 [ 12.472811] [] __kthread_unpark+0x5c/0x60 [ 12.472814] [] kthread_unpark+0x24/0x2c [ 12.472818] [] cpuhp_online_idle+0x50/0x90 [ 12.472822] [] cpu_startup_entry+0x3c/0x1d4 [ 12.472824] [] secondary_start_kernel+0x164/0x1b4 Since the kthread_unpark is invoked from a preemption-disabled context, wait_task_inactive's action of invoking schedule is invalid, causing the splat. Note that kthread_bind_mask is correctly attempting to re-set the affinity mask since cpuhp is a per-cpu smpboot thread. Instead of adding an expensive wait_task_inactive inside kthread_park() or trying to muck with the hotplug code, let's just ensure that the completion variable and the schedule happen atomically inside __kthread_parkme. This focuses the fix to the hotplug requirement alone, and removes the unnecessary migration of cpuhp/X. Signed-off-by: Vikram Mulukutla --- kernel/kthread.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/kernel/kthread.c b/kernel/kthread.c index 26db528..7ad3354 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -171,9 +171,20 @@ static void __kthread_parkme(struct kthread *self) { __set_current_state(TASK_PARKED); while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) { + /* +* Why the preempt_disable? +* Hotplug needs to ensure that 'self' is off of the runqueue +* as well, before scheduling the stopper thread that will +* migrate tasks off of the runqeue that 'self' was running on. 
+* This avoids unnecessary migration work and also ensures that +* kthread_unpark in the cpu_up path doesn't race with +* __kthread_parkme. +*/ + preempt_disable(); if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags)) complete(&self->parked); - schedule(); + schedule_preempt_disabled(); + preempt_enable(); __set_current_state(TASK_PARKED); } clear_bit(KTHREAD_IS_PARKED, &self->flags);
Re: [PATCH v9 1/5] firmware: add extensible driver data params
On 6/26/2017 10:33 AM, Luis R. Rodriguez wrote: On Sat, Jun 24, 2017 at 02:39:51PM +0200, Greg KH wrote: On Sat, Jun 24, 2017 at 02:48:28AM +0200, Luis R. Rodriguez wrote: On Fri, Jun 23, 2017 at 04:09:29PM -0700, Linus Torvalds wrote: On Fri, Jun 23, 2017 at 3:43 PM, Luis R. Rodriguez wrote: Ah, yes! Here is what I believe seems to be the *crux* issue of these patch series and I'm happy we have finally landed on it. Yes, indeed the new API proposed here provides more flexibility, and it does so by embracing a "data driven" API Vs the traditional procedural APIs we have seen for *the firmware API*. This has been going on forever. Everybody hates your data-driven one. Before you, the only person who had expressed disdain here was Greg. Very few people actually review code, you know that. Using that logic, then of course "everybody" was *very* fitting ;) Then again others who actually are working on extending the firmware API (Yi Li), or maintaining vendor trees (Vikram), did express their opinions on the current codebase and their appreciate for the changes I made, however this went selectively unnoticed. Things like that may be ok as an internal implementation, but even there it's questionable if it then means a big disconnect between what people actually use (the normal functional model) and the implementation. A vendor tree implemented their *own* solution and were willing to maintain it despite this likely making it hard to port stable fixes. That I think says a lot for a need... What vendor tree? Where was it shipped? The msm-3.18 kernel [0], so assuming this goes to mobile devices, this could mean millions of devices. https://source.codeaurora.org/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=msm-3.18&id=7aa7efd3c150840369739893a84bd1d9f9774319 Why was it external and how is it different from your patches? As is typical with external trees -- it would seem Vikram actually wrote the original request_firmware_into_buf() API for the msm tree. It contained the fw_desc changes. Stephen Boyd seems to have worked on the actual upstreaming effort and he dropped that fw_desc effort from the upstreaming effort. Vikarm noted he had had a similar internal discussion with Stephen Stephen Boyd as I am with you on this thread back when request_firmware_into_buf() was being upstreamed [0]. He noted that back then reason for this proposed change was that "the number of things being passed around between layers of functions inside firmware_class seemed a bit untenable". I will note around that time I had proposed a similar change using the fw_desc name, it was only later that this renamed to a params postfix as Linus did not like the descriptor name. [0] https://lkml.kernel.org/r/20ac6fa65c8ff4ef83386aa1e8d5c...@codeaurora.org The only difference is that his patch does only modifying the private members of the internal API and routines from my patch 1/5, and he kept the "descriptor" name Linus disliked a while ago. This is precisely why AKASHI noted I could split up my patch 1 in more ways in this series to help *patch review*. Was it used because your version has taken so long to be submitted/reviwed? Vikram would have a better idea as he is the one who authored it, but it would seem this effort was in parallel to my own at that time. I must shamefully admit that the story is a bit older - the patch I originally worked on was on a v3.4 based tree. 
We had been forward porting it until Stephen Boyd was kind enough (or tired of it) to take time out of his clock maintainer-ship and upstream the request_firmware_into_buf API. At that point of time it seemed that the 'desc' approach was unnecessary, and I agreed. So Luis's series came in much later and wasn't a factor in forward-porting the patches. While it does seem that the _internal_ implementation of firmware_class can be a bit friendlier to adding the features that are on their way, I can't say the same about the API being exposed to drivers in mainline; maintainers and folks with more experience in kernel API evolution are better equipped to answer that question. Thanks, Vikram
Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme
Ping. This is happening on x86 across suspend/resume too. https://bugs.freedesktop.org/show_bug.cgi?id=100113 On 2017-06-26 16:03, Vikram Mulukutla wrote: correcting Thomas Gleixner's email address. s/linuxtronix/linutronix On 6/26/2017 3:18 PM, Vikram Mulukutla wrote: kthread_park waits for the target kthread to park itself with __kthread_parkme using a completion variable. __kthread_parkme - which is invoked by the target kthread - sets the completion variable before calling schedule() to voluntarily get itself off of the runqueue. Thanks, Vikram -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2 4/6] cpufreq: schedutil: update CFS util only if used
On 2017-07-07 23:14, Joel Fernandes wrote: Hi Vikram, On Thu, Jul 6, 2017 at 11:44 PM, Vikram Mulukutla wrote: On 2017-07-04 10:34, Patrick Bellasi wrote: Currently the utilization of the FAIR class is collected before locking the policy. Although that should not be a big issue for most cases, we also don't really know how much latency there can be between the utilization reading and its usage. Let's get the FAIR utilization right before its usage to be better in sync with the current status of a CPU. Signed-off-by: Patrick Bellasi Given that the utilization update hooks are called with the per-cpu rq lock held (for all classes), I don't think PELT utilization can change throughout the lifetime of the cpufreq_update_{util,this_cpu} call? Even with Viresh's remote cpu callback series we'd still have to hold the rq lock across cpufreq_update_util.. what can change today is 'max' (arch_scale_cpu_capacity) when a cpufreq policy is shared, so the patch is still needed for that reason I think? I didn't follow - could you elaborate more on why you think the patch helps with the case where max changes while the per-cpu rq lock is held? So going by Patrick's commit text, the concern was a TOCTOU (time-of-check to time-of-use) problem, but since we agree that CFS utilization can't change within an rq-locked critical section, the only thing that can change is 'max'. So you might be the 8th cpu in line waiting for the sg-policy lock, and after you get the lock, the max may be outdated. But come to think of it, max changes should be triggering schedutil updates and those shouldn't be rate-throttled, so maybe we don't need this at all? It's still somewhat future-proof in case there is some stat that we read in sugov_get_util that can be updated asynchronously. However we can put it in when we need it... thanks, -Joel -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
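For context, the two values being discussed come from sugov_get_util(), which in the kernels this thread is against looks roughly like the following (treat this as a paraphrase rather than an exact copy; the inline comments are mine and only restate the argument above):

static void sugov_get_util(unsigned long *util, unsigned long *max)
{
	struct rq *rq = this_rq();
	unsigned long cfs_max;

	/*
	 * 'max' is the CPU's current capacity; this is the value that can
	 * be stale by the time the shared sg_policy->update_lock is taken.
	 */
	cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());

	/* CFS/PELT utilization: cannot change while the rq lock is held. */
	*util = min(rq->cfs.avg.util_avg, cfs_max);
	*max = cfs_max;
}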
Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme
On 7/4/2017 9:07 AM, Peter Zijlstra wrote: On Mon, Jun 26, 2017 at 03:18:03PM -0700, Vikram Mulukutla wrote: kernel/kthread.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/kernel/kthread.c b/kernel/kthread.c index 26db528..7ad3354 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -171,9 +171,20 @@ static void __kthread_parkme(struct kthread *self) { __set_current_state(TASK_PARKED); while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) { + /* +* Why the preempt_disable? +* Hotplug needs to ensure that 'self' is off of the runqueue +* as well, before scheduling the stopper thread that will +* migrate tasks off of the runqueue that 'self' was running on. +* This avoids unnecessary migration work and also ensures that +* kthread_unpark in the cpu_up path doesn't race with +* __kthread_parkme. +*/ + preempt_disable(); if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags)) complete(&self->parked); + schedule_preempt_disabled(); This is broken. schedule_preempt_disabled() doesn't guarantee no preemptions, just makes it less likely. Right, the API just informs the scheduler that the calling thread wishes to have preemption disabled when the API returns. I thought it was going to guarantee no preemption until the thread is actually off of the runqueue, but I see the window where an interrupt might preempt. Doh. Separate from this hotplug problem, would it be entirely moronic to have the API disable and enable local interrupts across that short window? I suppose there's no one that needs this sort of thing so.. no? + preempt_enable(); __set_current_state(TASK_PARKED); } clear_bit(KTHREAD_IS_PARKED, &self->flags); Thanks, Vikram
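For reference, schedule_preempt_disabled() in kernels of this vintage is essentially just the following, which is where the window Peter points out comes from (the comment is mine, the body reflects kernel/sched/core.c):

void __sched schedule_preempt_disabled(void)
{
	sched_preempt_enable_no_resched();	/* preempt_count drops back to 0 */
	/*
	 * An interrupt that lands here can still preempt the parked thread
	 * while it is on the runqueue; the preempt_disable() in the patch
	 * above shrinks the window after complete(&self->parked) but cannot
	 * close it.
	 */
	schedule();
	preempt_disable();
}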
Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme
On 7/4/2017 12:49 PM, Thomas Gleixner wrote: On Mon, 26 Jun 2017, Vikram Mulukutla wrote: On 6/26/2017 3:18 PM, Vikram Mulukutla wrote: kthread_park waits for the target kthread to park itself with __kthread_parkme using a completion variable. __kthread_parkme - which is invoked by the target kthread - sets the completion variable before calling schedule() to voluntarily get itself off of the runqueue. This causes an interesting race in the hotplug path. takedown_cpu() invoked for CPU_X attempts to park the cpuhp/X hotplug kthread before running the stopper thread on CPU_X. kthread_park doesn't guarantee that cpuhp/X is off of X's runqueue, only that the thread has executed __kthread_parkme and set the completion. cpuhp/X may have been preempted out before calling schedule() to voluntarily sleep. takedown_cpu proceeds to run the stopper thread on CPU_X which promptly migrates off the still-on-rq cpuhp/X thread to another cpu CPU_Y, setting its affinity mask to something other than CPU_X alone. This is OK - cpuhp/X may finally get itself off of CPU_Y's runqueue at some later point. But if that doesn't happen (for example, if there's an RT thread on CPU_Y), the kthread_unpark in a subsequent cpu_up call for CPU_X will race with the still-on-rq condition. Even now we're functionally OK because there is a wait_task_inactive in the kthread_unpark(), BUT the following happens: [ 12.472745] BUG: scheduling while atomic: swapper/7/0/0x0002 That's not the worst problem. We could simply enable preemption there, but the real issue is that this is the idle task of the upcoming CPU which is not supposed to schedule in the first place. So no, your 'fix' is just papering over the underlying issue. And yes, the moron who did not think about wait_task_inactive() being called via kthread_unpark() -> kthread_bind() is me. I'm testing a proper fix for it right now. Will post later. Thanks - it did feel totally wrong to have any sort of scheduling in the idle thread, as the subsequent warnings do indicate, but I didn't feel confident enough to mess around with the hotplug state machine. Thanks, tglx Thanks, Vikram
Re: [PATCH v2 4/6] cpufreq: schedutil: update CFS util only if used
On 2017-07-04 10:34, Patrick Bellasi wrote: Currently the utilization of the FAIR class is collected before locking the policy. Although that should not be a big issue for most cases, we also don't really know how much latency there can be between the utilization reading and its usage. Let's get the FAIR utilization right before its usage to be better in sync with the current status of a CPU. Signed-off-by: Patrick Bellasi Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Viresh Kumar Cc: linux-kernel@vger.kernel.org Cc: linux...@vger.kernel.org --- kernel/sched/cpufreq_schedutil.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 98704d8..df433f1 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -308,10 +308,9 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time, if (unlikely(current == sg_policy->thread)) return; - sugov_get_util(&util, &max); - raw_spin_lock(&sg_policy->update_lock); + sugov_get_util(&util, &max); sg_cpu->util = util; sg_cpu->max = max; Given that the utilization update hooks are called with the per-cpu rq lock held (for all classes), I don't think PELT utilization can change throughout the lifetime of the cpufreq_update_{util,this_cpu} call? Even with Viresh's remote cpu callback series we'd still have to hold the rq lock across cpufreq_update_util.. what can change today is 'max' (arch_scale_cpu_capacity) when a cpufreq policy is shared, so the patch is still needed for that reason I think? Thanks, Vikram -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the control CPU
On 2017-07-04 13:20, Thomas Gleixner wrote: Vikram reported the following backtrace: BUG: scheduling while atomic: swapper/7/0/0x0002 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680 schedule schedule_hrtimeout_range_clock schedule_hrtimeout wait_task_inactive __kthread_bind_mask __kthread_bind __kthread_unpark kthread_unpark cpuhp_online_idle cpu_startup_entry secondary_start_kernel He analyzed correctly that a parked cpu hotplug thread of an offlined CPU was still on the runqueue when the CPU came back online and tried to unpark it. This causes the thread which invoked kthread_unpark() to call wait_task_inactive() and subsequently schedule() with preemption disabled. His proposed workaround was to "make sure" that a parked thread has scheduled out when the CPU goes offline, so the situation cannot happen. But that's still wrong because the root cause is not the fact that the percpu thread is still on the runqueue and neither that preemption is disabled, which could be simply solved by enabling preemption before calling kthread_unpark(). The real issue is that the calling thread is the idle task of the upcoming CPU, which is not supposed to call anything which might sleep. The moron, who wrote that code, missed completely that kthread_unpark() might end up in schedule(). Agreed. Presumably we are still OK with the cpu hotplug thread being migrated off to random CPUs and its unfinished kthread_parkme racing with a subsequent unpark? The cpuhp/X thread ends up on a random rq if it can't do schedule() in time because migration/X yanks it off of the dying CPU X. Apart from slightly disconcerting traces showing cpuhp/X momentarily executing on CPU Y, there's no problem that I can see of course. Probably worth mentioning that the following patch from Junaid Shahid seems to indicate that such a race doesn't always work out as desired: https://lkml.org/lkml/2017/4/28/755 The solution is simpler than expected. The thread which controls the hotplug operation is waiting for the CPU to call complete() on the hotplug state completion. So the idle task of the upcoming CPU can set its state to CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the control task on a different CPU, which then can safely do the unpark and kick the now unparked hotplug thread of the upcoming CPU to complete the bringup to the final target state. Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully up") Reported-by: Vikram Mulukutla Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Sebastian Siewior Cc: ru...@rustcorp.com.au Cc: t...@kernel.org Cc: a...@linux-foundation.org Cc: sta...@vger.kernel.org --- kernel/cpu.c | 30 -- 1 file changed, 16 insertions(+), 14 deletions(-) --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -271,11 +271,25 @@ void cpu_hotplug_enable(void) EXPORT_SYMBOL_GPL(cpu_hotplug_enable); #endif /* CONFIG_HOTPLUG_CPU */ +static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st); + static int bringup_wait_for_ap(unsigned int cpu) { struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + /* Wait for the CPU to reach IDLE_ONLINE */ wait_for_completion(&st->done); + BUG_ON(!cpu_online(cpu)); + + /* Unpark the stopper thread and the hotplug thread of the target cpu */ + stop_machine_unpark(cpu); + kthread_unpark(st->thread); + + /* Should we go further up ? 
*/ + if (st->target > CPUHP_AP_ONLINE_IDLE) { + __cpuhp_kick_ap_work(st); + wait_for_completion(&st->done); + } return st->result; } @@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu) irq_unlock_sparse(); if (ret) return ret; - ret = bringup_wait_for_ap(cpu); - BUG_ON(!cpu_online(cpu)); - return ret; + return bringup_wait_for_ap(cpu); } /* @@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cp void cpuhp_online_idle(enum cpuhp_state state) { struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state); - unsigned int cpu = smp_processor_id(); /* Happens for the boot cpu */ if (state != CPUHP_AP_ONLINE_IDLE) return; st->state = CPUHP_AP_ONLINE_IDLE; - - /* Unpark the stopper thread and the hotplug thread of this cpu */ - stop_machine_unpark(cpu); - kthread_unpark(st->thread); - - /* Should we go further up ? */ - if (st->target > CPUHP_AP_ONLINE_IDLE) - __cpuhp_kick_ap_work(st); - else - complete(&st->done); + complete(&st->done); } /* Requires cpu_add_remove_lock to be held */ Nice and clean otherwise. Channagoud was instrumental in collecting data, theorizing with me and testing your fix, so if the
Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the control CPU
On 2017-07-07 00:47, Vikram Mulukutla wrote: On 2017-07-04 13:20, Thomas Gleixner wrote: Vikram reported the following backtrace: BUG: scheduling while atomic: swapper/7/0/0x0002 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680 schedule schedule_hrtimeout_range_clock schedule_hrtimeout wait_task_inactive __kthread_bind_mask __kthread_bind __kthread_unpark kthread_unpark cpuhp_online_idle cpu_startup_entry secondary_start_kernel He analyzed correctly that a parked cpu hotplug thread of an offlined CPU was still on the runqueue when the CPU came back online and tried to unpark it. This causes the thread which invoked kthread_unpark() to call wait_task_inactive() and subsequently schedule() with preemption disabled. His proposed workaround was to "make sure" that a parked thread has scheduled out when the CPU goes offline, so the situation cannot happen. But that's still wrong because the root cause is not the fact that the percpu thread is still on the runqueue and neither that preemption is disabled, which could be simply solved by enabling preemption before calling kthread_unpark(). The real issue is that the calling thread is the idle task of the upcoming CPU, which is not supposed to call anything which might sleep. The moron, who wrote that code, missed completely that kthread_unpark() might end up in schedule(). Agreed. Presumably we are still OK with the cpu hotplug thread being migrated off to random CPUs and its unfinished kthread_parkme racing with a subsequent unpark? The cpuhp/X thread ends up on a random rq if it can't do schedule() in time because migration/X yanks it off of the dying CPU X. Apart from slightly disconcerting traces showing cpuhp/X momentarily executing on CPU Y, there's no problem that I can see of course. Probably worth mentioning that the following patch from Junaid Shahid seems to indicate that such a race doesn't always work out as desired: https://lkml.org/lkml/2017/4/28/755 Ah, Junaid's problem/patch wouldn't be relevant in the hotplug case because of the completion I think. The solution is simpler than expected. The thread which controls the hotplug operation is waiting for the CPU to call complete() on the hotplug state completion. So the idle task of the upcoming CPU can set its state to CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the control task on a different CPU, which then can safely do the unpark and kick the now unparked hotplug thread of the upcoming CPU to complete the bringup to the final target state. Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully up") Reported-by: Vikram Mulukutla Signed-off-by: Thomas Gleixner Cc: Peter Zijlstra Cc: Sebastian Siewior Cc: ru...@rustcorp.com.au Cc: t...@kernel.org Cc: a...@linux-foundation.org Cc: sta...@vger.kernel.org --- kernel/cpu.c | 30 -- 1 file changed, 16 insertions(+), 14 deletions(-) --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -271,11 +271,25 @@ void cpu_hotplug_enable(void) EXPORT_SYMBOL_GPL(cpu_hotplug_enable); #endif /* CONFIG_HOTPLUG_CPU */ +static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st); + static int bringup_wait_for_ap(unsigned int cpu) { struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu); + /* Wait for the CPU to reach IDLE_ONLINE */ wait_for_completion(&st->done); + BUG_ON(!cpu_online(cpu)); + + /* Unpark the stopper thread and the hotplug thread of the target cpu */ + stop_machine_unpark(cpu); + kthread_unpark(st->thread); + + /* Should we go further up ? 
*/ + if (st->target > CPUHP_AP_ONLINE_IDLE) { + __cpuhp_kick_ap_work(st); + wait_for_completion(&st->done); + } return st->result; } @@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu) irq_unlock_sparse(); if (ret) return ret; - ret = bringup_wait_for_ap(cpu); - BUG_ON(!cpu_online(cpu)); - return ret; + return bringup_wait_for_ap(cpu); } /* @@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cp void cpuhp_online_idle(enum cpuhp_state state) { struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state); - unsigned int cpu = smp_processor_id(); /* Happens for the boot cpu */ if (state != CPUHP_AP_ONLINE_IDLE) return; st->state = CPUHP_AP_ONLINE_IDLE; - - /* Unpark the stopper thread and the hotplug thread of this cpu */ - stop_machine_unpark(cpu); - kthread_unpark(st->thread); - - /* Should we go further up ? */ - if (st->target > CPUHP_AP_ONLINE_IDLE) - __cpuhp_kick_ap_work(st); - else - complete(&st->done); + complete(&st->done); } /* Require
Re: cpuidle and cpufreq coupling?
On 7/20/2017 3:56 PM, Florian Fainelli wrote: On 07/20/2017 07:45 AM, Peter Zijlstra wrote: Can your ARM part change OPP without scheduling? Because (for obvious reasons) the idle thread is not supposed to block. I think it should be able to do that, but I am not sure that if I went through the cpufreq API it would be that straightforward, so I may have to re-implement some of the frequency scaling logic outside of cpufreq (or rather make the low-level parts some kind of library I guess). I think I can safely mention that some of our non-upstream idle drivers in the past have invoked low level clock drivers to atomically switch CPUs to low frequency OPPs, with no interaction whatsoever with cpufreq. It was maintainable since both the idle and clock drivers were qcom-specific. However this is no longer necessary in recent designs and I really hope we never need to do this again... We didn't have to do a voltage switch - just PLL or mux work - so this was doable. I'm guessing your atomic switching also allows voltage reduction? If your architecture allows another CPU to change the entering-idle CPU's frequency, synchronization will be necessary as well - this is where it can get a bit tricky. Thanks, Vikram
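Purely as a hypothetical illustration of "atomically switch CPUs to low frequency OPPs" from an idle driver: every my_soc_* name below is made up, and the real code referred to above was SoC-specific, bypassed cpufreq entirely and only did PLL/mux work with no voltage changes. This is a sketch of the shape, not any actual driver.

#include <linux/cpuidle.h>

static int my_soc_enter_lpm(struct cpuidle_device *dev,
			    struct cpuidle_driver *drv, int index)
{
	/* Hypothetical helpers that must be safe with interrupts disabled */
	unsigned long saved_rate = my_soc_cpu_get_rate(dev->cpu);

	my_soc_cpu_set_rate_atomic(dev->cpu, MY_SOC_IDLE_SAFE_RATE);
	cpu_do_idle();					/* WFI */
	my_soc_cpu_set_rate_atomic(dev->cpu, saved_rate);

	return index;
}

If another CPU can also change this CPU's rate (for example a shared PLL), the two paths need their own atomic synchronization, which is the tricky part mentioned above.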
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
Hi Will, On 2017-08-25 12:48, Vikram Mulukutla wrote: Hi Will, On 2017-08-15 11:40, Will Deacon wrote: Hi Vikram, On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote: On 2017-07-31 06:13, Will Deacon wrote: >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote: >>On 2017-07-28 02:28, Will Deacon wrote: >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote: >>> >>This does seem to help. Here's some data after 5 runs with and without >>the >>patch. > >Blimey, that does seem to make a difference. Shame it's so ugly! Would you >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had >it set to 1 in the diff I posted, but that might be higher than >optimal. >It would be interesting to see if it correlates with num_possible_cpus() >for the highly contended case. > >Will Sorry for the late response - I should hopefully have some more data with different thresholds before the week is finished or on Monday. Did you get anywhere with the threshold heuristic? Will

Here's some data from experiments that I finally got to today. I decided to recompile for every value of the threshold. Was doing a binary search of sorts and then started reducing by orders of magnitude. There are pairs of rows here:

Well, here's something interesting. I tried a different platform and found that the workaround doesn't help much at all, similar to Qiao's observation on his b.L chipset. Something to do with the WFE implementation or event-stream? I modified your patch to use a __delay(1) in place of the WFEs and this was the result (still with the 10k threshold). The worst-case lock time for cpu0 drastically improves. Given that cpu0 re-enables interrupts between each lock attempt in my test case, I think the lock count matters less here.

cpu_relax() patch with WFEs (original workaround):
(pairs of rows: the first row of each pair is with c0 at 300MHz, the second with c0 at 1.9GHz. Both rows have cpu4 at 2.3GHz. Max time is in microseconds.)
------------------------------------------------------
c0 max time | c0 lock count | c4 max time | c4 lock count |
------------|---------------|-------------|---------------|
999843 | 25 | 2 | 12988498 |  -> c0/cpu0 at 300MHz
0 | 8421132 | 1 | 9152979 |  -> c0/cpu0 at 1.9GHz
------------------------------------------------------
999860 | 160 | 2 | 12963487 |
1 | 8418492 | 1 | 9158001 |
------------------------------------------------------
999381 | 734 | 2 | 12988636 |
1 | 8387562 | 1 | 9128056 |
------------------------------------------------------
989800 | 750 | 3 | 12996473 |
1 | 8389091 | 1 | 9112444 |
------------------------------------------------------

cpu_relax() patch with __delay(1):
(pairs of rows: the first row of each pair is with c0 at 300MHz, the second with c0 at 1.9GHz. Both rows have cpu4 at 2.3GHz. Max time is in microseconds.)
------------------------------------------------------
c0 max time | c0 lock count | c4 max time | c4 lock count |
------------|---------------|-------------|---------------|
7703 | 1532 | 2 | 13035203 |  -> c0/cpu0 at 300MHz
1 | 8511686 | 1 | 8550411 |  -> c0/cpu0 at 1.9GHz
------------------------------------------------------
7801 | 1561 | 2 | 13040188 |
1 | 8553985 | 1 | 8609853 |
------------------------------------------------------
3953 | 1576 | 2 | 13049991 |
1 | 8576370 | 1 | 8611533 |
------------------------------------------------------
3953 | 1557 | 2 | 13030553 |
1 | 8509020 | 1 | 8543883 |
------------------------------------------------------

I should also note that my earlier kernel was 4.9-stable based and the one above was on a 4.4-stable based kernel.

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
Hi Will, On 2017-08-15 11:40, Will Deacon wrote: Hi Vikram, On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote: On 2017-07-31 06:13, Will Deacon wrote: >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote: >>On 2017-07-28 02:28, Will Deacon wrote: >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote: >>> >>This does seem to help. Here's some data after 5 runs with and without >>the >>patch. > >Blimey, that does seem to make a difference. Shame it's so ugly! Would you >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had >it set to 1 in the diff I posted, but that might be higher than >optimal. >It would be interesting to see if it correlates with num_possible_cpus() >for the highly contended case. > >Will Sorry for the late response - I should hopefully have some more data with different thresholds before the week is finished or on Monday. Did you get anywhere with the threshold heuristic? Will

Here's some data from experiments that I finally got to today. I decided to recompile for every value of the threshold. Was doing a binary search of sorts and then started reducing by orders of magnitude. There are pairs of rows here:
Row1 is with cpu0 (little) at 300MHz and cpu4 at 1.9GHz
Row2 is with cpu0 (little) at 1.5GHz and cpu4 at 1.9GHz

It looks like even with the threshold set to 1, we don't hit the worst case of a single instance of locking taking a long time, probably a consequence of how the test is designed? However, as we reduce the threshold, the fairness in terms of how many times each CPU acquires the lock skews towards the big CPU, starting with threshold=500.

If I understand the code correctly, the upper 32 bits of an ARM64 virtual address will overflow when 1 is added to it, and so we'll keep WFE'ing on every subsequent cpu_relax invoked from the same PC, until we cross the hard-coded threshold, right?
CPU_RELAX_WFE_THRESHOLD = 5000 and 2500 (very similar results)
==============================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
2 | 2763059 | 2 | 7323169
0 | 11477590 | 1 | 12110373
3 | 2762788 | 2 | 7329881
1 | 11557987 | 1 | 12042980
3 | 2765912 | 2 | 7308789
1 | 11470294 | 1 | 12120074
3 | 2761793 | 2 | 7333907
1 | 11431841 | 1 | 12155046
3 | 2762402 | 2 | 7328843
1 | 11495705 | 1 | 12096518
3 | 2764392 | 2 | 7308640
1 | 11479146 | 1 | 12111419

CPU_RELAX_WFE_THRESHOLD = 500
=============================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3 | 2338277 | 2 | 10052592
1 | 6963131 | 1 | 18103639
3 | 2337982 | 2 | 10037188
1 | 6979396 | 1 | 18082811
3 | 2337282 | 2 | 10052184
0 | 6954990 | 1 | 18109860
3 | 2338737 | 2 | 10039556
1 | 7185046 | 1 | 17809240
4 | 2338857 | 2 | 10027407
1 | 6958274 | 1 | 18111394
4 | 2340208 | 2 | 10031173
0 | 7097088 | 1 | 17921861

CPU_RELAX_WFE_THRESHOLD = 50
============================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
4 | 1219792 | 2 | 18005180
0 | 1252767 | 1 | 25296935
4 | 1219312 | 2 | 18049566
1 | 1252625 | 1 | 25227292
4 | 1219884 | 2 | 18020775
1 | 1252363 | 1 | 25298387
4 | 1220862 | 2 | 18012062
1 | 1251827 | 1 | 25283787
4 | 1220489 | 2 | 18010055
0 | 1251729 | 1 | 25272917
3 | 1220088 | 2 | 18027279
0 | 1253264 | 1 | 25268834

CPU_RELAX_WFE_THRESHOLD = 10
============================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3 | 298604 | 1 | 23784805
0 | 293511 | 1 | 24604172
3 | 294707 | 2 | 23857487
0 | 292519 | 1 | 24564209
4 | 294199 | 1 | 23832180
0 | 293840 | 1 | 24593323
4 | 294314 | 1 | 23853353
0 | 293609 | 1 | 24635190
4 | 293802 | 1 | 23836764
0 | 293322 | 1 | 24553212
3 | 293658 | 1 | 23889801
0 | 292663 | 1 | 24552118

CPU_RELAX_WFE_THRESHOLD = 5
===========================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3 | 173061 | 1 | 22332479
0 | 173759 | 1 | 23774009
3 | 174471 | 1 | 22342362
0 | 173161 | 1 | 23814466
3 | 173851 | 2 | 22235422
0 | 172734 | 1 | 23705848
2 | 173452 | 1 | 22255166
0 | 172830 | 1 | 23824301
2 | 173028 | 1 | 22390297
0 | 172336 | 1 | 23836407
3 | 172968 | 1 | 22285954
0 | 173207 | 1 | 23844900

CPU_RELAX_WFE_THRESHOLD = 1
===========================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
2 | 64245 | 1 | 6266848
0 | 77117 | 1 | 209
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
On 2017-08-25 12:48, Vikram Mulukutla wrote: If I understand the code correctly, the upper 32 bits of an ARM64 virtual address will overflow when 1 is added to it, and so we'll keep WFE'ing on every subsequent cpu_relax invoked from the same PC, until we cross the hard-coded threshold, right? Oops, misread that. Second time we enter cpu_relax from the same PC, we do a WFE. Then we stop doing the WFE until we hit the threshold using the per-cpu counter. So with a higher threshold, we wait for more cpu_relax() calls before starting the WFE again. So a lower threshold implies we should hit WFE branch sooner. It seems that since my test keeps the while loop going for a full 5 seconds, a lower threshold will obviously result in more WFEs and lower the lock-acquired-count. I guess we want a high threshold but not so high that the little CPU has to wait a while before the big CPU counts up to the threshold, is that correct? Thanks, Vikram -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
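To make the heuristic concrete, here is my reading of it in code form. Will's actual diff is not quoted in this thread, so treat this purely as an illustrative reconstruction - the names, the this_cpu accessors and the reset behaviour are my own choices, and the 10000 only echoes the "10k threshold" mentioned earlier - not his patch.

#include <linux/percpu.h>

#define CPU_RELAX_WFE_THRESHOLD		10000

static DEFINE_PER_CPU(unsigned long, relax_last_pc);
static DEFINE_PER_CPU(unsigned long, relax_count);

static inline void cpu_relax_backoff(void)
{
	unsigned long pc = (unsigned long)__builtin_return_address(0);

	if (pc == this_cpu_read(relax_last_pc)) {
		unsigned long cnt = this_cpu_inc_return(relax_count);

		/*
		 * Second call in a row from the same call site, or the
		 * per-cpu counter crossed the threshold: wait in WFE. The
		 * periodic event stream bounds how long we sleep here.
		 */
		if (cnt == 1 || cnt >= CPU_RELAX_WFE_THRESHOLD) {
			asm volatile("wfe" ::: "memory");
			if (cnt >= CPU_RELAX_WFE_THRESHOLD)
				this_cpu_write(relax_count, 0);
		}
	} else {
		/* New call site: remember it and start counting again */
		this_cpu_write(relax_last_pc, pc);
		this_cpu_write(relax_count, 0);
	}

	asm volatile("yield" ::: "memory");	/* what arm64 cpu_relax() normally does */
}

The __delay(1) experiment reported earlier in the thread would amount to replacing the wfe in the back-off path above with a short __delay() spin.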
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
Hi Will, On 2017-07-31 06:13, Will Deacon wrote: Hi Vikram, On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote: On 2017-07-28 02:28, Will Deacon wrote: >On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote: > This does seem to help. Here's some data after 5 runs with and without the patch. Blimey, that does seem to make a difference. Shame it's so ugly! Would you be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had it set to 1 in the diff I posted, but that might be higher than optimal. It would be interested to see if it correlates with num_possible_cpus() for the highly contended case. Will Sorry for the late response - I should hopefully have some more data with different thresholds before the week is finished or on Monday. Thanks, Vikram -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
Hi Qiao, On 2017-08-01 00:37, qiaozhou wrote: On 2017-07-31 19:20, qiaozhou wrote:

I also applied Vikram's patch and ran a test. cpu2: a53, 832MHz, cpu7: a73, 1.75GHz

Without the cpu_relax bodging patch:
======================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
16505 | 5243 | 2 | 12487322 |
16494 | 5619 | 1 | 12013291 |
16498 | 5276 | 2 | 11706824 |
16494 | 7123 | 1 | 12532355 |
16470 | 7208 | 2 | 11784617 |
======================================================

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
With the cpu_relax bodging patch:
======================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
3991 | 140714 | 1 | 11430528 |
4018 | 144371 | 1 | 11430528 |
4034 | 143250 | 1 | 11427011 |
4330 | 147345 | 1 | 11423583 |
4752 | 138273 | 1 | 11433241 |
======================================================

It shows some improvement, but not as good as Vikram's data. The big core still has a much better chance of acquiring the lock. Thanks, Vikram

Thanks for your data! I'll check on one of our other platforms to see if I see similar behavior. This may have something to do with the event-stream on your platform or the A53 revision as Sudeep pointed out here [1] - something to check I suppose...

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/21/458

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
On 2017-07-28 02:28, Will Deacon wrote: On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote: I think we should have this discussion now - I brought this up earlier [1] and I promised a test case that I completely forgot about - but here it is (attached). Essentially a Big CPU in an acquire-check-release loop will have an unfair advantage over a little CPU concurrently attempting to acquire the same lock, in spite of the ticket implementation. If the Big CPU needs the little CPU to make forward progress: livelock. One solution was to use udelay(1) in such loops instead of cpu_relax(), but that's not very 'relaxing'. I'm not sure if there's something we could do within the ticket spin-lock implementation to deal with this. Does bodging cpu_relax to back-off to wfe after a while help? The event stream will wake it up if nothing else does. Nasty patch below, but I'd be interested to know whether or not it helps. Will

This does seem to help. Here's some data after 5 runs with and without the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 300MHz, cpu4: Big cpu @ 2.0GHz
Without the cpu_relax() bodging patch:
======================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
117893us | 2349144 | 2us | 6748236 |
571260us | 2125651 | 2us | 7643264 |
19780us | 2392770 | 2us | 5987203 |
19948us | 2395413 | 2us | 5977286 |
19822us | 2429619 | 2us | 5768252 |
19888us | 2444940 | 2us | 5675657 |
======================================================

cpu0: little cpu @ 300MHz, cpu4: Big cpu @ 2.0GHz
With the cpu_relax() bodging patch:
======================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3us | 2737438 | 2us | 6907147 |
2us | 2742478 | 2us | 6902241 |
132us | 2745636 | 2us | 6876485 |
3us | 2744554 | 2us | 6898048 |
3us | 2741391 | 2us | 6882901 |
======================================================

The patch also seems to have helped with fairness in general allowing more work to be done if the CPU frequencies are more closely matched (I don't know if this translates to real world performance - probably not). The counter values are higher with the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @ 2.0GHz
Without the cpu_relax() bodging patch:
======================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
2us | 5240654 | 1us | 5339009 |
2us | 5287797 | 97us | 5327073 |
2us | 5237634 | 1us | 5334694 |
2us | 5236676 | 88us | 5333582 |
84us | 5285880 | 84us | 5329489 |
======================================================

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @ 2.0GHz
With the cpu_relax() bodging patch:
======================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
140us | 10449121 | 1us | 11154596 |
1us | 10757081 | 1us | 11479395 |
83us | 10237109 | 1us | 10902557 |
2us | 9871101 | 1us | 10514313 |
2us | 9758763 | 1us | 10391849 |
======================================================

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
On 2017-07-28 02:28, Peter Zijlstra wrote: On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote: I think we should have this discussion now - I brought this up earlier [1] and I promised a test case that I completely forgot about - but here it is (attached). Essentially a Big CPU in an acquire-check-release loop will have an unfair advantage over a little CPU concurrently attempting to acquire the same lock, in spite of the ticket implementation. If the Big CPU needs the little CPU to make forward progress : livelock. This needs to be fixed in hardware. There really isn't anything the software can sanely do about it. It also doesn't have anything to do with the spinlock implementation. Ticket or not, its a fundamental problem of LL/SC. Any situation where we use atomics for fwd progress guarantees this can happen. Agreed, it seems like trying to build a fair SW protocol over unfair HW. But if we can minimally change such loop constructs to address this (all instances I've seen so far use cpu_relax) it would save a lot of hours spent debugging these problems. Lot of b.L devices out there :-) It's also possible that such a workaround may help contention performance since the big CPU may have to wait for say a tick before breaking out of that loop (the non-livelock scenario where the entire loop isn't in a critical section). The little core (or really any core) should hold on to the locked cacheline for a while and not insta relinquish it. Giving it a chance to reach the SC. Thanks, Vikram -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
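For concreteness, the loop construct being discussed throughout this thread has the following shape - a minimal illustration, not code from any particular kernel path (the timer case above is one real instance):

#include <linux/types.h>
#include <linux/spinlock.h>

/*
 * Acquire-check-release retry loop. If the big CPU runs this with
 * interrupts disabled and cond() can only become true after the little
 * CPU has taken 'lock', the little CPU can starve on an LL/SC machine
 * because the big CPU keeps re-winning the cache line in the short
 * window between unlock and the next lock.
 */
static void retry_until_cond(raw_spinlock_t *lock, bool (*cond)(void))
{
	for (;;) {
		raw_spin_lock(lock);
		if (cond()) {
			raw_spin_unlock(lock);
			return;
		}
		raw_spin_unlock(lock);
		cpu_relax();	/* the only window the little CPU gets */
	}
}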
Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
cc: Sudeep Holla

On 2017-07-26 18:29, qiaozhou wrote: On 2017-07-26 22:16, Thomas Gleixner wrote: On Wed, 26 Jul 2017, qiaozhou wrote: Cc'ed ARM folks. For that particular timer case we can clear base->running_timer w/o the lock held (see patch below), but this kind of lock -> test -> unlock -> retry loops are all over the place in the kernel, so this is going to hurt you sooner than later in some other place. It's true. This is the way spinlocks are normally and widely used in the kernel. I'll also ask ARM experts whether we can do something to avoid or reduce the chance of such an issue. ARMv8.1 has a single instruction (ldadda) to replace the ldaxr/stxr loop. Hopefully it can improve things and reduce the chance.

I think we should have this discussion now - I brought this up earlier [1] and I promised a test case that I completely forgot about - but here it is (attached). Essentially a Big CPU in an acquire-check-release loop will have an unfair advantage over a little CPU concurrently attempting to acquire the same lock, in spite of the ticket implementation. If the Big CPU needs the little CPU to make forward progress: livelock.

We've run into the same loop construct in other spots in the kernel, and the reason that a real symptom is so rare is that the retry-loop on the 'Big' CPU needs to be interrupted just once by, say, an IRQ/FIQ and the live-lock is broken. If the entire retry loop is within an interrupt-disabled critical section then the odds of live-locking are much higher. An example of the problem on a previous kernel is here [2]. Changes to the workqueue code since may have fixed this particular instance.

One solution was to use udelay(1) in such loops instead of cpu_relax(), but that's not very 'relaxing'. I'm not sure if there's something we could do within the ticket spin-lock implementation to deal with this.

Note that I ran my test on a 4.9 kernel so that didn't include any spinlock implementation changes since then. The test schedules two threads, one on a big CPU and one on a little CPU. The big CPU thread does the lock/unlock/retry loop for a full 1 second with interrupts disabled, while the little CPU attempts to acquire the same lock in a similar loop but enabling interrupts after every successful lock+unlock. With unfairness, the little CPU may take up to 1 second (or several milliseconds at the least) just to acquire the lock once. This varies depending on the IPC difference and frequencies of the big and little ARM64 CPUs:

Big cpu frequency | Little cpu frequency | Max time taken by little to acquire lock
2GHz | 1.5GHz | 133 microseconds
2GHz | 300MHz | 734 milliseconds

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/17/934
[2] - https://goo.gl/uneFjt

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

From 51d6186b620a9e354a0d40af06aef1c1299ca223 Mon Sep 17 00:00:00 2001
From: Vikram Mulukutla
Date: Thu, 27 Jul 2017 12:14:48 -0700
Subject: [PATCH] measure spinlock fairness across differently capable CPUs

How to run this test:
1) compile and boot
2) echo 1 > /sys/module/test/parameters/run_test
3) sleep 5
4) echo 0 > /sys/module/test/parameters/run_test

The test schedules two threads on two separate CPUs and attempts to acquire the same spinlock from both threads with a certain loop construct. (it assumes cpu0 is 'Little' and cpu4 is 'Big'.
This can be changed in the code) After running the test, check these counters: cat /sys/module/test/parameters/big_time_us cat /sys/module/test/parameters/little_time_us If unfairness exists, little_time should be close to 1 second or a fairly large millisecond value. test.c has comments that explain why this is so. Signed-off-by: Vikram Mulukutla --- kernel/sched/Makefile | 2 +- kernel/sched/test.c | 204 ++ 2 files changed, 205 insertions(+), 1 deletion(-) create mode 100644 kernel/sched/test.c diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index f6cce95..542a1c7 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -15,7 +15,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y) CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer endif -obj-y += core.o loadavg.o clock.o cputime.o +obj-y += core.o loadavg.o clock.o cputime.o test.o obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o obj-y += wait.o swait.o completion.o idle.o obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o energy.o sched_avg.o diff --git a/kernel/sched/test.c b/kernel/sched/test.c new file mode 100644 index 000..5dd3b0d --- /dev/null +++ b/kernel/sched/test.c @@ -0,0 +1,204 @@ +/* Copyright (c) 2014-2016, The Linux Foundation. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of