[PATCH] lib: spinlock_debug: Avoid livelock in do_raw_spin_lock

2012-09-13 Thread Vikram Mulukutla
The logic in do_raw_spin_lock attempts to acquire a spinlock
by invoking arch_spin_trylock in a loop with a delay between
each attempt. Now consider the following situation in a 2
CPU system:

1. CPU-0 continually acquires and releases a spinlock in a
   tight loop; it stays in this loop until some condition X
   is satisfied. X can only be satisfied by another CPU.

2. CPU-1 tries to acquire the same spinlock, in an attempt
   to satisfy the aforementioned condition X. However, it
   never sees the unlocked value of the lock because the
   debug spinlock code uses trylock instead of just lock;
   it checks at all the wrong moments - whenever CPU-0 has
   locked the lock.

Now in the absence of debug spinlocks, the architecture-specific
spinlock code can correctly allow CPU-1 to wait in a "queue"
(e.g., ticket spinlocks), ensuring that it acquires the lock at
some point. However, with the debug spinlock code, livelock
can easily occur due to the use of trylock, which obviously
cannot put the CPU in that "queue". This queueing mechanism is
implemented in both x86 and ARM spinlock code.

Note that the situation mentioned above is not hypothetical.
A real problem was encountered where CPU-0 was running
hrtimer_cancel with interrupts disabled, and CPU-1 was attempting
to run the hrtimer that CPU-0 was trying to cancel.

Address this by actually attempting arch_spin_lock once it is
suspected that there is a spinlock lockup. If we're in a
situation that is described above, the arch_spin_lock should
succeed; otherwise other timeout mechanisms (e.g., watchdog)
should alert the system of a lockup. Therefore, if there is
a genuine system problem and the spinlock can't be acquired,
the end result (irrespective of this change being present)
is the same. If there is a livelock caused by the debug code,
this change will allow the lock to be acquired, depending on
the implementation of the lower level arch specific spinlock
code.
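
For reference, a minimal sketch of the pattern described above (condition_x
and the function names are hypothetical; only the locking and IRQ calls are
real kernel API):

#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(s);
static int condition_x;

/* CPU-0: acquires and releases 's' in a tight loop with IRQs off,
 * until CPU-1 sets condition_x. */
static void cpu0_wait_for_x(void)
{
	unsigned long flags;

	local_irq_save(flags);
	while (!condition_x) {
		raw_spin_lock(&s);
		/* short critical section */
		raw_spin_unlock(&s);
		cpu_relax();
	}
	local_irq_restore(flags);
}

/* CPU-1: must take 's' to set condition_x. With CONFIG_DEBUG_SPINLOCK,
 * raw_spin_lock() ends up in __spin_lock_debug()'s trylock loop and
 * only ever samples the lock while CPU-0 happens to hold it. */
static void cpu1_set_x(void)
{
	raw_spin_lock(&s);
	condition_x = 1;
	raw_spin_unlock(&s);
}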

Signed-off-by: Vikram Mulukutla 
---
 lib/spinlock_debug.c |   32 ++--
 1 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/lib/spinlock_debug.c b/lib/spinlock_debug.c
index eb10578..94624e1 100644
--- a/lib/spinlock_debug.c
+++ b/lib/spinlock_debug.c
@@ -107,23 +107,27 @@ static void __spin_lock_debug(raw_spinlock_t *lock)
 {
u64 i;
u64 loops = loops_per_jiffy * HZ;
-   int print_once = 1;
 
-   for (;;) {
-   for (i = 0; i < loops; i++) {
-   if (arch_spin_trylock(&lock->raw_lock))
-   return;
-   __delay(1);
-   }
-   /* lockup suspected: */
-   if (print_once) {
-   print_once = 0;
-   spin_dump(lock, "lockup suspected");
+   for (i = 0; i < loops; i++) {
+   if (arch_spin_trylock(&lock->raw_lock))
+   return;
+   __delay(1);
+   }
+   /* lockup suspected: */
+   spin_dump(lock, "lockup suspected");
 #ifdef CONFIG_SMP
-   trigger_all_cpu_backtrace();
+   trigger_all_cpu_backtrace();
 #endif
-   }
-   }
+
+   /*
+* In case the trylock above was causing a livelock, give the lower
+* level arch specific lock code a chance to acquire the lock. We have
+* already printed a warning/backtrace at this point. The non-debug arch
+* specific code might actually succeed in acquiring the lock. If it is
+* not successful, the end-result is the same - there is no forward
+* progress.
+*/
+   arch_spin_lock(&lock->raw_lock);
 }
 
 void do_raw_spin_lock(raw_spinlock_t *lock)
-- 
1.7.8.3

The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation



Re: [PATCH v9 1/5] firmware: add extensible driver data params

2017-06-20 Thread Vikram Mulukutla


Hi Greg, Luis,

On 2017-06-17 12:38, Greg KH wrote:

On Tue, Jun 13, 2017 at 09:40:11PM +0200, Luis R. Rodriguez wrote:

On Tue, Jun 13, 2017 at 11:05:48AM +0200, Greg KH wrote:
> On Mon, Jun 05, 2017 at 02:39:33PM -0700, Luis R. Rodriguez wrote:
> > As the firmware API evolves we keep extending functions with more arguments.
> > Stop this nonsense by providing an extensible data structure which can be used
> > to represent both user parameters and private internal parameters.
>
> Let's take a simple C function interface and make it a more complex
> data-driven interface that is impossible to understand and obviously
> understand how it is to be used and works!

The firmware codebase was already complex!


Heh, I'm not arguing with you there :)

What you have to ask yourself really is if this makes it *less complex* and
helps *clean things up* in a much better way than it was before. Also does it
allow us to *pave the way for new functionality easily*, without creating
further mess?


I agree, that's what I'm saying here.  I just do not see that happening
with your patch set at all.  It's adding more code, a more complex way
to interact with the subsystem, and not making driver writer lives any
easier at all that I can see.

Again, the code is now bigger, does more, with not even any real benefit
for existing users.


If not, what concrete alternatives do you suggest?


It's working, so leave it alone?  :)



So I wanted to note here that I had a similar internal discussion with Stephen
Boyd when he first upstreamed the request_firmware_into_buf API. I actually did
change things a bit to pass around a 'desc' structure with all the flags and
parameters in our internal vendor tree [1]. This is a sort of an ultra-lite
version of what Luis is trying to do - extensibility - but does not change the
API for driver owners. Existing APIs become wrappers around the new API. My
primary reason then was that the number of things being passed around between
layers of functions inside firmware_class seemed a bit untenable.
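
Roughly, the shape of that rework is the following (a sketch with hypothetical
names - the real msm-3.18 patch is linked below as [1]):

#include <linux/device.h>
#include <linux/firmware.h>

/* Internal-only descriptor; drivers never see it. */
struct fw_desc {
	const char *name;
	struct device *dev;
	void *dest_buf;
	size_t dest_size;
	unsigned int opt_flags;
};

static int __request_firmware_desc(const struct firmware **fw,
				   struct fw_desc *desc);

/* The existing driver-facing API becomes a thin wrapper. */
int request_firmware(const struct firmware **fw, const char *name,
		     struct device *dev)
{
	struct fw_desc desc = {
		.name = name,
		.dev  = dev,
	};

	return __request_firmware_desc(fw, &desc);
}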

Just like Greg, Stephen also brought up the argument about why the unnecessary
complication to the API without any measurable benefit - this seemed a fair
argument to me at that time. But here's my point - if Luis agrees that _this_
patchset is going slightly Mjolnir, perhaps a light internal rework inside
firmware_class might be more suitable towards the extensibility goal while
keeping driver API usage as simple as it is today (even if you hate my patch
for various reasons)?

Thanks,
Vikram

[1] - 
https://source.codeaurora.org/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=msm-3.18&id=7aa7efd3c150840369739893a84bd1d9f9774319






Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-12 Thread Vikram Mulukutla

Hi,

I came across a piece of engineering code that looked like:

preempt_disable();
/* --cut, lots of code-- */
preempt_enable_no_resched();
put_user()
preempt_disable();

(If you wish to seriously question the usage of the preempt API in this 
manner, I unfortunately have no comment since I didn't write the code.)


This particular block of code was causing lockups and crashes on a 
certain ARM64 device. The generated assembly revealed that the compiler 
was simply optimizing out the increment and decrement of the preempt 
count, allowing put_user to run with preemption still disabled, causing all
sorts of badness. Since put_user doesn't actually access the preempt
count and translates to just a few instructions without any branching, I 
suppose that the compiler figured it was OK to optimize.


The immediate solution is to add a compiler barrier to the code above, 
but should sched_preempt_enable_no_resched have an additional compiler 
barrier after (has one before already) the preempt-count decrement to 
prevent this sort of thing?
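
Concretely, the question amounts to something like the following change to the
generic definition (shown as a sketch, not a submitted patch; the existing
macro is quoted later in this thread):

#define sched_preempt_enable_no_resched() \
do { \
	barrier(); \
	preempt_count_dec(); \
	barrier(); /* proposed: prevent merging with a later preempt_count_inc() */ \
} while (0)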


Thanks,
Vikram

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum, a Linux Foundation Collaborative Project


Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-13 Thread Vikram Mulukutla

On 5/13/2016 7:58 AM, Peter Zijlstra wrote:

On Thu, May 12, 2016 at 11:39:47PM -0700, Vikram Mulukutla wrote:

Hi,

I came across a piece of engineering code that looked like:

preempt_disable();
/* --cut, lots of code-- */
preempt_enable_no_resched();
put_user()
preempt_disable();

(If you wish to seriously question the usage of the preempt API in this
manner, I unfortunately have no comment since I didn't write the code.)


I'm with Thomas here, that's broken and should not be done.


Ok. I did in fact zero in on this code by replacing each instance of 
preempt_enable_no_resched with preempt_enable one by one (there were 
several uses in the driver). I will ask the original developer to 
consider using preempt_enable.





This particular block of code was causing lockups and crashes on a certain
ARM64 device. The generated assembly revealed that the compiler was simply
optimizing out the increment and decrement of the preempt count, allowing
put_user to run without preemption enabled, causing all sorts of badness.
Since put_user doesn't actually access the preempt count and translates to
just a few instructions without any branching, I suppose that the compiler
figured it was OK to optimize.

The immediate solution is to add a compiler barrier to the code above, but
should sched_preempt_enable_no_resched have an additional compiler barrier
after (has one before already) the preempt-count decrement to prevent this
sort of thing?


I think the below would be sufficient; IIRC the compiler may not combine
or elide volatile operations.

---
  include/asm-generic/preempt.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@

  static __always_inline int preempt_count(void)
  {
-   return current_thread_info()->preempt_count;
+   return READ_ONCE(current_thread_info()->preempt_count);
  }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
  {
	return &current_thread_info()->preempt_count;
  }



Thanks Peter, this patch worked for me. The compiler no longer optimizes 
out the increment/decrement of the preempt_count.


Thanks,
Vikram



Re: Additional compiler barrier required in sched_preempt_enable_no_resched?

2016-05-14 Thread Vikram Mulukutla

On 5/14/2016 8:39 AM, Thomas Gleixner wrote:

On Fri, 13 May 2016, Vikram Mulukutla wrote:

On 5/13/2016 7:58 AM, Peter Zijlstra wrote:

diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@

   static __always_inline int preempt_count(void)
   {
-   return current_thread_info()->preempt_count;
+   return READ_ONCE(current_thread_info()->preempt_count);
   }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
   {
	return &current_thread_info()->preempt_count;
   }



Thanks Peter, this patch worked for me. The compiler no longer optimizes out
the increment/decrement of the preempt_count.


I have a hard time to understand why the compiler optimizes out stuff w/o that
patch.

We already have:

#define preempt_disable() \
do { \
 preempt_count_inc(); \
 barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
 barrier(); \
 preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
 barrier(); \
 if (unlikely(preempt_count_dec_and_test())) \
 __preempt_schedule(); \
} while (0)

So the barriers already forbid that the compiler reorders code. How on earth
is the compiler supposed to optimize the dec/inc out?



While I cannot claim more than a very rudimentary knowledge of 
compilers, this was the initial reaction that I had as well. But then 
the barriers are in the wrong spots for the way the code was used in the 
driver in question. preempt_disable has the barrier() _after_ the 
increment, and sched_preempt_enable_no_resched has it _before_ the 
decrement. Therefore, if one were to use preempt_enable_no_resched 
followed by preempt_disable, there is no compiler barrier between the 
decrement and the increment statements. Whether this should translate to 
such a seemingly aggressive optimization is something I'm not completely 
sure of.
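
Spelled out, the macro expansion of that driver sequence looks roughly like
this (a schematic sketch; 'val' and 'uptr' are placeholder arguments for
put_user):

barrier();              /* from preempt_enable_no_resched() */
preempt_count_dec();
                        /* no barrier on either side of put_user() ... */
put_user(val, uptr);
preempt_count_inc();    /* from preempt_disable() */
barrier();
                        /* ... so the dec/inc pair may legally be combined */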



There is more code than the preempt stuff depending on barrier() and expecting
that the compiler does not optimize out things at will. Are we supposed to
hunt all occurences and amend them with READ_ONCE or whatever one by one?



I think the barrier is working as intended for the sequence of 
preempt_disable followed by preempt_enable_no_resched.



Thanks,

tglx





As a simple experiment I used gcc on x86 on the following C program. 
This was really to convince myself rather than you and Peter!


#include <stdio.h>

#define barrier() __asm__ __volatile__("": : :"memory")

struct thread_info {
int pc;
};

#define preempt_count() \
ti_ptr->pc

#define preempt_disable() \
do \
{ \
preempt_count() += 1; \
barrier(); \
} \
while (0)

#define preempt_enable() \
do \
{ \
barrier(); \
preempt_count() -=  1; \
} \
while (0)

struct thread_info *ti_ptr;

int main(void)
{
struct thread_info ti;
ti_ptr = &ti;
int b = 0;  /* initialized so the example is well-defined; same codegen */

preempt_enable();
b = b + 500;
preempt_disable();

printf("b = %d\n", b);

return 0;
}

With gcc -O2 I get this (the ARM compiler behaves similarly):

0000000000400470 <main>:
  400470:   48 83 ec 18 sub$0x18,%rsp
  400474:   48 89 25 cd 0b 20 00mov%rsp,0x200bcd(%rip)
  40047b:   ba f4 01 00 00  mov$0x1f4,%edx
  400480:   be 14 06 40 00  mov$0x400614,%esi
  400485:   bf 01 00 00 00  mov$0x1,%edi
  40048a:   31 c0   xor%eax,%eax
  40048c:   e8 cf ff ff ff  callq  400460 <__printf_chk@plt>
  400491:   31 c0   xor%eax,%eax
  400493:   48 83 c4 18 add$0x18,%rsp
  400497:   c3  retq

If I swap preempt_enable and preempt_disable I get this:

0000000000400470 <main>:
  400470:   48 83 ec 18 sub$0x18,%rsp
  400474:   48 89 25 cd 0b 20 00mov%rsp,0x200bcd(%rip)
  40047b:   83 04 24 01 addl   $0x1,(%rsp)
  40047f:   48 8b 05 c2 0b 20 00mov0x200bc2(%rip),%rax
  400486:   ba f4 01 00 00  mov$0x1f4,%edx
  40048b:   be 14 06 40 00  mov$0x400614,%esi
  400490:   bf 01 00 00 00  mov$0x1,%edi
  400495:   83 28 01subl   $0x1,(%rax)
  400498:   31 c0   xor%eax,%eax
  40049a:   e8 c1 ff ff ff  callq  400460 <__printf_chk@plt>
  40049f:   31 c0   xor%eax,%eax
  4004a1:   48 83 c4 18 add$0x18,%rsp
  4004a5:   c3  retq

Note: If I place ti_ptr on the stack, the ordering/barriers no longer 
matter, the inc/dec is always optimized out. I suppose the compiler does 
treat current_thread_info as coming from a dif

[PATCH] tracing: Fix unmapping loop in tracing_mark_write

2014-12-17 Thread Vikram Mulukutla
Commit 6edb2a8a385f0cdef51dae37ff23e74d76d8a6ce introduced
an array map_page that contains the addresses returned by
kmap_atomic. However, when unmapping those pages, map_page[0]
is unmapped before map_page[1], breaking the nesting requirement
as specified in the documentation for kmap_atomic/kunmap_atomic.

This was caught by the highmem debug code present in kunmap_atomic.
Fix the loop to do the unmapping properly.
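
For reference, the nesting rule that the fix restores looks like this (an
illustrative fragment; page_a/page_b are hypothetical):

void *a = kmap_atomic(page_a);
void *b = kmap_atomic(page_b);
/* ... use both mappings ... */
kunmap_atomic(b);   /* most recently mapped first (LIFO) */
kunmap_atomic(a);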

Reviewed-by: Stephen Boyd 
Reported-by: Lime Yang 
Signed-off-by: Vikram Mulukutla 
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ab76b7b..bceed34 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4931,7 +4931,7 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
*fpos += written;
 
  out_unlock:
-   for (i = 0; i < nr_pages; i++){
+   for (i = nr_pages - 1; i >= 0; i--) {
kunmap_atomic(map_page[i]);
put_page(pages[i]);
}
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking

2016-10-28 Thread Vikram Mulukutla

We propose Window-Assisted Load Tracking (WALT) as an alternative or additional
load tracking scheme in lieu of or along with PELT, one that in our estimation
better tracks task demand and CPU utilization especially for use cases on
mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide better
perf-per-watt numbers on asymmetric CPU (frequency and/or IPC) implementations,
(specifically on Qualcomm Snapdragon chipsets running Android) and its metrics
have been used to guide task placement and p-state selection (load balancing
in CFS still uses PELT statistics). WALT is now present in the android-common
kernel as well. This RFC is not an attempt to supplant PELT statistics; this is
more about starting a discussion; perhaps one that may result in changes to
PELT to address the requirements listed below.

Background
__

1) Classifying a task as "heavy" faster helps in mobile world use-cases such as
scrolling a UI or browsing a web page where tasks exhibit sporadically heavy
load, allowing for instance, faster migration to a more capable CPU (assuming
the scheduler uses task util and CPU capacity as inputs during task placement
decision making, e.g EAS). Reclassification of a light task as heavy is also
important - for example, a rendering thread may change its demand depending on
what is being shown on-screen. We believe that quickly associating a task with
its necessary resources requires a more dynamic demand or utilization signal
than provided by PELT, which is subject to the geometric series and does not
differentiate between task ramp-up and ramp-down. E.g., with a 32ms half-life,
PELT takes ~74ms to reach 80% nice-0-load_avg/util_avg for a continuously
executing 100% task, and ~139ms to reach 95% (see the arithmetic check after
this list).

2) Historical task demand (even if a task sleeps for say 300ms due to network
delay) is required to restore both CPU and frequency resources to meet
performance throughput as well. PELT with a 32ms 'half-life' takes just ~213ms
to decay utilization/load to (effectively) zero, forgetting that task's history,
requiring the task to re-execute its workload to gain that load_avg/util_avg
again.

3) There is a need to quickly ramp up CPU frequency as a result of heavy CPU
utilization, and to reduce frequency whenever possible to save power.
Consider as an example a single 100% thread running continuously on a single
core using schedutil with a 10ms transition latency. On an i7 8-core SMP
machine with SMT, PELT takes 60-70ms to ramp to FMAX from FMIN.

4) PELT's blocked util decay implies that the underlying cpufreq governor would
have to delay dropping frequency for longer than necessary (especially for
mobile world use cases). PELT with a 32ms 'half-life' takes ~108ms to reduce
the runqueue utilization/nice-0-load from max to 10%. Blocked load/util tracking
is of course a conservative approach recognizing that load might return to
the runqueue, but one that we feel is inflexible when it comes to certain real
world use cases.
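
The ramp and decay figures quoted above follow from the 32ms geometric
half-life; a quick userspace check (a sketch assuming an idealized
continuous-time series, which is why the results land within a couple of ms of
the quoted numbers):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double hl = 32.0; /* assumed PELT half-life in ms */

	/* A continuously running 100% task ramps as 1 - 0.5^(t/hl); solve for t. */
	printf("ramp to 80%%:   ~%.0f ms\n", hl * log2(1.0 / (1.0 - 0.80)));
	printf("ramp to 95%%:   ~%.0f ms\n", hl * log2(1.0 / (1.0 - 0.95)));

	/* Blocked utilization decays as 0.5^(t/hl); solve for t. */
	printf("decay to 10%%:  ~%.0f ms\n", hl * log2(1.0 / 0.10));
	printf("decay to ~1%%:  ~%.0f ms\n", hl * log2(1.0 / 0.01));

	return 0;	/* build with: gcc pelt_check.c -lm */
}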

WALT vs PELT


WALT retains the per-entity properties of PELT; demand and utilization are
accounted for at the task level, and a CPU rq's utilization contribution is
the sum of its children's contributions. Wallclock time, however, is divided
into a series of windows - to be clear about what WALT tracks exactly, we
introduce the terms task demand and CPU busy-time here and show the benefits
versus PELT that apply to the four points mentioned above:

1) Task Demand - The contribution of a task's running and runnable time to a
window, averaged over N sliding windows. Note that a window has to be
*completed* before its contribution is considered. A task's demand is the
maximum of its contribution to the most recently completed window and its
average demand over the past N windows. This metric can be roughly thought of as
an unweighted load_avg without blocked load.
WALT can take as little as one window of execution to report max task demand,
although with freq and capacity invariance, this tends to be about 3 windows
with a good frequency governor (so that's about 30ms with a 10ms window). This
allows faster and more accurate task placement decisions.

2) WALT "forgets" blocked time entirely, i.e. the sliding windows exist only
when the task is on the runqueue or running. This allows rapid reclassification
of the same task as heavy after a short sleep. Thus a task does not need to
re-execute to gain its demand once more and can be migrated up to a big CPU
immediately.

3) CPU busy time - The sum of execution times of all tasks in the most recently
completed window. For the same machine described above and with schedutil, WALT
with a 10ms window takes around 30-50ms to ramp to FMAX, and with a 20ms window
takes between 60-100ms (note that it is assumed that schedutil has a transition
latency equal to or greater than the window size).

4) WALT "forgets" cpu utilization as soon as tasks are taken off of the
runqueue, and thus the cpufreq governor can choose to drop frequ

[RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes

2016-10-28 Thread Vikram Mulukutla
From: Srivatsa Vaddagiri 

Add the necessary hooks to core and the various scheduling
classes that will allow WALT to track CPU utilization and
handle task migration between CPUs as well.

With CONFIG_SCHED_WALT enabled, schedutil will use WALT's cpu
utilization metric by default. This can be switched to PELT's
util_avg at runtime by the following command:

echo 0 > /proc/sys/kernel/sched_use_walt_metrics

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Vikram Mulukutla 
---
 kernel/sched/core.c | 29 -
 kernel/sched/deadline.c |  7 +++
 kernel/sched/debug.c|  9 +
 kernel/sched/fair.c |  9 +++--
 kernel/sched/rt.c   |  6 ++
 5 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44817c6..3b7f67d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -91,6 +91,8 @@
 #define CREATE_TRACE_POINTS
 #include 
 
+#include "walt.h"
+
 DEFINE_MUTEX(sched_domains_mutex);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -991,6 +993,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
p->on_rq = TASK_ON_RQ_MIGRATING;
dequeue_task(rq, p, 0);
+   walt_prepare_migrate(p, rq, true);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -998,6 +1001,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
+   walt_finish_migrate(p, rq, true);
enqueue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
@@ -1257,7 +1261,9 @@ static void __migrate_swap_task(struct task_struct *p, 
int cpu)
 
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src_rq, p, 0);
+   walt_prepare_migrate(p, src_rq, true);
set_task_cpu(p, cpu);
+   walt_finish_migrate(p, dst_rq, true);
activate_task(dst_rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(dst_rq, p, 0);
@@ -2072,13 +2078,19 @@ try_to_wake_up(struct task_struct *p, unsigned int 
state, int wake_flags)
 */
smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+   raw_spin_lock(&task_rq(p)->lock);
+   walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0);
+   raw_spin_unlock(&task_rq(p)->lock);
+
p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;
 
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
+   walt_prepare_migrate(p, task_rq(p), false);
set_task_cpu(p, cpu);
+   walt_finish_migrate(p, cpu_rq(cpu), false);
}
 #endif /* CONFIG_SMP */
 
@@ -2129,8 +2141,10 @@ static void try_to_wake_up_local(struct task_struct *p, 
struct pin_cookie cookie
 
trace_sched_waking(p);
 
-   if (!task_on_rq_queued(p))
+   if (!task_on_rq_queued(p)) {
+   walt_update_task_ravg(p, rq, TASK_WAKE, walt_ktime_clock(), 0);
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+   }
 
ttwu_do_wakeup(rq, p, 0, cookie);
if (schedstat_enabled())
@@ -2196,6 +2210,7 @@ static void __sched_fork(unsigned long clone_flags, 
struct task_struct *p)
p->se.nr_migrations = 0;
p->se.vruntime  = 0;
INIT_LIST_HEAD(&p->se.group_node);
+   walt_init_new_task_load(p);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq= NULL;
@@ -2570,6 +2585,8 @@ void wake_up_new_task(struct task_struct *p)
rq = __task_rq_lock(p, &rf);
post_init_entity_util_avg(&p->se);
 
+   walt_mark_task_starting(p);
+
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
@@ -3071,6 +3088,7 @@ void scheduler_tick(void)
update_rq_clock(rq);
curr->sched_class->task_tick(rq, curr, 0);
cpu_load_update_active(rq);
+   walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, walt_ktime_clock(), 0);
calc_global_load_tick(rq);
raw_spin_unlock(&rq->lock);
 
@@ -3322,6 +3340,7 @@ static void __sched notrace __schedule(bool preempt)
struct pin_cookie cookie;
struct rq *rq;
int cpu;
+   u64 wallclock;
 
cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -3385,6 +3404,9 @@ static void __sched notrace __schedule(bool preempt)
update_rq_clock(rq);
 
next = pick_next_task(rq, prev, cookie);
+   wallclock = walt_ktime_clock();
+   walt_update_task_ravg(prev, rq, PUT_PREV_TASK, wallclock, 0);
+   walt_update_task_ravg(ne

[RFC PATCH 1/3] sched: Introduce structures necessary for WALT

2016-10-28 Thread Vikram Mulukutla
From: Srivatsa Vaddagiri 

Add the per-task and per-runqueue data structures that
will later be used by Window Assisted Load Tracking (WALT)
to estimate task demand and CPU utilization.

Move cap_scale into sched.h as that will be needed by WALT
as well to implement frequency and capacity invariance.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Vikram Mulukutla 
---
 include/linux/sched.h | 39 +++
 kernel/sched/fair.c   |  2 --
 kernel/sched/sched.h  |  8 
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..64f8bec 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -315,6 +315,21 @@ extern char ___assert_task_state[1 - 2*!!(
 /* Task command name length */
 #define TASK_COMM_LEN 16
 
+/*
+ * These events may be replaced with a combination of existing scheduler flags
+ * provided that that doesn't make the implementation too fragile.
+ */
+enum task_event {
+   PUT_PREV_TASK   = 0,
+   PICK_NEXT_TASK  = 1,
+   TASK_WAKE   = 2,
+   TASK_MIGRATE= 3,
+   TASK_UPDATE = 4,
+   IRQ_UPDATE  = 5,
+};
+
+extern char *task_event_names[];
+
 #include 
 
 /*
@@ -1320,6 +1335,25 @@ struct sched_statistics {
 };
 #endif
 
+#ifdef CONFIG_SCHED_WALT
+
+/* ravg represents capacity scaled cpu-usage of tasks */
+struct ravg {
+   /*
+* 'mark_start' marks the most recent event for a task
+*
+* 'curr_window' represents task's cpu usage in its most recent
+* window
+*
+* 'prev_window' represents task's cpu usage in the window prior
+* to the one represented by 'curr_window'
+   */
+   u64 mark_start;
+   u32 curr_window, prev_window;
+};
+#endif
+
+
 struct sched_entity {
struct load_weight  load;   /* for load-balancing */
struct rb_node  run_node;
@@ -1480,6 +1514,11 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+
+#ifdef CONFIG_SCHED_WALT
+   struct ravg ravg;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce8b244..39c826d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2674,8 +2674,6 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
 }
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..9bf6925 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -65,6 +65,8 @@ static inline void cpu_load_update_active(struct rq *this_rq) 
{ }
 # define scale_load_down(w)(w)
 #endif
 
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
 /*
  * Task weight (visible to users) and its load (invisible to users) have
  * independent resolution, but they should be well calibrated. We use
@@ -664,6 +666,12 @@ struct rq {
u64 max_idle_balance_cost;
 #endif
 
+#ifdef CONFIG_SCHED_WALT
+   u64 window_start;
+   u64 curr_runnable_sum;
+   u64 prev_runnable_sum;
+#endif /* CONFIG_SCHED_WALT */
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
 #endif
-- 



[RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking

2016-10-28 Thread Vikram Mulukutla
From: Srivatsa Vaddagiri 

This patch implements an alternative window-based CPU utilization
tracking mechanism in the scheduler. Per task and per CPU counters are
updated with utilization statistics using a synchronized (across CPUs)
time source and a single statistic (prev_runnable_sum) is fed to schedutil
in an upcoming patch. A windowed view of time (window size determined by
walt_ravg_window) is used to determine CPU utilization.

There are two per-CPU-rq quantities maintained by Window Assisted Load Tracking
(WALT), both normalized to the max possible frequency and the max efficiency
(IPC) of that CPU (provided the arch supports the invariance APIs):

curr_runnable_sum: aggregate utilization of all tasks that executed
during the current (not yet completed) window

prev_runnable_sum: aggregate utilization of all tasks that executed
during the most recent completed window

prev_runnable_sum is the primary statistic used to guide CPU frequency in
lieu of PELT's cfs_rq->avg.util_avg. No additional policy is imposed on this
statistic, the assumption being that the consumer (e.g., schedutil) will
perform appropriate policy decisions (e.g., headroom) before deciding the
next P-state.

Corresponding to the aggregate statistics, WALT also maintains the
following stats per task:

curr_window - represents the cpu utilization of the task in its most
recently tracked window

prev_window - represents cpu utilization of task in the window prior
to the one being tracked by curr_window

WALT statistic updates are event driven, with updates occurring in
scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch),
task wakeup and during task migration. Migration simply involves removing a
task's curr_window and prev_window from the source CPU's curr_runnable_sum
and prev_runnable_sum, and adding the per-task counters to the destination
CPU's aggregate CPU counters. Execution time in an IRQ handler is accounted
in a CPU's curr_runnable_sum statistic, provided that the CPU was also
executing the idle task for the duration of the interrupt handler.
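
A minimal sketch of that migration step (not the actual walt.c code; locking,
window rollover and clock handling are omitted), using the fields introduced
in patch 1/3:

static void walt_sketch_migrate(struct task_struct *p,
				struct rq *src_rq, struct rq *dst_rq)
{
	/* Remove the task's window contributions from the source CPU... */
	src_rq->curr_runnable_sum -= p->ravg.curr_window;
	src_rq->prev_runnable_sum -= p->ravg.prev_window;

	/* ...and add them to the destination CPU's aggregates. */
	dst_rq->curr_runnable_sum += p->ravg.curr_window;
	dst_rq->prev_runnable_sum += p->ravg.prev_window;
}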

Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU
rq has tasks blocked on IO, idle-task execution is accounted in per-task
and per-CPU counters. Setting walt_io_is_busy will also cause interrupt
handlers in the idle task to update counters as if the idle task was
executing (instead of just the interrupt handler execution time).

The major tunable provided by WALT is walt_ravg_window, which represents
window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy
(described above) is set to 0 by default.

Potential upcoming changes/improvements include: the use of sched_clock
instead of ktime_get as a time source and support for an unsynchronized
(across CPUs) time source. The event codes (PUT_PREV_TASK etc.) may also
potentially be replaced by combinations of existing flags, provided that
doesn't make the implementation too fragile.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Vikram Mulukutla 
---
 include/linux/sched/sysctl.h |   1 +
 init/Kconfig |   9 +
 kernel/sched/Makefile|   1 +
 kernel/sched/cputime.c   |  10 +-
 kernel/sched/walt.c  | 428 +++
 kernel/sched/walt.h  |  69 +++
 kernel/sysctl.c  |  11 ++
 7 files changed, 528 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/walt.c
 create mode 100644 kernel/sched/walt.h

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 22db1e6..7007815 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -31,6 +31,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_sched_use_walt_metrics;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..1e629ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -390,6 +390,15 @@ config IRQ_TIME_ACCOUNTING
 
  If in doubt, say N here.
 
+config SCHED_WALT
+   bool "Support window based load tracking"
+   depends on SMP
+   help
+   This feature will allow the scheduler to maintain a tunable window
+   based set of metrics for tasks and runqueues. These metrics can be
+   used to guide task placement as well as task frequency requirements
+   for cpufreq governors.
+
 config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5e59b83..41ada04 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,6 +19,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += 

Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking

2016-10-28 Thread Vikram Mulukutla

On 2016-10-28 00:29, Peter Zijlstra wrote:

On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:


We propose Window-Assisted Load Tracking (WALT) as an alternative or additional
load tracking scheme in lieu of or along with PELT, one that in our estimation
better tracks task demand and CPU utilization especially for use cases on
mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide better
perf-per-watt numbers on asymmetric CPU (frequency and/or IPC) implementations,
(specifically on Qualcomm Snapdragon chipsets running Android) and its metrics
have been used to guide task placement and p-state selection (load balancing
in CFS still uses PELT statistics). WALT is now present in the android-common
kernel as well.


And how come we only learn of it after its already shipping? Isn't that
arse backwards?


Yes, but also we were not confident that it would be close to being acceptable
upstream since it was intricately tied to our HMP scheduler. However now that
more parties including the folks at ARM are interested, and given that EAS
exists and schedutil was merged into mainline, we felt it the right time to try
and introduce the concept. In general we are seriously trying to get more
things upstream and converge.



Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking

2016-10-28 Thread Vikram Mulukutla

On 2016-10-28 00:49, Peter Zijlstra wrote:

On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
This RFC patch has been tested on live X86 machines with the following sanity
and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick Bellasi
for initial code reviews):

(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics,
with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
baseline. WALT window size was 10ms. Only deltas above 3% are considered
non-noise. Power measured with Intel RAPL counters)


Was this comparison done using the use_walt_metric sysctl knob?


Yes, it was. You will want to see numbers against a pure 4.8.2 without any
of the WALT code, correct?


Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking

2016-10-28 Thread Vikram Mulukutla

On 2016-10-28 00:43, Peter Zijlstra wrote:

On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:

+u64 walt_ktime_clock(void)
+{
+   if (unlikely(walt_ktime_suspended))
+   return ktime_to_ns(ktime_last);
+   return ktime_get_ns();
+}



+static int walt_suspend(void)
+{
+   ktime_last = ktime_get();
+   walt_ktime_suspended = true;
+   return 0;
+}


No, ktime_get() will not be used in the scheduler. Imagine the joy if
that thing ends up being the HPET.


Agreed, this is an artifact from the full implementation that feeds into the
interactive governor, and thus both needed to use the same time source. It
shall go away.


Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking

2016-10-28 Thread Vikram Mulukutla

On 2016-10-28 00:46, Peter Zijlstra wrote:

On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
+void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, 
bool locked)

+{
+   u64 wallclock;
+   unsigned long flags;
+
+   if (!p->on_rq && p->state != TASK_WAKING)
+   return;
+
+   if (locked == false)
+   raw_spin_lock_irqsave(&dest_rq->lock, flags);
+



+
+   if (locked == false)
+   raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
+}
+
+void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, 
bool locked)

+{
+   u64 wallclock;
+   unsigned long flags;
+
+   if (!p->on_rq && p->state != TASK_WAKING)
+   return;
+
+   if (locked == false)
+   raw_spin_lock_irqsave(&src_rq->lock, flags);
+



+
+   if (locked == false)
+   raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+}


Seriously bad style that. Please, less bonghits before writing code.


This was my bad personal attempt at eliminating double-locking from the
original code. This was pointed out earlier and shall go away once I can come
up with a way to merge this into enqueue/dequeue sans bonghits :-)


spin_lock behavior with ARM64 big.Little/HMP

2016-11-17 Thread Vikram Mulukutla

Hello,

This isn't really a bug report, but just a description of a frequency/IPC
dependent behavior that I'm curious if we should worry about. The behavior
is exposed by questionable design so I'm leaning towards don't-care.

Consider these threads running in parallel on two ARM64 CPUs running mainline
Linux:

(Ordering of lines between the two columns does not indicate a sequence of
execution. Assume flag=0 initially.)

LittleARM64_CPU @ 300MHz (e.g. A53)  |  BigARM64_CPU @ 1.5GHz (e.g. A57)
-------------------------------------+-------------------------------------
spin_lock_irqsave(s)                 |  local_irq_save()
/* critical section */               |
flag = 1                             |  spin_lock(s)
spin_unlock_irqrestore(s)            |  while (!flag) {
                                     |          spin_unlock(s)
                                     |          cpu_relax();
                                     |          spin_lock(s)
                                     |  }
                                     |  spin_unlock(s)
                                     |  local_irq_restore()

I see a livelock occurring where the LittleCPU is never able to acquire the
lock, and the BigCPU is stuck forever waiting on 'flag' to be set.

Even with ticket spinlocks, this bit of code can cause a livelock (or very
long delays) if BigCPU runs fast enough. Afaics this can only happen if the
LittleCPU is unable to put its ticket in the queue (i.e., increment the next
field) since the store-exclusive keeps failing.

The problem is not present on SMP, and is mitigated by adding enough
additional clock cycles between the unlock and lock in the loop running
on the BigCPU. On big.Little, if both threads are scheduled on the same
cluster within the same clock domain, the problem is avoided.

Now the infinite loop may seem like questionable design but the problem
isn't entirely hypothetical; if BigCPU calls hrtimer_cancel with
interrupts disabled, this scenario can result if the hrtimer is about
to run on a littleCPU. It's of course possible that there's just enough
intervening code for the problem to not occur. At the very least it seems
that loops like the one running in the BigCPU above should come with a
WARN_ON(irqs_disabled()) or with a sufficient udelay() instead of the
cpu_relax().
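
In other words, something along these lines on the BigCPU side (illustrative
only; the 10us value is arbitrary):

WARN_ON(irqs_disabled());	/* don't poll a remote CPU with IRQs off */
while (!flag) {
	spin_unlock(&s);
	udelay(10);		/* give the slower CPU a real window to
				 * grab its ticket, instead of cpu_relax() */
	spin_lock(&s);
}
spin_unlock(&s);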


Thoughts?

Thanks,
Vikram


Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking

2016-10-31 Thread Vikram Mulukutla

On 2016-10-28 01:49, Peter Zijlstra wrote:

On Fri, Oct 28, 2016 at 12:57:05AM -0700, Vikram Mulukutla wrote:

On 2016-10-28 00:49, Peter Zijlstra wrote:
>On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>>This RFC patch has been tested on live X86 machines with the following
>>sanity
>>and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick
>>Bellasi
>>for initial code reviews):
>>
>>(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
>>graphics,
>>with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
>>baseline. WALT window size was 10ms. Only deltas above 3% are considered
>>non-noise.Power measured with Intel RAPL counters)
>
>Was this comparison done using the use_walt_metric sysctl knob?

Yes, it was. You will want to see numbers against a pure 4.8.2 without 
any

of the WALT code, correct?


Yep, because with the sysctl we still run all the accounting code. So
esp things like the hackbench run are meaningless (note that even the
CONFIG thing doesn't take out everything).

Also, I think it makes sense to always (also) compare against the
"performance" governor. That way you can see the drop in absolute
performance etc..


Ok, here are some numbers. I should be able to get the rest during the week.
The averages are pretty close, so I figured I would include some percentile
numbers. PELT and WALT numbers are with schedutil. On average it seems we're
introducing about 0.5% overhead with the current additional accounting.

Test: perf bench sched messaging -g 50 -l 5000
+------------+-------------+----------+--------+------------+
| 4.8.2-walt | Performance | Ondemand | PELT   | WALT(10ms) |
+------------+-------------+----------+--------+------------+
| 90th       | 17.077      | 17.088   | 17.088 | 17.159     |
| 96th       | 17.117      | 17.421   | 17.198 | 17.343     |
| Average    | 16.910      | 16.916   | 16.937 | 16.962     |
+------------+-------------+----------+--------+------------+
| 4.8.2-raw  | Performance | Ondemand | PELT   | WALT(10ms) |
+------------+-------------+----------+--------+------------+
| 90th       | 16.919      | 17.1694  | 16.986 | 0          |
| 96th       | 16.980      | 17.364   | 17.052 | 0          |
| Average    | 16.841      | 16.902   | 16.860 | 0          |
+------------+-------------+----------+--------+------------+


Re: spin_lock behavior with ARM64 big.Little/HMP

2016-11-18 Thread Vikram Mulukutla


Hi Sudeep,

Thanks for taking a look!

On 2016-11-18 02:30, Sudeep Holla wrote:

Hi Vikram,

On 18/11/16 02:22, Vikram Mulukutla wrote:

Hello,

This isn't really a bug report, but just a description of a frequency/IPC
dependent behavior that I'm curious if we should worry about. The behavior
is exposed by questionable design so I'm leaning towards don't-care.

Consider these threads running in parallel on two ARM64 CPUs running mainline
Linux:



Are you seeing this behavior with the mainline kernel on any platforms
as we have a sort of workaround for this ?



If I understand that workaround correctly, the ARM timer event stream is used
to periodically wake up CPUs that are waiting in WFE, is that right? I think
my scenario below may be different because LittleCPU doesn't actually wait
on a WFE event in the loop that is trying to increment lock->next, i.e. it's
stuck in the following loop:

ARM64_LSE_ATOMIC_INSN(
/* LL/SC */
"   prfmpstl1strm, %3\n"
"1: ldaxr   %w0, %3\n"
"   add %w1, %w0, %w5\n"
"   stxr%w2, %w1, %3\n"
"   cbnz%w2, 1b\n",


I have been testing internal platforms; I'll try to test on something
available publicly that's b.L. In any case, the timer event stream was enabled
when I tried this out.

(Ordering of lines between the two columns does not indicate a sequence of
execution. Assume flag=0 initially.)

LittleARM64_CPU @ 300MHz (e.g. A53)  |  BigARM64_CPU @ 1.5GHz (e.g. A57)
-------------------------------------+-------------------------------------
spin_lock_irqsave(s)                 |  local_irq_save()
/* critical section */               |
flag = 1                             |  spin_lock(s)
spin_unlock_irqrestore(s)            |  while (!flag) {
                                     |          spin_unlock(s)
                                     |          cpu_relax();
                                     |          spin_lock(s)
                                     |  }
                                     |  spin_unlock(s)
                                     |  local_irq_restore()



[...]

Thanks,
Vikram


Re: spin_lock behavior with ARM64 big.Little/HMP

2017-03-29 Thread Vikram Mulukutla


Hi Sudeep,



Interesting. Just curious if this is r0p0/p1 A53 ? If so, is the errata
819472 enabled ?


Sorry for bringing this up after the loo-ong delay, but I've been 
assured that the A53 involved is > r0p1. I've also confirmed this 
problem on multiple internal platforms, and I'm pretty sure that it 
occurs on any b.L out there today. Also, we found the same problematic 
lock design used in the workqueue code in the kernel, causing the same 
livelock. It's very very rare and requires a perfect set of 
circumstances.


If it would help I can provide a unit test if you folks would be 
generous enough to test it on the latest Juno or something b.L that's 
also upstream.


Thanks,
Vikram


Re: [RFD PATCH 3/5] sched/cpufreq_schedutil: make worker kthread be SCHED_DEADLINE

2017-03-30 Thread Vikram Mulukutla




OK

So there are two pieces here.

One is that if we want *all* drivers to work with schedutil, we need to keep
the kthread for the ones that will never be reworked (because nobody cares
etc).  But then perhaps the kthread implementation may be left alone (because
nobody cares etc).

The second one is that there are drivers operating in-context that work with
schedutil already, so I don't see major obstacles to making more drivers work
that way.  That would be only a matter of reworking the drivers in question.


Thanks,
Rafael


There are some MSM platforms that do need a kthread and would love to use
schedutil. This is all mainly due to the point that Vincent raised; having
to actually wait for voltage transitions before clock switches. I can't
speak about the future, but that's the situation right now. Leaving the
kthread alone for now would be appreciated!

Thanks,
Vikram


[PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme

2017-06-26 Thread Vikram Mulukutla
kthread_park waits for the target kthread to park itself with
__kthread_parkme using a completion variable. __kthread_parkme - which is
invoked by the target kthread - sets the completion variable before
calling schedule() to voluntarily get itself off of the runqueue.

This causes an interesting race in the hotplug path. takedown_cpu()
invoked for CPU_X attempts to park the cpuhp/X hotplug kthread before
running the stopper thread on CPU_X. kthread_park doesn't guarantee that
cpuhp/X is off of X's runqueue, only that the thread has executed
__kthread_parkme and set the completion. cpuhp/X may have been preempted
out before calling schedule() to voluntarily sleep. takedown_cpu proceeds
to run the stopper thread on CPU_X which promptly migrates off the
still-on-rq cpuhp/X thread to another cpu CPU_Y, setting its affinity
mask to something other than CPU_X alone.

This is OK - cpuhp/X may finally get itself off of CPU_Y's runqueue at
some later point. But if that doesn't happen (for example, if there's
an RT thread on CPU_Y), the kthread_unpark in a subsequent cpu_up call
for CPU_X will race with the still-on-rq condition. Even now we're
functionally OK because there is a wait_task_inactive in the
kthread_unpark(), BUT the following happens:

[   12.472745] BUG: scheduling while atomic: swapper/7/0/0x0002
[   12.472749] Modules linked in:
[   12.472756] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
[   12.472758] Hardware name: X
[   12.472760] Call trace:
[   12.472773] [] dump_backtrace+0x0/0x198
[   12.472777] [] show_stack+0x14/0x1c
[   12.472781] [] dump_stack+0x8c/0xac
[   12.472786] [] __schedule_bug+0x54/0x70
[   12.472792] [] __schedule+0x6b4/0x928
[   12.472794] [] schedule+0x3c/0xa0
[   12.472797] [] schedule_hrtimeout_range_clock+0x80/0xec
[   12.472799] [] schedule_hrtimeout+0x18/0x20
[   12.472803] [] wait_task_inactive+0x1a0/0x1a4
[   12.472806] [] __kthread_bind_mask+0x20/0x7c
[   12.472809] [] __kthread_bind+0x28/0x30
[   12.472811] [] __kthread_unpark+0x5c/0x60
[   12.472814] [] kthread_unpark+0x24/0x2c
[   12.472818] [] cpuhp_online_idle+0x50/0x90
[   12.472822] [] cpu_startup_entry+0x3c/0x1d4
[   12.472824] [] secondary_start_kernel+0x164/0x1b4

Since the kthread_unpark is invoked from a preemption-disabled context,
wait_task_inactive's action of invoking schedule is invalid, causing the
splat. Note that kthread_bind_mask is correctly attempting to re-set
the affinity mask since cpuhp is a per-cpu smpboot thread.

Instead of adding an expensive wait_task_inactive inside kthread_park()
or trying to muck with the hotplug code, let's just ensure that the
completion variable and the schedule happen atomically inside
__kthread_parkme. This focuses the fix to the hotplug requirement alone,
and removes the unnecessary migration of cpuhp/X.

Signed-off-by: Vikram Mulukutla 
---
 kernel/kthread.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..7ad3354 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -171,9 +171,20 @@ static void __kthread_parkme(struct kthread *self)
 {
__set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+   /*
+* Why the preempt_disable?
+* Hotplug needs to ensure that 'self' is off of the runqueue
+* as well, before scheduling the stopper thread that will
+* migrate tasks off of the runqueue that 'self' was running on.
+* This avoids unnecessary migration work and also ensures that
+* kthread_unpark in the cpu_up path doesn't race with
+* __kthread_parkme.
+*/
+   preempt_disable();
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
-   schedule();
+   schedule_preempt_disabled();
+   preempt_enable();
__set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme

2017-06-26 Thread Vikram Mulukutla


correcting Thomas Gleixner's email address. s/linuxtronix/linutronix

On 6/26/2017 3:18 PM, Vikram Mulukutla wrote:

kthread_park waits for the target kthread to park itself with
__kthread_parkme using a completion variable. __kthread_parkme - which is
invoked by the target kthread - sets the completion variable before
calling schedule() to voluntarily get itself off of the runqueue.

This causes an interesting race in the hotplug path. takedown_cpu()
invoked for CPU_X attempts to park the cpuhp/X hotplug kthread before
running the stopper thread on CPU_X. kthread_unpark doesn't guarantee that
cpuhp/X is off of X's runqueue, only that the thread has executed
__kthread_parkme and set the completion. cpuhp/X may have been preempted
out before calling schedule() to voluntarily sleep. takedown_cpu proceeds
to run the stopper thread on CPU_X which promptly migrates off the
still-on-rq cpuhp/X thread to another cpu CPU_Y, setting its affinity
mask to something other than CPU_X alone.

This is OK - cpuhp/X may finally get itself off of CPU_Y's runqueue at
some later point. But if that doesn't happen (for example, if there's
an RT thread on CPU_Y), the kthread_unpark in a subsequent cpu_up call
for CPU_X will race with the still-on-rq condition. Even now we're
functionally OK because there is a wait_task_inactive in the
kthread_unpark(), BUT the following happens:

[   12.472745] BUG: scheduling while atomic: swapper/7/0/0x0002
[   12.472749] Modules linked in:
[   12.472756] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
[   12.472758] Hardware name: X
[   12.472760] Call trace:
[   12.472773] [] dump_backtrace+0x0/0x198
[   12.472777] [] show_stack+0x14/0x1c
[   12.472781] [] dump_stack+0x8c/0xac
[   12.472786] [] __schedule_bug+0x54/0x70
[   12.472792] [] __schedule+0x6b4/0x928
[   12.472794] [] schedule+0x3c/0xa0
[   12.472797] [] schedule_hrtimeout_range_clock+0x80/0xec
[   12.472799] [] schedule_hrtimeout+0x18/0x20
[   12.472803] [] wait_task_inactive+0x1a0/0x1a4
[   12.472806] [] __kthread_bind_mask+0x20/0x7c
[   12.472809] [] __kthread_bind+0x28/0x30
[   12.472811] [] __kthread_unpark+0x5c/0x60
[   12.472814] [] kthread_unpark+0x24/0x2c
[   12.472818] [] cpuhp_online_idle+0x50/0x90
[   12.472822] [] cpu_startup_entry+0x3c/0x1d4
[   12.472824] [] secondary_start_kernel+0x164/0x1b4

Since the kthread_unpark is invoked from a preemption-disabled context,
wait_task_inactive's action of invoking schedule is invalid, causing the
splat. Note that kthread_bind_mask is correctly attempting to re-set
the affinity mask since cpuhp is a per-cpu smpboot thread.

Instead of adding an expensive wait_task_inactive inside kthread_park()
or trying to muck with the hotplug code, let's just ensure that the
completion variable and the schedule happen atomically inside
__kthread_parkme. This focuses the fix to the hotplug requirement alone,
and removes the unnecessary migration of cpuhp/X.

Signed-off-by: Vikram Mulukutla 
---
  kernel/kthread.c | 13 -
  1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..7ad3354 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -171,9 +171,20 @@ static void __kthread_parkme(struct kthread *self)
  {
__set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+   /*
+* Why the preempt_disable?
+* Hotplug needs to ensure that 'self' is off of the runqueue
+* as well, before scheduling the stopper thread that will
+* migrate tasks off of the runqueue that 'self' was running on.
+* This avoids unnecessary migration work and also ensures that
+* kthread_unpark in the cpu_up path doesn't race with
+* __kthread_parkme.
+*/
+   preempt_disable();
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
-   schedule();
+   schedule_preempt_disabled();
+   preempt_enable();
__set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);





Re: [PATCH v9 1/5] firmware: add extensible driver data params

2017-06-26 Thread Vikram Mulukutla

On 6/26/2017 10:33 AM, Luis R. Rodriguez wrote:

On Sat, Jun 24, 2017 at 02:39:51PM +0200, Greg KH wrote:

On Sat, Jun 24, 2017 at 02:48:28AM +0200, Luis R. Rodriguez wrote:

On Fri, Jun 23, 2017 at 04:09:29PM -0700, Linus Torvalds wrote:
On Fri, Jun 23, 2017 at 3:43 PM, Luis R. Rodriguez 
 wrote:


Ah, yes! Here is what I believe seems to be the *crux* issue of these patch
series and I'm happy we have finally landed on it. Yes, indeed the new API
proposed here provides more flexibility, and it does so by embracing a
"data driven" API Vs the traditional procedural APIs we have seen for
*the firmware API*.


This has been going on forever. Everybody hates your data-driven 
one.


Before you, the only person who had expressed disdain here was Greg.


Very few people actually review code, you know that.


Using that logic, then of course "everybody" was *very* fitting ;)

Then again others who actually are working on extending the firmware API (Yi
Li), or maintaining vendor trees (Vikram), did express their opinions on the
current codebase and their appreciation for the changes I made, however this
went selectively unnoticed.


Things like that may be ok as an internal implementation, but even
there it's questionable if it then means a big disconnect between what
people actually use (the normal functional model) and the
implementation.
implementation.


A vendor tree implemented their *own* solution and were willing to maintain
it despite this likely making it hard to port stable fixes. That I think says
a lot for a need...


What vendor tree?  Where was it shipped?


The msm-3.18 kernel [0], so assuming this goes to mobile devices, this could
mean millions of devices.

https://source.codeaurora.org/quic/la/kernel/msm-3.18/commit/drivers/base/firmware_class.c?h=msm-3.18&id=7aa7efd3c150840369739893a84bd1d9f9774319


Why was it external and how is it different from your patches?


As is typical with external trees -- it would seem Vikram actually wrote the
original request_firmware_into_buf() API for the msm tree.  It contained the
fw_desc changes. Stephen Boyd seems to have worked on the actual upstreaming
effort and he dropped that fw_desc effort from the upstreaming effort.

Vikram noted he had had a similar internal discussion with Stephen Boyd, as I
am having with you on this thread, back when request_firmware_into_buf() was
being upstreamed [0]. He noted that back then the reason for this proposed
change was that "the number of things being passed around between layers of
functions inside firmware_class seemed a bit untenable". I will note that
around that time I had proposed a similar change using the fw_desc name; it
was only later that this was renamed to a params postfix as Linus did not
like the descriptor name.


[0] 
https://lkml.kernel.org/r/20ac6fa65c8ff4ef83386aa1e8d5c...@codeaurora.org


The only difference is that his patch only modifies the private members
of the internal API and routines from my patch 1/5, and he kept the
"descriptor" name Linus disliked a while ago. This is precisely why AKASHI
noted I could split up my patch 1 in more ways in this series to help *patch
review*.

Was it used because your version has taken so long to be submitted/reviewed?


Vikram would have a better idea as he is the one who authored it, but it
would seem this effort was in parallel to my own at that time.



I must shamefully admit that the story is a bit older - the patch I
originally worked on was on a v3.4 based tree. We had been forward
porting it until Stephen Boyd was kind enough (or tired of it) to take
time out of his clock maintainer-ship and upstream the
request_firmware_into_buf API. At that point in time it seemed that the
'desc' approach was unnecessary, and I agreed. So Luis's series came
in much later and wasn't a factor in forward-porting the patches.
While it does seem that the _internal_ implementation of
firmware_class can be a bit friendlier to adding the features that
are on their way, I can't say the same about the API being exposed to
drivers in mainline; maintainers and folks with more experience in
kernel API evolution are better equipped to answer that question.

Thanks,
Vikram


Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme

2017-06-28 Thread Vikram Mulukutla



Ping. This is happening on x86 across suspend/resume too.
https://bugs.freedesktop.org/show_bug.cgi?id=100113

On 2017-06-26 16:03, Vikram Mulukutla wrote:

correcting Thomas Gleixner's email address. s/linuxtronix/linutronix

On 6/26/2017 3:18 PM, Vikram Mulukutla wrote:

kthread_park waits for the target kthread to park itself with
__kthread_parkme using a completion variable. __kthread_parkme - which 
is

invoked by the target kthread - sets the completion variable before
calling schedule() to voluntarily get itself off of the runqueue.




Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2 4/6] cpufreq: schedutil: update CFS util only if used

2017-07-10 Thread Vikram Mulukutla

On 2017-07-07 23:14, Joel Fernandes wrote:

Hi Vikram,

On Thu, Jul 6, 2017 at 11:44 PM, Vikram Mulukutla
 wrote:

On 2017-07-04 10:34, Patrick Bellasi wrote:


Currently the utilization of the FAIR class is collected before 
locking
the policy. Although that should not be a big issue for most cases, 
we

also don't really know how much latency there can be between the
utilization reading and its usage.

Let's get the FAIR utilization right before its usage to be better in
sync with the current status of a CPU.

Signed-off-by: Patrick Bellasi 




Given that the utilization update hooks are called with the per-cpu rq 
lock
held (for all classes), I don't think PELT utilization can change 
throughout
the lifetime of the cpufreq_update_{util,this_cpu} call? Even with 
Viresh's

remote cpu callback series we'd still have to hold the rq lock across
cpufreq_update_util..  what can change today is 'max'
(arch_scale_cpu_capacity) when a cpufreq policy is shared, so the 
patch is

still needed for that reason I think?



I didn't follow, Could you elaborate more why you think the patch
helps with the case where max changes while the per-cpu rq lock held?



So going by Patrick's commit text, the concern was a time-of-check/time-of-use
(TOCTOU) problem, but since we agree that CFS utilization can't change
within an rq-locked critical section, the only thing that can
change is 'max'. So you might be the 8th cpu in line waiting
for the sg-policy lock, and after you get the lock, the max may
be outdated.

But come to think of it max changes should be triggering schedutil
updates and those shouldn't be rate-throttled, so maybe we don't
need this at all? It's still somewhat future-proof in case there
is some stat that we read in sugov_get_util that can be updated
asynchronously. However we can put it in when we need it...
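
To make the window concrete, here's a rough sketch of the path being
discussed; the struct and function names below are simplified stand-ins for
illustration, not the actual schedutil code:

#include <linux/spinlock.h>

/* Simplified stand-in for the shared-policy update path. */
struct toy_sg_policy {
        raw_spinlock_t update_lock;
        unsigned long cached_util;
        unsigned long cached_max;
};

static void toy_update_shared(struct toy_sg_policy *sg_policy,
                              unsigned long util, unsigned long max)
{
        /*
         * Time-of-check: 'util' and 'max' were sampled by the caller,
         * outside the policy lock. For a shared policy, 'max'
         * (arch_scale_cpu_capacity) can change while we wait below.
         */
        raw_spin_lock(&sg_policy->update_lock);

        /*
         * Time-of-use: if several CPUs queued on the lock above, the
         * sampled values may be stale by now. Patrick's patch moves
         * the sampling to this side of the lock.
         */
        sg_policy->cached_util = util;
        sg_policy->cached_max = max;

        raw_spin_unlock(&sg_policy->update_lock);
}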


thanks,

-Joel


--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme

2017-07-05 Thread Vikram Mulukutla

On 7/4/2017 9:07 AM, Peter Zijlstra wrote:

On Mon, Jun 26, 2017 at 03:18:03PM -0700, Vikram Mulukutla wrote:

  kernel/kthread.c | 13 -
  1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 26db528..7ad3354 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -171,9 +171,20 @@ static void __kthread_parkme(struct kthread *self)
  {
__set_current_state(TASK_PARKED);
while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
+   /*
+* Why the preempt_disable?
+* Hotplug needs to ensure that 'self' is off of the runqueue
+* as well, before scheduling the stopper thread that will
+* migrate tasks off of the runqueue that 'self' was running on.
+* This avoids unnecessary migration work and also ensures that
+* kthread_unpark in the cpu_up path doesn't race with
+* __kthread_parkme.
+*/
+   preempt_disable();
if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
complete(&self->parked);
+   schedule_preempt_disabled();


This is broken. schedule_preempt_disabled() doesn't guarantee no
preemptions, just makes it less likely.


Right, the API just informs the scheduler that the calling thread
wishes to have preemption disabled when the API returns. I thought
it was going to guarantee no preemption until the thread is actually
off of the runqueue, but I see the window where an interrupt might
preempt. Doh.

Separate from this hotplug problem, would it be entirely moronic to have
the API disable and enable local interrupts across that short window? I
suppose there's no one that needs this sort of thing so.. no?




+   preempt_enable();
__set_current_state(TASK_PARKED);
}
clear_bit(KTHREAD_IS_PARKED, &self->flags);


Thanks,
Vikram


Re: [PATCH] kthread: Atomically set completion and perform dequeue in __kthread_parkme

2017-07-05 Thread Vikram Mulukutla

On 7/4/2017 12:49 PM, Thomas Gleixner wrote:

On Mon, 26 Jun 2017, Vikram Mulukutla wrote:

On 6/26/2017 3:18 PM, Vikram Mulukutla wrote:

kthread_park waits for the target kthread to park itself with
__kthread_parkme using a completion variable. __kthread_parkme - which is
invoked by the target kthread - sets the completion variable before
calling schedule() to voluntarily get itself off of the runqueue.

This causes an interesting race in the hotplug path. takedown_cpu()
invoked for CPU_X attempts to park the cpuhp/X hotplug kthread before
running the stopper thread on CPU_X. kthread_unpark doesn't guarantee that
cpuhp/X is off of X's runqueue, only that the thread has executed
__kthread_parkme and set the completion. cpuhp/X may have been preempted
out before calling schedule() to voluntarily sleep. takedown_cpu proceeds
to run the stopper thread on CPU_X which promptly migrates off the
still-on-rq cpuhp/X thread to another cpu CPU_Y, setting its affinity
mask to something other than CPU_X alone.

This is OK - cpuhp/X may finally get itself off of CPU_Y's runqueue at
some later point. But if that doesn't happen (for example, if there's
an RT thread on CPU_Y), the kthread_unpark in a subsequent cpu_up call
for CPU_X will race with the still-on-rq condition. Even now we're
functionally OK because there is a wait_task_inactive in the
kthread_unpark(), BUT the following happens:

[   12.472745] BUG: scheduling while atomic: swapper/7/0/0x0002


Thats not the worst problem. We could simply enable preemption there, but
the real issue is that this is the idle task of the upcoming CPU which is
not supposed to schedule in the first place.

So no, your 'fix' is just papering over the underlying issue.

And yes, the moron who did not think about wait_task_inactive() being
called via kthread_unpark() -> kthread_bind() is me.

I'm testing a proper fix for it right now. Will post later.


Thanks. It did seem totally wrong to have any sort of scheduling in the idle
thread, as the subsequent warnings indicate, but I didn't feel confident
enough to mess around with the hotplug state machine.


Thanks,

tglx



Thanks,
Vikram


Re: [PATCH v2 4/6] cpufreq: schedutil: update CFS util only if used

2017-07-06 Thread Vikram Mulukutla

On 2017-07-04 10:34, Patrick Bellasi wrote:

Currently the utilization of the FAIR class is collected before locking
the policy. Although that should not be a big issue for most cases, we
also don't really know how much latency there can be between the
utilization reading and its usage.

Let's get the FAIR utilization right before its usage to be better in
sync with the current status of a CPU.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Rafael J. Wysocki 
Cc: Viresh Kumar 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@vger.kernel.org
---
 kernel/sched/cpufreq_schedutil.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 98704d8..df433f1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -308,10 +308,9 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
if (unlikely(current == sg_policy->thread))
return;

-   sugov_get_util(&util, &max);
-
raw_spin_lock(&sg_policy->update_lock);

+   sugov_get_util(&util, &max);
sg_cpu->util = util;
sg_cpu->max = max;


Given that the utilization update hooks are called with the per-cpu rq
lock held (for all classes), I don't think PELT utilization can change
throughout the lifetime of the cpufreq_update_{util,this_cpu} call? Even
with Viresh's remote cpu callback series we'd still have to hold the rq
lock across cpufreq_update_util. What can change today is 'max'
(arch_scale_cpu_capacity) when a cpufreq policy is shared, so the patch
is still needed for that reason, I think?


Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the control CPU

2017-07-07 Thread Vikram Mulukutla

On 2017-07-04 13:20, Thomas Gleixner wrote:

Vikram reported the following backtrace:

   BUG: scheduling while atomic: swapper/7/0/0x0002
   CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
   schedule
   schedule_hrtimeout_range_clock
   schedule_hrtimeout
   wait_task_inactive
   __kthread_bind_mask
   __kthread_bind
   __kthread_unpark
   kthread_unpark
   cpuhp_online_idle
   cpu_startup_entry
   secondary_start_kernel

He analyzed correctly that a parked cpu hotplug thread of an offlined 
CPU
was still on the runqueue when the CPU came back online and tried to 
unpark

it. This causes the thread which invoked kthread_unpark() to call
wait_task_inactive() and subsequently schedule() with preemption 
disabled.

His proposed workaround was to "make sure" that a parked thread has
scheduled out when the CPU goes offline, so the situation cannot 
happen.


But that's still wrong because the root cause is not the fact that the
percpu thread is still on the runqueue and neither that preemption is
disabled, which could be simply solved by enabling preemption before
calling kthread_unpark().

The real issue is that the calling thread is the idle task of the 
upcoming
CPU, which is not supposed to call anything which might sleep.  The 
moron,
who wrote that code, missed completely that kthread_unpark() might end 
up

in schedule().



Agreed. Presumably we are still OK with the cpu hotplug thread being
migrated off to random CPUs and its unfinished kthread_parkme racing 
with
a subsequent unpark? The cpuhp/X thread ends up on a random rq if it 
can't
do schedule() in time because migration/X yanks it off of the dying CPU 
X.
Apart from slightly disconcerting traces showing cpuhp/X momentarily 
executing

on CPU Y, there's no problem that I can see of course.

Probably worth mentioning that the following patch from Junaid Shahid 
seems

to indicate that such a race doesn't always work out as desired:
https://lkml.org/lkml/2017/4/28/755


The solution is simpler than expected. The thread which controls the
hotplug operation is waiting for the CPU to call complete() on the 
hotplug
state completion. So the idle task of the upcoming CPU can set its 
state to
CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the 
control
task on a different CPU, which then can safely do the unpark and kick 
the
now unparked hotplug thread of the upcoming CPU to complete the bringup 
to

the final target state.

Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully 
up")

Reported-by: Vikram Mulukutla 
Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Sebastian Siewior 
Cc: ru...@rustcorp.com.au
Cc: t...@kernel.org
Cc: a...@linux-foundation.org
Cc: sta...@vger.kernel.org

---
 kernel/cpu.c |   30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -271,11 +271,25 @@ void cpu_hotplug_enable(void)
 EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 #endif /* CONFIG_HOTPLUG_CPU */

+static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st);
+
 static int bringup_wait_for_ap(unsigned int cpu)
 {
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

+   /* Wait for the CPU to reach IDLE_ONLINE */
wait_for_completion(&st->done);
+   BUG_ON(!cpu_online(cpu));
+
+	/* Unpark the stopper thread and the hotplug thread of the target cpu */
+   stop_machine_unpark(cpu);
+   kthread_unpark(st->thread);
+
+   /* Should we go further up ? */
+   if (st->target > CPUHP_AP_ONLINE_IDLE) {
+   __cpuhp_kick_ap_work(st);
+   wait_for_completion(&st->done);
+   }
return st->result;
 }

@@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu)
irq_unlock_sparse();
if (ret)
return ret;
-   ret = bringup_wait_for_ap(cpu);
-   BUG_ON(!cpu_online(cpu));
-   return ret;
+   return bringup_wait_for_ap(cpu);
 }

 /*
@@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cp
 void cpuhp_online_idle(enum cpuhp_state state)
 {
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
-   unsigned int cpu = smp_processor_id();

/* Happens for the boot cpu */
if (state != CPUHP_AP_ONLINE_IDLE)
return;

st->state = CPUHP_AP_ONLINE_IDLE;
-
-   /* Unpark the stopper thread and the hotplug thread of this cpu */
-   stop_machine_unpark(cpu);
-   kthread_unpark(st->thread);
-
-   /* Should we go further up ? */
-   if (st->target > CPUHP_AP_ONLINE_IDLE)
-   __cpuhp_kick_ap_work(st);
-   else
-   complete(&st->done);
+   complete(&st->done);
 }

 /* Requires cpu_add_remove_lock to be held */


Nice and clean otherwise. Channagoud was instrumental in collecting
data, theorizing with me and testing your fix, so if the 

Re: [PATCH] smp/hotplug: Move unparking of percpu threads to the control CPU

2017-07-07 Thread Vikram Mulukutla

On 2017-07-07 00:47, Vikram Mulukutla wrote:

On 2017-07-04 13:20, Thomas Gleixner wrote:

Vikram reported the following backtrace:

   BUG: scheduling while atomic: swapper/7/0/0x0002
   CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.9.32-perf+ #680
   schedule
   schedule_hrtimeout_range_clock
   schedule_hrtimeout
   wait_task_inactive
   __kthread_bind_mask
   __kthread_bind
   __kthread_unpark
   kthread_unpark
   cpuhp_online_idle
   cpu_startup_entry
   secondary_start_kernel

He analyzed correctly that a parked cpu hotplug thread of an offlined 
CPU
was still on the runqueue when the CPU came back online and tried to 
unpark

it. This causes the thread which invoked kthread_unpark() to call
wait_task_inactive() and subsequently schedule() with preemption 
disabled.

His proposed workaround was to "make sure" that a parked thread has
scheduled out when the CPU goes offline, so the situation cannot 
happen.


But that's still wrong because the root cause is not the fact that the
percpu thread is still on the runqueue and neither that preemption is
disabled, which could be simply solved by enabling preemption before
calling kthread_unpark().

The real issue is that the calling thread is the idle task of the 
upcoming
CPU, which is not supposed to call anything which might sleep.  The 
moron,
who wrote that code, missed completely that kthread_unpark() might end 
up

in schedule().



Agreed. Presumably we are still OK with the cpu hotplug thread being
migrated off to random CPUs and its unfinished kthread_parkme racing 
with
a subsequent unpark? The cpuhp/X thread ends up on a random rq if it 
can't
do schedule() in time because migration/X yanks it off of the dying CPU 
X.
Apart from slightly disconcerting traces showing cpuhp/X momentarily 
executing

on CPU Y, there's no problem that I can see of course.

Probably worth mentioning that the following patch from Junaid Shahid 
seems

to indicate that such a race doesn't always work out as desired:
https://lkml.org/lkml/2017/4/28/755


Ah, Junaid's problem/patch wouldn't be relevant in the hotplug case because
of the completion I think.




The solution is simpler than expected. The thread which controls the
hotplug operation is waiting for the CPU to call complete() on the 
hotplug
state completion. So the idle task of the upcoming CPU can set its 
state to
CPUHP_AP_ONLINE_IDLE and invoke complete(). This in turn wakes the 
control
task on a different CPU, which then can safely do the unpark and kick 
the
now unparked hotplug thread of the upcoming CPU to complete the 
bringup to

the final target state.

Fixes: 8df3e07e7f21 ("cpu/hotplug: Let upcoming cpu bring itself fully 
up")

Reported-by: Vikram Mulukutla 
Signed-off-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Sebastian Siewior 
Cc: ru...@rustcorp.com.au
Cc: t...@kernel.org
Cc: a...@linux-foundation.org
Cc: sta...@vger.kernel.org

---
 kernel/cpu.c |   30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -271,11 +271,25 @@ void cpu_hotplug_enable(void)
 EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 #endif /* CONFIG_HOTPLUG_CPU */

+static void __cpuhp_kick_ap_work(struct cpuhp_cpu_state *st);
+
 static int bringup_wait_for_ap(unsigned int cpu)
 {
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

+   /* Wait for the CPU to reach IDLE_ONLINE */
wait_for_completion(&st->done);
+   BUG_ON(!cpu_online(cpu));
+
+	/* Unpark the stopper thread and the hotplug thread of the target cpu */
+   stop_machine_unpark(cpu);
+   kthread_unpark(st->thread);
+
+   /* Should we go further up ? */
+   if (st->target > CPUHP_AP_ONLINE_IDLE) {
+   __cpuhp_kick_ap_work(st);
+   wait_for_completion(&st->done);
+   }
return st->result;
 }

@@ -296,9 +310,7 @@ static int bringup_cpu(unsigned int cpu)
irq_unlock_sparse();
if (ret)
return ret;
-   ret = bringup_wait_for_ap(cpu);
-   BUG_ON(!cpu_online(cpu));
-   return ret;
+   return bringup_wait_for_ap(cpu);
 }

 /*
@@ -775,23 +787,13 @@ void notify_cpu_starting(unsigned int cp
 void cpuhp_online_idle(enum cpuhp_state state)
 {
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
-   unsigned int cpu = smp_processor_id();

/* Happens for the boot cpu */
if (state != CPUHP_AP_ONLINE_IDLE)
return;

st->state = CPUHP_AP_ONLINE_IDLE;
-
-   /* Unpark the stopper thread and the hotplug thread of this cpu */
-   stop_machine_unpark(cpu);
-   kthread_unpark(st->thread);
-
-   /* Should we go further up ? */
-   if (st->target > CPUHP_AP_ONLINE_IDLE)
-   __cpuhp_kick_ap_work(st);
-   else
-   complete(&st->done);
+   complete(&st->done);
 }

 /* Require

Re: cpuidle and cpufreq coupling?

2017-07-20 Thread Vikram Mulukutla

On 7/20/2017 3:56 PM, Florian Fainelli wrote:

On 07/20/2017 07:45 AM, Peter Zijlstra wrote:






Can your ARM part change OPP without scheduling? Because (for obvious
reasons) the idle thread is not supposed to block.


I think it should be able to do that, but I am not sure that if I went
through the cpufreq API it would be that straight forward so I may have
to re-implement some of the frequency scaling logic outside of cpufreq
(or rather make the low-level parts some kind of library I guess).



I think I can safely mention that some of our non-upstream idle drivers
in the past have invoked low level clock drivers to atomically switch
CPUs to low frequency OPPs, with no interaction whatsoever with cpufreq.
It was maintainable since both the idle and clock drivers were
qcom-specific. However this is no longer necessary in recent designs and
I really hope we never need to do this again...

We didn't have to do a voltage switch, just PLL or mux work, so this was
doable. I'm guessing your atomic switching also allows voltage reduction?

If your architecture allows another CPU to change the entering-idle 
CPU's

frequency, synchronization will be necessary as well - this is where it
can get a bit tricky.
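
For illustration, a minimal sketch of the kind of arrangement I'm describing;
every function name below is hypothetical (a real version would live in a
SoC-specific idle/clock driver pair), and this is not code from our tree:

#include <linux/spinlock.h>

/* Hypothetical low-level, non-sleeping clock ops (illustrative only). */
extern void soc_cpu_clk_mux_slow(int cpu);
extern void soc_cpu_clk_mux_fast(int cpu);

static DEFINE_RAW_SPINLOCK(cluster_clk_lock);

static void idle_enter_low_opp(int cpu)
{
        unsigned long flags;

        /*
         * The raw lock is the "tricky" synchronization bit: it keeps
         * another CPU in the cluster from reprogramming the shared
         * mux/PLL while this CPU is switching itself down.
         */
        raw_spin_lock_irqsave(&cluster_clk_lock, flags);
        soc_cpu_clk_mux_slow(cpu);      /* PLL/mux work only, no voltage change */
        raw_spin_unlock_irqrestore(&cluster_clk_lock, flags);

        /* ... enter the low power state (e.g. cpu_do_idle()) ... */

        raw_spin_lock_irqsave(&cluster_clk_lock, flags);
        soc_cpu_clk_mux_fast(cpu);      /* restore the fast OPP on exit */
        raw_spin_unlock_irqrestore(&cluster_clk_lock, flags);
}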

Thanks,
Vikram



Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-08-28 Thread Vikram Mulukutla

Hi Will,

On 2017-08-25 12:48, Vikram Mulukutla wrote:

Hi Will,

On 2017-08-15 11:40, Will Deacon wrote:

Hi Vikram,

On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:

On 2017-07-31 06:13, Will Deacon wrote:
>On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>>On 2017-07-28 02:28, Will Deacon wrote:
>>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

>>>
>>This does seem to help. Here's some data after 5 runs with and without
>>the
>>patch.
>
>Blimey, that does seem to make a difference. Shame it's so ugly! Would you
>be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
>it set to 1 in the diff I posted, but that might be higher than
>optimal.
>It would be interested to see if it correlates with num_possible_cpus()
>for the highly contended case.
>
>Will

Sorry for the late response - I should hopefully have some more data 
with

different thresholds before the week is finished or on Monday.


Did you get anywhere with the threshold heuristic?

Will


Here's some data from experiments that I finally got to today. I decided
to recompile for every value of the threshold. I was doing a binary search
of sorts and then started reducing by orders of magnitude. There are pairs
of rows here:



Well here's something interesting. I tried a different platform and 
found that
the workaround doesn't help much at all, similar to Qiao's observation 
on his b.L

chipset. Something to do with the WFE implementation or event-stream?

I modified your patch to use a __delay(1) in place of the WFEs and this 
was
the result (still with the 10k threshold). The worst-case lock time for 
cpu0
drastically improves. Given that cpu0 re-enables interrupts between each 
lock

attempt in my test case, I think the lock count matters less here.
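
For reference, here's roughly the shape of the construct I'm testing. This is
only an approximation of the bodge (the macro, per-cpu variable names and the
threshold value are made up), with the line I swapped out for the __delay(1)
experiment marked:

#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/delay.h>
#include <asm/barrier.h>

#define RELAX_BACKOFF_THRESHOLD 10000   /* made-up value, cf. the 10k above */

static DEFINE_PER_CPU(unsigned long, relax_last_pc);
static DEFINE_PER_CPU(unsigned int, relax_count);

/*
 * Approximation of the cpu_relax() back-off: if we keep relaxing from the
 * same call site, periodically stall so the other contender gets a chance
 * to grab the cache line.
 */
#define cpu_relax_backoff()                                             \
do {                                                                    \
        unsigned long __pc = _THIS_IP_;                                 \
        if (__pc == this_cpu_read(relax_last_pc)) {                     \
                if (this_cpu_inc_return(relax_count) >=                 \
                    RELAX_BACKOFF_THRESHOLD) {                          \
                        this_cpu_write(relax_count, 0);                 \
                        wfe();  /* my experiment: __delay(1) instead */ \
                }                                                       \
        } else {                                                        \
                this_cpu_write(relax_last_pc, __pc);                    \
                this_cpu_write(relax_count, 0);                         \
        }                                                               \
        barrier();                                                      \
} while (0)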

cpu_relax() patch with WFEs (original workaround):
(pairs of rows, first row is with c0 at 300Mhz, second
with c0 at 1.9GHz. Both rows have cpu4 at 2.3GHz max time
is in microseconds)
--|
c0 max time| c0 lock count| c4 max time| c4 lock count|
--|
 999843|25|   2|  12988498| -> c0/cpu0 at 300Mhz
  0|   8421132|   1|   9152979| -> c0/cpu0 at 1.9GHz

--|
 999860|   160|   2|  12963487|
  1|   8418492|   1|   9158001|
--|
 999381|   734|   2|  12988636|
  1|   8387562|   1|   9128056|
--|
 989800|   750|   3|  12996473|
  1|   8389091|   1|   9112444|
--|

cpu_relax() patch with __delay(1):
(pairs of rows, first row is with c0 at 300Mhz, second
with c0 at 1.9GHz. Both rows have cpu4 at 2.3GHz. max time
is in microseconds)
--|
c0 max time| c0 lock count| c4 max time| c4 lock count|
--|
   7703| 1532|2|  13035203| -> c0/cpu0 at 300Mhz
  1|  8511686|1|   8550411| -> c0/cpu0 at 1.9GHz

--|
   7801| 1561|2|  13040188|
  1|  8553985|1|   8609853|
--|
   3953| 1576|2|  13049991|
  1|  8576370|1|   8611533|
--|
   3953| 1557|2|  13030553|
  1|  8509020|1|   8543883|
--|

I should also note that my earlier kernel was 4.9-stable based
and the one above was on a 4.4-stable based kernel.

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-08-25 Thread Vikram Mulukutla


Hi Will,

On 2017-08-15 11:40, Will Deacon wrote:

Hi Vikram,

On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:

On 2017-07-31 06:13, Will Deacon wrote:
>On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>>On 2017-07-28 02:28, Will Deacon wrote:
>>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

>>>
>>This does seem to help. Here's some data after 5 runs with and without
>>the
>>patch.
>
>Blimey, that does seem to make a difference. Shame it's so ugly! Would you
>be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
>it set to 1 in the diff I posted, but that might be higher than
>optimal.
>It would be interested to see if it correlates with num_possible_cpus()
>for the highly contended case.
>
>Will

Sorry for the late response - I should hopefully have some more data 
with

different thresholds before the week is finished or on Monday.


Did you get anywhere with the threshold heuristic?

Will


Here's some data from experiments that I finally got to today. I decided
to recompile for every value of the threshold. I was doing a binary search
of sorts and then started reducing by orders of magnitude. There are pairs
of rows here:


Row1 is with cpu0 (little) at 300MHz and cpu4 at 1.9Ghz
Row2 is with cpu0 (little) at 1.5GHz and cpu4 at 1.9Ghz

It looks like even with the threshold set to 1, we don't hit the worst
case of a single instance of locking taking a long time, probably a 
consequence
of how the test is designed? However as we reduce the threshold, the 
fairness
in terms of how many times each CPU acquires the lock skews towards the 
big CPU,

starting with threshold=500

If I understand the code correctly, the upper 32 bits of an ARM64 
virtual
address will overflow when 1 is added to it, and so we'll keep WFE'ing 
on

every subsequent cpu_relax invoked from the same PC, until we cross the
hard-coded threshold, right?


CPU_RELAX_WFE_THRESHOLD = 5000 and 2500 (very similar results)
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
2|2763059|2|7323169
0|11477590|1|12110373

3|2762788|2|7329881
1|11557987|1|12042980

3|2765912|2|7308789
1|11470294|1|12120074

3|2761793|2|7333907
1|11431841|1|12155046

3|2762402|2|7328843
1|11495705|1|12096518

3|2764392|2|7308640
1|11479146|1|12111419
|

CPU_RELAX_WFE_THRESHOLD = 500
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
3|2338277|2|10052592
1|6963131|1|18103639

3|2337982|2|10037188
1|6979396|1|18082811

3|2337282|2|10052184
0|6954990|1|18109860

3|2338737|2|10039556
1|7185046|1|17809240

4|2338857|2|10027407
1|6958274|1|18111394

4|2340208|2|10031173
0|7097088|1|17921861
-
=

CPU_RELAX_WFE_THRESHOLD = 50
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
4|1219792|2|18005180
0|1252767|1|25296935

4|1219312|2|18049566
1|1252625|1|25227292

4|1219884|2|18020775
1|1252363|1|25298387

4|1220862|2|18012062
1|1251827|1|25283787

4|1220489|2|18010055
0|1251729|1|25272917

3|1220088|2|18027279
0|1253264|1|25268834
-
=

CPU_RELAX_WFE_THRESHOLD = 10
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
3|298604|1|23784805
0|293511|1|24604172

3|294707|2|23857487
0|292519|1|24564209

4|294199|1|23832180
0|293840|1|24593323

4|294314|1|23853353
0|293609|1|24635190

4|293802|1|23836764
0|293322|1|24553212

3|293658|1|23889801
0|292663|1|24552118
-
=

CPU_RELAX_WFE_THRESHOLD = 5
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
3|173061|1|22332479
0|173759|1|23774009

3|174471|1|22342362
0|173161|1|23814466

3|173851|2|22235422
0|172734|1|23705848

2|173452|1|22255166
0|172830|1|23824301

2|173028|1|22390297
0|172336|1|23836407

3|172968|1|22285954
0|173207|1|23844900
-
=

CPU_RELAX_WFE_THRESHOLD = 1
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
2|64245|1|6266848
0|77117|1|209

Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-08-25 Thread Vikram Mulukutla



On 2017-08-25 12:48, Vikram Mulukutla wrote:



If I understand the code correctly, the upper 32 bits of an ARM64 
virtual
address will overflow when 1 is added to it, and so we'll keep WFE'ing 
on

every subsequent cpu_relax invoked from the same PC, until we cross the
hard-coded threshold, right?



Oops, misread that. The second time we enter cpu_relax from the same PC, we
do a WFE. Then we stop doing the WFE until we hit the threshold using the
per-cpu counter. So with a higher threshold, we wait for more cpu_relax()
calls before starting the WFE again.

So a lower threshold implies we should hit the WFE branch sooner. It seems
that since my test keeps the while loop going for a full 5 seconds, a lower
threshold will obviously result in more WFEs and a lower
lock-acquired-count.


I guess we want a high threshold but not so high that the little CPU has
to wait a while before the big CPU counts up to the threshold, is that 
correct?


Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-08-03 Thread Vikram Mulukutla

Hi Will,

On 2017-07-31 06:13, Will Deacon wrote:

Hi Vikram,

On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:

On 2017-07-28 02:28, Will Deacon wrote:
>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:



>
This does seem to help. Here's some data after 5 runs with and without 
the

patch.


Blimey, that does seem to make a difference. Shame it's so ugly! Would 
you
be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I 
had
it set to 1 in the diff I posted, but that might be higher than 
optimal.

It would be interested to see if it correlates with num_possible_cpus()
for the highly contended case.

Will


Sorry for the late response - I should hopefully have some more data 
with

different thresholds before the week is finished or on Monday.

Thanks,
Vikram
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-08-03 Thread Vikram Mulukutla


Hi Qiao,


On 2017-08-01 00:37, qiaozhou wrote:

On 2017年07月31日 19:20, qiaozhou wrote:








=
Also applied Vikram's patch and ran a test.


cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
Without cpu_relax bodging patch
=
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==|==|===|==|
 16505|  5243|  2|  12487322|
 16494|  5619|  1|  12013291|
 16498|  5276|  2|  11706824|
 16494|  7123|  1|  12532355|
 16470|  7208|  2|  11784617|
=

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
With cpu_relax bodging patch:
=
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==|==|===|==|
  3991|140714|  1|  11430528|
  4018|144371|  1|  11430528|
  4034|143250|  1|  11427011|
  4330|147345|  1|  11423583|
  4752|138273|  1|  11433241|
=

It shows some improvement, but not as good as Vikram's data. The big
core still has a much better chance of acquiring the lock.


Thanks,
Vikram



Thanks for your data! I'll check on one of our other platforms to see
if I see similar behavior. This may have something to do with the
event-stream on your platform or the A53 revision as Sudeep pointed
out here [1] - something to check I suppose...

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/21/458

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-07-28 Thread Vikram Mulukutla

On 2017-07-28 02:28, Will Deacon wrote:

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:






I think we should have this discussion now - I brought this up earlier 
[1]
and I promised a test case that I completely forgot about - but here 
it

is (attached). Essentially a Big CPU in an acquire-check-release loop
will have an unfair advantage over a little CPU concurrently 
attempting
to acquire the same lock, in spite of the ticket implementation. If 
the Big

CPU needs the little CPU to make forward progress : livelock.







One solution was to use udelay(1) in such loops instead of 
cpu_relax(), but
that's not very 'relaxing'. I'm not sure if there's something we could 
do

within the ticket spin-lock implementation to deal with this.


Does bodging cpu_relax to back-off to wfe after a while help? The event
stream will wake it up if nothing else does. Nasty patch below, but I'd 
be

interested to know whether or not it helps.

Will

This does seem to help. Here's some data after 5 runs with and without 
the patch.


time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
  117893us|   2349144|2us|   6748236|
  571260us|   2125651|2us|   7643264|
   19780us|   2392770|2us|   5987203|
   19948us|   2395413|2us|   5977286|
   19822us|   2429619|2us|   5768252|
   19888us|   2444940|2us|   5675657|
=

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
   3us|   2737438|2us|   6907147|
   2us|   2742478|2us|   6902241|
 132us|   2745636|2us|   6876485|
   3us|   2744554|2us|   6898048|
   3us|   2741391|2us|   6882901|
=

The patch also seems to have helped with fairness in general
allowing more work to be done if the CPU frequencies are more
closely matched (I don't know if this translates to real world
performance - probably not). The counter values are higher
with the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
   2us|   5240654|1us|   5339009|
   2us|   5287797|   97us|   5327073|
   2us|   5237634|1us|   5334694|
   2us|   5236676|   88us|   5333582|
  84us|   5285880|   84us|   5329489|
=

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==|==|===|==|
 140us|  10449121|1us|  11154596|
   1us|  10757081|1us|  11479395|
  83us|  10237109|1us|  10902557|
   2us|   9871101|1us|  10514313|
   2us|   9758763|1us|  10391849|
=


Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-07-28 Thread Vikram Mulukutla

On 2017-07-28 02:28, Peter Zijlstra wrote:

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

I think we should have this discussion now - I brought this up earlier 
[1]
and I promised a test case that I completely forgot about - but here 
it

is (attached). Essentially a Big CPU in an acquire-check-release loop
will have an unfair advantage over a little CPU concurrently 
attempting
to acquire the same lock, in spite of the ticket implementation. If 
the Big

CPU needs the little CPU to make forward progress : livelock.


This needs to be fixed in hardware. There really isn't anything the
software can sanely do about it.

It also doesn't have anything to do with the spinlock implementation.
Ticket or not, its a fundamental problem of LL/SC. Any situation where
we use atomics for fwd progress guarantees this can happen.



Agreed, it seems like trying to build a fair SW protocol over unfair HW.
But if we can minimally change such loop constructs to address this (all
instances I've seen so far use cpu_relax), it would save a lot of hours
spent debugging these problems. Lots of b.L devices out there :-)

It's also possible that such a workaround may help contention performance,
since the big CPU may have to wait for, say, a tick before breaking out of
that loop (the non-livelock scenario where the entire loop isn't in a
critical section).


The little core (or really any core) should hold on to the locked
cacheline for a while and not insta relinquish it. Giving it a chance 
to

reach the SC.


Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

2017-07-27 Thread Vikram Mulukutla



cc: Sudeep Holla

On 2017-07-26 18:29, qiaozhou wrote:

On 2017年07月26日 22:16, Thomas Gleixner wrote:

On Wed, 26 Jul 2017, qiaozhou wrote:

Cc'ed ARM folks.







For that particular timer case we can clear base->running_timer w/o 
the

lock held (see patch below), but this kind of

  lock -> test -> unlock -> retry

loops are all over the place in the kernel, so this is going to hurt 
you

sooner than later in some other place.

It's true. This is the way spinlock is used normally and widely in
kernel. I'll also ask ARM experts whether we can do something to avoid
or reduce the chance of such issue. ARMv8.1 has one single
instruction(ldadda) to replace the ldaxr/stxr loop. Hope it can
improve and reduce the chance.


I think we should have this discussion now - I brought this up earlier [1]
and I promised a test case that I completely forgot about - but here it
is (attached). Essentially a Big CPU in an acquire-check-release loop
will have an unfair advantage over a little CPU concurrently attempting
to acquire the same lock, in spite of the ticket implementation. If the Big
CPU needs the little CPU to make forward progress: livelock.

We've run into the same loop construct in other spots in the kernel, and
the reason that a real symptom is so rare is that the retry-loop on the 'Big'
CPU needs to be interrupted just once by, say, an IRQ/FIQ and the live-lock
is broken. If the entire retry loop is within an interrupt-disabled critical
section, the odds of live-locking are much higher.
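
To make the loop construct concrete, here is a stripped-down sketch of what
the attached module does (illustrative only, not test.c verbatim):

#include <linux/spinlock.h>
#include <linux/ktime.h>
#include <linux/timekeeping.h>

static DEFINE_SPINLOCK(contended_lock);

/* "Big" CPU side: acquire/check/release with interrupts disabled throughout. */
static void big_cpu_hammer(void)
{
        unsigned long flags;
        ktime_t end = ktime_add_ms(ktime_get(), 1000);

        local_irq_save(flags);
        while (ktime_before(ktime_get(), end)) {
                spin_lock(&contended_lock);
                /* check some condition here; empty in the test */
                spin_unlock(&contended_lock);
                cpu_relax();    /* the big CPU usually wins the line right back */
        }
        local_irq_restore(flags);
}

/* "Little" CPU side: one attempt; the test loops this, re-enabling interrupts
 * between attempts. */
static void little_cpu_attempt(void)
{
        unsigned long flags;

        local_irq_save(flags);
        spin_lock(&contended_lock);     /* can stall for a very long time */
        spin_unlock(&contended_lock);
        local_irq_restore(flags);
}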

An example of the problem on a previous kernel is here [2]. Changes to 
the

workqueue code since may have fixed this particular instance.

One solution was to use udelay(1) in such loops instead of cpu_relax(), 
but
that's not very 'relaxing'. I'm not sure if there's something we could 
do

within the ticket spin-lock implementation to deal with this.

Note that I ran my test on a 4.9 kernel, so that didn't include any spinlock
implementation changes since then. The test schedules two threads, one on
a big CPU and one on a little CPU. The big CPU thread does the
lock/unlock/retry loop for a full 1 second with interrupts disabled, while
the little CPU attempts to acquire the same lock, enabling interrupts after
every successful lock+unlock. With unfairness, the little CPU may take up to
1 second (or several milliseconds at the least) just to acquire the lock
once. This varies depending on the IPC difference and frequencies of the big
and little ARM64 CPUs:

Big cpu frequency | Little cpu frequency | Max time taken by little to acquire lock
2GHz              | 1.5GHz               | 133 microseconds
2GHz              | 300MHz               | 734 milliseconds

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/17/934
[2] - https://goo.gl/uneFjt

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

From 51d6186b620a9e354a0d40af06aef1c1299ca223 Mon Sep 17 00:00:00 2001
From: Vikram Mulukutla 
Date: Thu, 27 Jul 2017 12:14:48 -0700
Subject: [PATCH] measure spinlock fairness across differently capable CPUs

How to run this test:
1) compile and boot
2) echo 1 > /sys/module/test/parameters/run_test
3) sleep 5
4) echo 0 > /sys/module/test/parameters/run_test

The test schedules two threads on two separate CPUs
and attempts to acquire the same spinlock from both
threads with a certain loop construct.
(it assumes cpu0 is 'Little' and cpu4 is 'Big'. This
can be changed in the code)

After running the test, check these counters:
cat /sys/module/test/parameters/big_time_us
cat /sys/module/test/parameters/little_time_us

If unfairness exists, little_time should be close to 1 second or
a fairly large millisecond value.
test.c has comments that explain why this is so.

Signed-off-by: Vikram Mulukutla 
---
 kernel/sched/Makefile |   2 +-
 kernel/sched/test.c   | 204 ++
 2 files changed, 205 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/test.c

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index f6cce95..542a1c7 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -15,7 +15,7 @@ ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 CFLAGS_core.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
-obj-y += core.o loadavg.o clock.o cputime.o
+obj-y += core.o loadavg.o clock.o cputime.o test.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o swait.o completion.o idle.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o energy.o sched_avg.o
diff --git a/kernel/sched/test.c b/kernel/sched/test.c
new file mode 100644
index 000..5dd3b0d
--- /dev/null
+++ b/kernel/sched/test.c
@@ -0,0 +1,204 @@
+/* Copyright (c) 2014-2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of