Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps
Sorry for the delay. Overlooked this comment... On Tue, Jul 24, 2018 at 8:49 AM, Patrick Bellasi wrote: > On 24-Jul 08:28, Suren Baghdasaryan wrote: > > Hi Patrick. Thanks for the explanation and links. No more questions > > from me on this one :) > > No problems at all! > > The important question is instead: does it makes sense for you too? > Well, it still feels unnatural to me due to the definition of the boost (at least this much CPU bandwidth but higher should be fine). Say I have a task which normally has specific boost and clamp requirements (say TG.UCLAMP_MIN=20, TG.UCLAMP_MAX=80) which I want to temporarily boost using a syscall to UCLAMP_MIN=60 (let's say a process should handle some request and temporarily needs more CPU bandwidth). With this interface we can clamp more than TG.UCLAMP_MAX value but we can't boost more than TG.UCLAMP_MIN. For this usecase I would have to set TG.UCLAMP_MIN=80, then use syscall to set SYSCALL.UCLAMP_MIN=20 to get effective UCLAMP_MIN=20 and then set SYSCALL.UCLAMP_MIN=60 when I need that temporary boost. To summarize, while this API does not stop me from achieving the desired result it requires some hoop-jumping :) > > I think the important bits are that we are all on the same page about > the end goals and features we like to have as well as the interface we use. > This last has to fits best our goals and features while still being > perfectly aligned with the frameworks we are integrating into... and > that's still under discussion with Tejun on PATCH 08/12. > > Thanks again for your review! > > Cheers Patrick > > -- > #include > > Patrick Bellasi >
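For reference, the aggregation rule being discussed and the workaround described above can be condensed into a few lines. This is a standalone illustration of the v2 behaviour, where the task-group value restricts the task value for both clamps; the numbers are the ones from the scenario above, nothing here is real kernel or syscall code:

/*
 * Illustration of min()-restriction for both clamps, as implemented in
 * this series; not kernel code.
 */
#include <stdio.h>

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int tg_min = 20, tg_max = 80;	/* cgroup settings */
	unsigned int task_min = 60;		/* temporary boost via syscall */
	unsigned int task_max = 80;

	/* v2 rule: the TG value restricts the task value for both clamps */
	printf("effective min=%u max=%u\n",
	       min_u(task_min, tg_min),		/* 20: the boost is ignored */
	       min_u(task_max, tg_max));	/* 80 */

	/* Workaround described above: raise TG.UCLAMP_MIN to 80 and use the
	 * syscall value to "give back" what is not needed. */
	tg_min = 80;
	printf("workaround: normal min=%u, boosted min=%u\n",
	       min_u(20, tg_min),		/* 20: the default request */
	       min_u(task_min, tg_min));	/* 60: the boost now works */
	return 0;
}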
Re: [PATCH 0/10] psi: pressure stall information for CPU, memory, and IO v2
On Thu, Jul 26, 2018 at 1:07 PM, Johannes Weiner wrote: > On Thu, Jul 26, 2018 at 11:07:32AM +1000, Singh, Balbir wrote: >> On 7/25/18 1:15 AM, Johannes Weiner wrote: >> > On Tue, Jul 24, 2018 at 07:14:02AM +1000, Balbir Singh wrote: >> >> Does the mechanism scale? I am a little concerned about how frequently >> >> this infrastructure is monitored/read/acted upon. >> > >> > I expect most users to poll in the frequency ballpark of the running >> > averages (10s, 1m, 5m). Our OOMD defaults to 5s polling of the 10s >> > average; we collect the 1m average once per minute from our machines >> > and cgroups to log the system/workload health trends in our fleet. >> > >> > Suren has been experimenting with adaptive polling down to the >> > millisecond range on Android. >> > >> >> I think this is a bad way of doing things, polling only adds to >> overheads, there needs to be an event driven mechanism and the >> selection of the events need to happen in user space. > > Of course, I'm not saying you should be doing this, and in fact Suren > and I were talking about notification/event infrastructure. I implemented a psi-monitor prototype which allows userspace to specify the max PSI stall it can tolerate (in terms of % of time spent on memory management). When that threshold is breached an event to userspace is generated. I'm still testing it but early results look promising. I'm planning to send it upstream when it's ready and after the main PSI patchset is merged. > > You asked if this scales and I'm telling you it's not impossible to > read at such frequencies. > Yes it's doable. One usecase might be to poll at a higher rate for a short period of time immediately after the initial event is received to clarify the short-term signal dynamics. > Maybe you can clarify your question. > >> >> Why aren't existing mechanisms sufficient >> > >> > Our existing stuff gives a lot of indication when something *may* be >> > an issue, like the rate of page reclaim, the number of refaults, the >> > average number of active processes, one task waiting on a resource. >> > >> > But the real difference between an issue and a non-issue is how much >> > it affects your overall goal of making forward progress or reacting to >> > a request in time. And that's the only thing users really care >> > about. It doesn't matter whether my system is doing 2314 or 6723 page >> > refaults per minute, or scanned 8495 pages recently. I need to know >> > whether I'm losing 1% or 20% of my time on overcommitted memory. >> > >> > Delayacct is time-based, so it's a step in the right direction, but it >> > doesn't aggregate tasks and CPUs into compound productivity states to >> > tell you if only parts of your workload are seeing delays (which is >> > often tolerable for the purpose of ensuring maximum HW utilization) or >> > your system overall is not making forward progress. That aggregation >> > isn't something you can do in userspace with polled delayacct data. >> >> By aggregation you mean cgroup aggregation? > > System-wide and per cgroup.
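For illustration, the kind of userspace polling described above amounts to something like the sketch below. The /proc/pressure/memory path and the "some avg10=..." line format are taken from the PSI series under review and may well change before merge; the threshold and the reaction are placeholders:

/*
 * Userspace polling sketch, not part of the patchset. Assumes the
 * /proc/pressure/memory file and its "some avg10=..." line format from
 * the PSI series under review.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const double threshold = 10.0;	/* tolerate at most 10% memory stall */
	char line[256];
	double avg10;
	FILE *f;

	for (;;) {
		f = fopen("/proc/pressure/memory", "r");
		if (!f)
			return 1;
		/* First line: "some avg10=X.XX avg60=X.XX avg300=X.XX total=N" */
		if (fgets(line, sizeof(line), f) &&
		    sscanf(line, "some avg10=%lf", &avg10) == 1 &&
		    avg10 > threshold) {
			/* react: kill a background app, throttle work, ... */
			fprintf(stderr, "memory pressure %.2f%%\n", avg10);
		}
		fclose(f);
		sleep(1);	/* crude polling; the psi-monitor prototype
				 * mentioned above turns this into an event */
	}
}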
Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi wrote: > Utilization clamping requires each CPU to know which clamp values are > assigned to tasks that are currently RUNNABLE on that CPU. > Multiple tasks can be assigned the same clamp value and tasks with > different clamp values can be concurrently active on the same CPU. > Thus, a proper data structure is required to support a fast and > efficient aggregation of the clamp values required by the currently > RUNNABLE tasks. > > For this purpose we use a per-CPU array of reference counters, > where each slot is used to account how many tasks require a certain > clamp value are currently RUNNABLE on each CPU. > Each clamp value corresponds to a "clamp index" which identifies the > position within the array of reference couters. > > : >(user-space changes) : (kernel space / scheduler) > : > SLOW PATH : FAST PATH > : > task_struct::uclamp::value : sched/core::enqueue/dequeue > : cpufreq_schedutil > : > ++++ +---+ > | TASK || CLAMP GROUP| |CPU CLAMPS | > ++++ +---+ > ||| clamp_{min,max} | | clamp_{min,max} | > | util_{min,max} || se_count | |tasks count| > ++++ +---+ > : >+--> : +---> > group_id = map(clamp_value) : ref_count(group_id) > : > : > > Let's introduce the support to map tasks to "clamp groups". > Specifically we introduce the required functions to translate a > "clamp value" into a clamp's "group index" (group_id). > > Only a limited number of (different) clamp values are supported since: > 1. there are usually only few classes of workloads for which it makes >sense to boost/limit to different frequencies, >e.g. background vs foreground, interactive vs low-priority > 2. it allows a simpler and more memory/time efficient tracking of >the per-CPU clamp values in the fast path. > > The number of possible different clamp values is currently defined at > compile time. Thus, setting a new clamp value for a task can result into > a -ENOSPC error in case this will exceed the number of maximum different > clamp values supported. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Paul Turner > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Juri Lelli > Cc: Dietmar Eggemann > Cc: Morten Rasmussen > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > include/linux/sched.h | 15 ++- > init/Kconfig | 22 > kernel/sched/core.c | 300 +- > 3 files changed, 330 insertions(+), 7 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index fd8495723088..0635e8073cd3 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -578,6 +578,19 @@ struct sched_dl_entity { > struct hrtimer inactive_timer; > }; > > +/** > + * Utilization's clamp group > + * > + * A utilization clamp group maps a "clamp value" (value), i.e. > + * util_{min,max}, to a "clamp group index" (group_id). > + */ > +struct uclamp_se { > + /* Utilization constraint for tasks in this group */ > + unsigned int value; > + /* Utilization clamp group for this constraint */ > + unsigned int group_id; > +}; > + > union rcu_special { > struct { > u8 blocked; > @@ -662,7 +675,7 @@ struct task_struct { > > #ifdef CONFIG_UCLAMP_TASK > /* Utlization clamp values for this task */ > - int uclamp[UCLAMP_CNT]; > + struct uclamp_seuclamp[UCLAMP_CNT]; > #endif > > #ifdef CONFIG_PREEMPT_NOTIFIERS > diff --git a/init/Kconfig b/init/Kconfig > index 1d45a6877d6f..0a377ad7c166 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -601,7 +601,29 @@ config UCLAMP_TASK > > If in doubt, say N. 
> > +config UCLAMP_GROUPS_COUNT > + int "Number of different utilization clamp values supported" > + range 0 127 > + default 2 > + depends on UCLAMP_TASK > + help > + This defines the maximum number of different utilization clamp > + values which can be concurrently enforced for each utilization > + clamp index (i.e. minimum and maximum utilization). > + > + Only a limited number of clamp values are supported because: > + 1. there are usually only few classes of workloads for which it > +
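In data-structure terms, the diagram in the changelog above boils down to the following split between the sched_setattr() slow path and the enqueue/dequeue fast path. This is a reduced sketch that follows the patch's field names but omits locking and accessors:

/*
 * Reduced sketch of the mapping described above; field names follow the
 * patch, everything else is omitted.
 */

/* Slow path: per-task clamp request, set via sched_setattr() */
struct uclamp_se {
	unsigned int value;		/* requested util_{min,max} */
	unsigned int group_id;		/* clamp group tracking that value */
};

/* Global map: clamp value <-> clamp group, one array per clamp index */
struct uclamp_map {
	int value;			/* clamp value tracked by this group */
	int se_count;			/* scheduling entities using it */
};

/* Fast path: per-CPU, per clamp index, refcounted at enqueue/dequeue */
struct uclamp_group {
	int value;			/* clamp value of this group */
	int tasks;			/* RUNNABLE tasks refcounting it */
};

/*
 * slow path:  group_id = map(clamp_value)           (task_struct::uclamp)
 * fast path:  rq->uclamp.group[clamp_id][group_id].tasks++ / --
 */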
Re: [PATCH v2 03/12] sched/core: uclamp: add CPU's clamp groups accounting
Hi Patrick, On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi wrote: > Utilization clamping allows to clamp the utilization of a CPU within a > [util_min, util_max] range. This range depends on the set of currently > RUNNABLE tasks on a CPU, where each task references two "clamp groups" > defining the util_min and the util_max clamp values to be considered for > that task. The clamp value mapped by a clamp group applies to a CPU only > when there is at least one task RUNNABLE referencing that clamp group. > > When tasks are enqueued/dequeued on/from a CPU, the set of clamp groups > active on that CPU can change. Since each clamp group enforces a > different utilization clamp value, once the set of these groups changes > it can be required to re-compute what is the new "aggregated" clamp > value to apply on that CPU. > > Clamp values are always MAX aggregated for both util_min and util_max. > This is to ensure that no tasks can affect the performance of other > co-scheduled tasks which are either more boosted (i.e. with higher > util_min clamp) or less capped (i.e. with higher util_max clamp). > > Here we introduce the required support to properly reference count clamp > groups at each task enqueue/dequeue time. > > Tasks have a: >task_struct::uclamp::group_id[clamp_idx] > indexing, for each clamp index (i.e. util_{min,max}), the clamp group in > which they should refcount at enqueue time. > > CPUs rq have a: >rq::uclamp::group[clamp_idx][group_idx].tasks > which is used to reference count how many tasks are currently RUNNABLE on > that CPU for each clamp group of each clamp index.. > > The clamp value of each clamp group is tracked by > rq::uclamp::group[][].value, thus making rq::uclamp::group[][] an > unordered array of clamp values. However, the MAX aggregation of the > currently active clamp groups is implemented to minimize the number of > times we need to scan the complete (unordered) clamp group array to > figure out the new max value. This operation indeed happens only when we > dequeue last task of the clamp group corresponding to the current max > clamp, and thus the CPU is either entering IDLE or going to schedule a > less boosted or more clamped task. > Moreover, the expected number of different clamp values, which can be > configured at build time, is usually so small that a more advanced > ordering algorithm is not needed. In real use-cases we expect less then > 10 different values. 
> > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Paul Turner > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Juri Lelli > Cc: Dietmar Eggemann > Cc: Morten Rasmussen > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > kernel/sched/core.c | 188 +++ > kernel/sched/fair.c | 4 + > kernel/sched/rt.c| 4 + > kernel/sched/sched.h | 71 > 4 files changed, 267 insertions(+) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 50e749067df5..d1969931fea6 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -848,9 +848,19 @@ static inline void uclamp_group_init(int clamp_id, int > group_id, > unsigned int clamp_value) > { > struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; > + struct uclamp_cpu *uc_cpu; > + int cpu; > > + /* Set clamp group map */ > uc_map[group_id].value = clamp_value; > uc_map[group_id].se_count = 0; > + > + /* Set clamp groups on all CPUs */ > + for_each_possible_cpu(cpu) { > + uc_cpu = &cpu_rq(cpu)->uclamp; > + uc_cpu->group[clamp_id][group_id].value = clamp_value; > + uc_cpu->group[clamp_id][group_id].tasks = 0; > + } > } > > /** > @@ -906,6 +916,172 @@ uclamp_group_find(int clamp_id, unsigned int > clamp_value) > return group_id; > } > > +/** > + * uclamp_cpu_update: updates the utilization clamp of a CPU > + * @cpu: the CPU which utilization clamp has to be updated > + * @clamp_id: the clamp index to update > + * > + * When tasks are enqueued/dequeued on/from a CPU, the set of currently > active > + * clamp groups is subject to change. Since each clamp group enforces a > + * different utilization clamp value, once the set of these groups changes it > + * can be required to re-compute what is the new clamp value to apply for > that > + * CPU. > + * > + * For the specified clamp index, this method computes the new CPU > utilization > + * clamp to use until the next change on the set of RUNNABLE tasks on that > CPU. > + */ > +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id) > +{ > + struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0]; > + int max_value = UCLAMP_NONE; > + unsigned int group_id; > + > + for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; > ++group_id) { > + /* Ignore inactive clamp groups, i.e. no
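For reference, the MAX aggregation the changelog describes (the quoted uclamp_cpu_update() above is cut off mid-loop) reduces to a scan like the following; a simplified sketch, not the exact patch code:

/*
 * Simplified sketch of the MAX aggregation: scan the (unordered) clamp
 * groups of one clamp index and keep the highest value among the groups
 * that currently have RUNNABLE tasks. Types and names are reduced for
 * illustration.
 */
struct group { int tasks; int value; };

static int cpu_clamp_update(struct group *grp, int ngroups, int capacity_scale)
{
	int max_value = -1;		/* "no active clamp group" marker */
	int id;

	for (id = 0; id < ngroups; id++) {
		/* Ignore inactive groups: no RUNNABLE task refcounts them */
		if (!grp[id].tasks)
			continue;
		if (grp[id].value > max_value)
			max_value = grp[id].value;
		/* Stop early once the maximum possible clamp is reached */
		if (max_value >= capacity_scale)
			break;
	}
	return max_value;
}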
Re: [PATCH v2 02/12] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups
Hi Patrick, On Fri, Jul 20, 2018 at 8:11 AM, Patrick Bellasi wrote: > Hi Suren, > thanks for the review, all good point... some more comments follow > inline. > > On 19-Jul 16:51, Suren Baghdasaryan wrote: >> On Mon, Jul 16, 2018 at 1:28 AM, Patrick Bellasi >> wrote: > > [...] > >> > +/** >> > + * uclamp_group_available: checks if a clamp group is available >> > + * @clamp_id: the utilization clamp index (i.e. min or max clamp) >> > + * @group_id: the group index in the given clamp_id >> > + * >> > + * A clamp group is not free if there is at least one SE which is sing a >> > clamp >> >> Did you mean to say "single clamp"? > > No, it's "...at least one SE which is USING a clamp value..." > >> > + * value mapped on the specified clamp_id. These SEs are reference >> > counted by >> > + * the se_count of a uclamp_map entry. >> > + * >> > + * Return: true if there are no SE's mapped on the specified clamp >> > + * index and group >> > + */ >> > +static inline bool uclamp_group_available(int clamp_id, int group_id) >> > +{ >> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; >> > + >> > + return (uc_map[group_id].value == UCLAMP_NONE); >> >> The usage of UCLAMP_NONE is very confusing to me. It was not used at >> all in the patch where it was introduced [1/12], here it's used as a >> clamp value and in uclamp_group_find() it's used as group_id. Please >> clarify the usage. > > Yes, it's meant to represent a "clamp not valid" condition, whatever > it's a "clamp group" or a "clamp value"... perhaps the name can be > improved. > >> I also feel UCLAMP_NONE does not really belong to >> the uclamp_id enum because other elements there are indexes in >> uclamp_maps and this one is a special value. > > Right, it looks a bit misplaced, I agree. I think I tried to set it > using a #define but there was some issues I don't remember now... > Anyway, I'll give it another go... > > >> IMHO if both *group_id* >> and *value* need a special value (-1) to represent >> unused/uninitialized entry it would be better to use different >> constants. Maybe UCLAMP_VAL_NONE and UCLAMP_GROUP_NONE? > > Yes, maybe we can use a > >#define UCLAMP_NOT_VALID -1 > > and get rid the confusing enum entry. > > Will update it on v3. > Sounds good to me. >> > +} > > [...] > >> > +/** >> > + * uclamp_group_find: finds the group index of a utilization clamp group >> > + * @clamp_id: the utilization clamp index (i.e. min or max clamping) >> > + * @clamp_value: the utilization clamping value lookup for >> > + * >> > + * Verify if a group has been assigned to a certain clamp value and return >> > + * its index to be used for accounting. >> > + * >> > + * Since only a limited number of utilization clamp groups are allowed, >> > if no >> > + * groups have been assigned for the specified value, a new group is >> > assigned >> > + * if possible. Otherwise an error is returned, meaning that an >> > additional clamp >> > + * value is not (currently) supported. 
>> > + */ >> > +static int >> > +uclamp_group_find(int clamp_id, unsigned int clamp_value) >> > +{ >> > + struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0]; >> > + int free_group_id = UCLAMP_NONE; >> > + unsigned int group_id = 0; >> > + >> > + for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) { >> > + /* Keep track of first free clamp group */ >> > + if (uclamp_group_available(clamp_id, group_id)) { >> > + if (free_group_id == UCLAMP_NONE) >> > + free_group_id = group_id; >> > + continue; >> > + } >> > + /* Return index of first group with same clamp value */ >> > + if (uc_map[group_id].value == clamp_value) >> > + return group_id; >> > + } >> > + /* Default to first free clamp group */ >> > + if (group_id > CONFIG_UCLAMP_GROUPS_COUNT) >> >> Is the condition above needed? I think it's always true if you got here. >> Also AFAIKT after the for loop you can just do: >> >> return (free_group_id != UCLAMP_NONE) ? free_group_i
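The quoted suggestion is cut off above; its shape would be roughly the following, reusing the helpers from the quoted patch and the UCLAMP_NOT_VALID name proposed for v3. A sketch of the suggested simplification, not reviewed code:

/*
 * Sketch only: after the lookup loop, either a free slot was found or we
 * are out of groups, so the extra range check after the loop is redundant.
 * Reuses uclamp_maps/uclamp_group_available() from the quoted patch.
 */
static int uclamp_group_find_sketch(int clamp_id, unsigned int clamp_value)
{
	struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
	int free_group_id = UCLAMP_NOT_VALID;
	unsigned int group_id;

	for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; group_id++) {
		/* Keep track of the first free clamp group */
		if (uclamp_group_available(clamp_id, group_id)) {
			if (free_group_id == UCLAMP_NOT_VALID)
				free_group_id = group_id;
			continue;
		}
		/* Reuse the group already tracking this clamp value */
		if (uc_map[group_id].value == clamp_value)
			return group_id;
	}

	/* Either the first free group, or no group is available */
	return free_group_id != UCLAMP_NOT_VALID ? free_group_id : -ENOSPC;
}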
Re: [PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX
Hi Patrick, On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi wrote: > When a util_max clamped task sleeps, its clamp constraints are removed > from the CPU. However, the blocked utilization on that CPU can still be > higher than the max clamp value enforced while that task was running. > This max clamp removal when a CPU is going to be idle could thus allow > unwanted CPU frequency increases, right while the task is not running. > > This can happen, for example, where there is another (smaller) task > running on a different CPU of the same frequency domain. > In this case, when we aggregates the utilization of all the CPUs in a typo: we aggregate > shared frequency domain, schedutil can still see the full non clamped > blocked utilization of all the CPUs and thus eventually increase the > frequency. > > Let's fix this by using: > >uclamp_cpu_put_id(UCLAMP_MAX) > uclamp_cpu_update(last_clamp_value) > > to detect when a CPU has no more RUNNABLE clamped tasks and to flag this > condition. Thus, while a CPU is idle, we can still enforce the last used > clamp value for it. > > To the contrary, we do not track any UCLAMP_MIN since, while a CPU is > idle, we don't want to enforce any minimum frequency > Indeed, we relay just on blocked load decay to smoothly reduce the typo: We rely > frequency. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Rafael J. Wysocki > Cc: Viresh Kumar > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Juri Lelli > Cc: Dietmar Eggemann > Cc: Morten Rasmussen > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > kernel/sched/core.c | 30 ++ > kernel/sched/sched.h | 2 ++ > 2 files changed, 28 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index b2424eea7990..0cb6e0aa4faa 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -930,7 +930,8 @@ uclamp_group_find(int clamp_id, unsigned int clamp_value) > * For the specified clamp index, this method computes the new CPU > utilization > * clamp to use until the next change on the set of RUNNABLE tasks on that > CPU. > */ > -static inline void uclamp_cpu_update(struct rq *rq, int clamp_id) > +static inline void uclamp_cpu_update(struct rq *rq, int clamp_id, > +unsigned int last_clamp_value) > { > struct uclamp_group *uc_grp = &rq->uclamp.group[clamp_id][0]; > int max_value = UCLAMP_NONE; > @@ -948,6 +949,19 @@ static inline void uclamp_cpu_update(struct rq *rq, int > clamp_id) > if (max_value >= SCHED_CAPACITY_SCALE) > break; > } > + > + /* > +* Just for the UCLAMP_MAX value, in case there are no RUNNABLE > +* task, we keep the CPU clamped to the last task's clamp value. > +* This avoids frequency spikes to MAX when one CPU, with an high > +* blocked utilization, sleeps and another CPU, in the same frequency > +* domain, do not see anymore the clamp on the first CPU. 
> +*/ > + if (clamp_id == UCLAMP_MAX && max_value == UCLAMP_NONE) { > + rq->uclamp.flags |= UCLAMP_FLAG_IDLE; > + max_value = last_clamp_value; > + } > + > rq->uclamp.value[clamp_id] = max_value; > } > > @@ -977,13 +991,21 @@ static inline void uclamp_cpu_get_id(struct task_struct > *p, > uc_grp = &rq->uclamp.group[clamp_id][0]; > uc_grp[group_id].tasks += 1; > > + /* Force clamp update on idle exit */ > + uc_cpu = &rq->uclamp; > + clamp_value = p->uclamp[clamp_id].value; > + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) { The condition below is not needed because UCLAMP_FLAG_IDLE is set only for UCLAMP_MAX clamp_id, therefore the above condition already covers the one below. > + if (clamp_id == UCLAMP_MAX) > + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE; > + uc_cpu->value[clamp_id] = clamp_value; > + return; > + } > + > /* > * If this is the new max utilization clamp value, then we can update > * straight away the CPU clamp value. Otherwise, the current CPU clamp > * value is still valid and we are done. > */ > - uc_cpu = &rq->uclamp; > - clamp_value = p->uclamp[clamp_id].value; > if (uc_cpu->value[clamp_id] < clamp_value) > uc_cpu->value[clamp_id] = clamp_value; > } > @@ -1028,7 +1050,7 @@ static inline void uclamp_cpu_put_id(struct task_struct > *p, > uc_cpu = &rq->uclamp; > clamp_value = uc_grp[group_id].value; > if (clamp_value >= uc_cpu->value[clamp_id]) > - uclamp_cpu_update(rq, clamp_id); > + uclamp_cpu_update(rq, clamp_id, clamp_value); > } > > /** > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 1207add36478..7e4f10c507b7 100644 > --- a/kernel/sched/sch
Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller
On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi wrote: > The cgroup's CPU controller allows to assign a specified (maximum) > bandwidth to the tasks of a group. However this bandwidth is defined and > enforced only on a temporal base, without considering the actual > frequency a CPU is running on. Thus, the amount of computation completed > by a task within an allocated bandwidth can be very different depending > on the actual frequency the CPU is running that task. > The amount of computation can be affected also by the specific CPU a > task is running on, especially when running on asymmetric capacity > systems like Arm's big.LITTLE. > > With the availability of schedutil, the scheduler is now able > to drive frequency selections based on actual task utilization. > Moreover, the utilization clamping support provides a mechanism to > bias the frequency selection operated by schedutil depending on > constraints assigned to the tasks currently RUNNABLE on a CPU. > > Give the above mechanisms, it is now possible to extend the cpu > controller to specify what is the minimum (or maximum) utilization which > a task is expected (or allowed) to generate. > Constraints on minimum and maximum utilization allowed for tasks in a > CPU cgroup can improve the control on the actual amount of CPU bandwidth > consumed by tasks. > > Utilization clamping constraints are useful not only to bias frequency > selection, when a task is running, but also to better support certain > scheduler decisions regarding task placement. For example, on > asymmetric capacity systems, a utilization clamp value can be > conveniently used to enforce important interactive tasks on more capable > CPUs or to run low priority and background tasks on more energy > efficient CPUs. > > The ultimate goal of utilization clamping is thus to enable: > > - boosting: by selecting an higher capacity CPU and/or higher execution > frequency for small tasks which are affecting the user > interactive experience. > > - capping: by selecting more energy efficiency CPUs or lower execution >frequency, for big tasks which are mainly related to >background activities, and thus without a direct impact on >the user experience. > > Thus, a proper extension of the cpu controller with utilization clamping > support will make this controller even more suitable for integration > with advanced system management software (e.g. Android). > Indeed, an informed user-space can provide rich information hints to the > scheduler regarding the tasks it's going to schedule. > > This patch extends the CPU controller by adding a couple of new > attributes, util_min and util_max, which can be used to enforce task's > utilization boosting and capping. Specifically: > > - util_min: defines the minimum utilization which should be considered, > e.g. when schedutil selects the frequency for a CPU while a > task in this group is RUNNABLE. > i.e. the task will run at least at a minimum frequency which > corresponds to the min_util utilization > > - util_max: defines the maximum utilization which should be considered, > e.g. when schedutil selects the frequency for a CPU while a > task in this group is RUNNABLE. > i.e. 
the task will run up to a maximum frequency which > corresponds to the max_util utilization > > These attributes: > > a) are available only for non-root nodes, both on default and legacy >hierarchies > b) do not enforce any constraints and/or dependency between the parent >and its child nodes, thus relying on the delegation model and >permission settings defined by the system management software > c) allow to (eventually) further restrict task-specific clamps defined >via sched_setattr(2) > > This patch provides the basic support to expose the two new attributes > and to validate their run-time updates. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Tejun Heo > Cc: Rafael J. Wysocki > Cc: Viresh Kumar > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Juri Lelli > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > Documentation/admin-guide/cgroup-v2.rst | 25 > init/Kconfig| 22 +++ > kernel/sched/core.c | 186 > kernel/sched/sched.h| 5 + > 4 files changed, 238 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst > b/Documentation/admin-guide/cgroup-v2.rst > index 8a2c52d5c53b..328c011cc105 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -904,6 +904,12 @@ controller implements weight and absolute bandwidth > limit models for > normal scheduling policy and absolute bandwidth allocation model for > realtime scheduling policy. > > +Cycles distribution is based, by defa
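As a usage illustration, a management daemon could configure such a group along the lines below. The cgroup mount point, the group name and the 0..1023 value range follow this patch's documentation, but the interface is still under review, so treat every path and value here as an assumption:

/*
 * Illustration only: how a management daemon might use the proposed
 * cpu.util_min / cpu.util_max attributes. Paths are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int cg_write(const char *attr, const char *val)
{
	char path[256];
	int fd, ret;

	/* Hypothetical cgroup v2 path for a "foreground" group */
	snprintf(path, sizeof(path), "/sys/fs/cgroup/foreground/%s", attr);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val)) < 0 ? -1 : 0;
	close(fd);
	return ret;
}

int main(void)
{
	/* Boost foreground tasks to at least ~50% capacity ... */
	cg_write("cpu.util_min", "512");
	/* ... and never cap them below maximum capacity */
	cg_write("cpu.util_max", "1023");
	return 0;
}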
Re: [PATCH v2 08/12] sched/core: uclamp: extend cpu's cgroup controller
On Fri, Jul 20, 2018 at 7:37 PM, Suren Baghdasaryan wrote: > On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi > wrote: >> The cgroup's CPU controller allows to assign a specified (maximum) >> bandwidth to the tasks of a group. However this bandwidth is defined and >> enforced only on a temporal base, without considering the actual >> frequency a CPU is running on. Thus, the amount of computation completed >> by a task within an allocated bandwidth can be very different depending >> on the actual frequency the CPU is running that task. >> The amount of computation can be affected also by the specific CPU a >> task is running on, especially when running on asymmetric capacity >> systems like Arm's big.LITTLE. >> >> With the availability of schedutil, the scheduler is now able >> to drive frequency selections based on actual task utilization. >> Moreover, the utilization clamping support provides a mechanism to >> bias the frequency selection operated by schedutil depending on >> constraints assigned to the tasks currently RUNNABLE on a CPU. >> >> Give the above mechanisms, it is now possible to extend the cpu >> controller to specify what is the minimum (or maximum) utilization which >> a task is expected (or allowed) to generate. >> Constraints on minimum and maximum utilization allowed for tasks in a >> CPU cgroup can improve the control on the actual amount of CPU bandwidth >> consumed by tasks. >> >> Utilization clamping constraints are useful not only to bias frequency >> selection, when a task is running, but also to better support certain >> scheduler decisions regarding task placement. For example, on >> asymmetric capacity systems, a utilization clamp value can be >> conveniently used to enforce important interactive tasks on more capable >> CPUs or to run low priority and background tasks on more energy >> efficient CPUs. >> >> The ultimate goal of utilization clamping is thus to enable: >> >> - boosting: by selecting an higher capacity CPU and/or higher execution >> frequency for small tasks which are affecting the user >> interactive experience. >> >> - capping: by selecting more energy efficiency CPUs or lower execution >>frequency, for big tasks which are mainly related to >>background activities, and thus without a direct impact on >>the user experience. >> >> Thus, a proper extension of the cpu controller with utilization clamping >> support will make this controller even more suitable for integration >> with advanced system management software (e.g. Android). >> Indeed, an informed user-space can provide rich information hints to the >> scheduler regarding the tasks it's going to schedule. >> >> This patch extends the CPU controller by adding a couple of new >> attributes, util_min and util_max, which can be used to enforce task's >> utilization boosting and capping. Specifically: >> >> - util_min: defines the minimum utilization which should be considered, >> e.g. when schedutil selects the frequency for a CPU while a >> task in this group is RUNNABLE. >> i.e. the task will run at least at a minimum frequency which >> corresponds to the min_util utilization >> >> - util_max: defines the maximum utilization which should be considered, >> e.g. when schedutil selects the frequency for a CPU while a >> task in this group is RUNNABLE. >> i.e. 
the task will run up to a maximum frequency which >> corresponds to the max_util utilization >> >> These attributes: >> >> a) are available only for non-root nodes, both on default and legacy >>hierarchies >> b) do not enforce any constraints and/or dependency between the parent >>and its child nodes, thus relying on the delegation model and >>permission settings defined by the system management software >> c) allow to (eventually) further restrict task-specific clamps defined >>via sched_setattr(2) >> >> This patch provides the basic support to expose the two new attributes >> and to validate their run-time updates. >> >> Signed-off-by: Patrick Bellasi >> Cc: Ingo Molnar >> Cc: Peter Zijlstra >> Cc: Tejun Heo >> Cc: Rafael J. Wysocki >> Cc: Viresh Kumar >> Cc: Todd Kjos >> Cc: Joel Fernandes >> Cc: Juri Lelli >> Cc: linux-kernel@vger.kernel.org >> Cc: linux...@vger.kernel.org >> --- >> Documentation/admin-guide/
Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps
On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi wrote: > When a task's util_clamp value is configured via sched_setattr(2), this > value has to be properly accounted in the corresponding clamp group > every time the task is enqueued and dequeued. When cgroups are also in > use, per-task clamp values have to be aggregated to those of the CPU's > controller's Task Group (TG) in which the task is currently living. > > Let's update uclamp_cpu_get() to provide aggregation between the task > and the TG clamp values. Every time a task is enqueued, it will be > accounted in the clamp_group which defines the smaller clamp between the > task specific value and its TG value. So choosing smallest for both UCLAMP_MIN and UCLAMP_MAX means the least boosted value and the most clamped value between syscall and TG will be used. My understanding is that boost means "at least this much" and clamp means "at most this much". So to satisfy both TG and syscall requirements I think you would need to choose the largest value for UCLAMP_MIN and the smallest one for UCLAMP_MAX, meaning the most boosted and most clamped range. Current implementation choses the least boosted value, so effectively one of the UCLAMP_MIN requirements (either from TG or from syscall) are being ignored... Could you please clarify why this choice is made? > > This also mimics what already happen for a task's CPU affinity mask when > the task is also living in a cpuset. he overall idea is that cgroup typo: The overall... > attributes are always used to restrict the per-task attributes. > > Thus, this implementation allows to: > > 1. ensure cgroup clamps are always used to restrict task specific >requests, i.e. boosted only up to a granted value or clamped at least >to a certain value > 2. implements a "nice-like" policy, where tasks are still allowed to >request less then what enforced by their current TG > > For this mecanisms to work properly, we need to implement a concept of > "effective" clamp group, which is used to track the currently most > restrictive clamp value each task is subject to. > The effective clamp is computed at enqueue time, by using an additional >task_struct::uclamp_group_id > to keep track of the clamp group in which each task is currently > accounted into. This solution allows to update task constrains on > demand, only when they became RUNNABLE, to always get the least > restrictive clamp depending on the current TG's settings. > > This solution allows also to better decouple the slow-path, where task > and task group clamp values are updated, from the fast-path, where the > most appropriate clamp value is tracked by refcounting clamp groups. > > For consistency purposes, as well as to properly inform userspace, the > sched_getattr(2) call is updated to always return the properly > aggregated constrains as described above. This will also make > sched_getattr(2) a convenient userpace API to know the utilization > constraints enforced on a task by the cgroup's CPU controller. 
> > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Tejun Heo > Cc: Paul Turner > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Steve Muckle > Cc: Juri Lelli > Cc: Dietmar Eggemann > Cc: Morten Rasmussen > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > include/linux/sched.h | 2 ++ > kernel/sched/core.c | 40 +++- > kernel/sched/sched.h | 2 +- > 3 files changed, 38 insertions(+), 6 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 260aa8d3fca9..5dd76a27ec17 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -676,6 +676,8 @@ struct task_struct { > struct sched_dl_entity dl; > > #ifdef CONFIG_UCLAMP_TASK > + /* Clamp group the task is currently accounted into */ > + int uclamp_group_id[UCLAMP_CNT]; > /* Utlization clamp values for this task */ > struct uclamp_seuclamp[UCLAMP_CNT]; > #endif > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 04e758224e22..50613d3d5b83 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -971,8 +971,15 @@ static inline void uclamp_cpu_update(struct rq *rq, int > clamp_id, > * @rq: the CPU's rq where the clamp group has to be reference counted > * @clamp_id: the utilization clamp (e.g. min or max utilization) to > reference > * > - * Once a task is enqueued on a CPU's RQ, the clamp group currently defined > by > - * the task's uclamp.group_id is reference counted on that CPU. > + * Once a task is enqueued on a CPU's RQ, the most restrictive clamp group, > + * among the task specific and that of the task's cgroup one, is reference > + * counted on that CPU. > + * > + * Since the CPUs reference counted clamp group can be either that of the > task > + * or of its cgroup, we keep track of the reference counted clamp group by > + * sto
Re: [PATCH v2 11/12] sched/core: uclamp: update CPU's refcount on TG's clamp changes
On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi wrote: > When a task group refcounts a new clamp group, we need to ensure that > the new clamp values are immediately enforced to all its tasks which are > currently RUNNABLE. This is to ensure that all currently RUNNABLE task tasks > are boosted and/or clamped as requested as soon as possible. > > Let's ensure that, whenever a new clamp group is refcounted by a task > group, all its RUNNABLE tasks are correctly accounted in their > respective CPUs. We do that by slightly refactoring uclamp_group_get() > to get an additional parameter *cgroup_subsys_state which, when > provided, it's used to walk the list of tasks in the correspond TGs and corresponding TGs > update the RUNNABLE ones. > > This is a "brute force" solution which allows to reuse the same refcount > update code already used by the per-task API. That's also the only way > to ensure a prompt enforcement of new clamp constraints on RUNNABLE > tasks, as soon as a task group attribute is tweaked. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Tejun Heo > Cc: Paul Turner > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Steve Muckle > Cc: Juri Lelli > Cc: Dietmar Eggemann > Cc: Morten Rasmussen > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > kernel/sched/core.c | 42 ++ > 1 file changed, 34 insertions(+), 8 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 50613d3d5b83..42cff5ffddae 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1198,21 +1198,43 @@ static inline void uclamp_group_put(int clamp_id, int > group_id) > raw_spin_unlock_irqrestore(&uc_map[group_id].se_lock, flags); > } > > +static inline void uclamp_group_get_tg(struct cgroup_subsys_state *css, > + int clamp_id, unsigned int group_id) > +{ > + struct css_task_iter it; > + struct task_struct *p; > + > + /* Update clamp groups for RUNNABLE tasks in this TG */ > + css_task_iter_start(css, 0, &it); > + while ((p = css_task_iter_next(&it))) > + uclamp_task_update_active(p, clamp_id, group_id); > + css_task_iter_end(&it); > +} > + > /** > * uclamp_group_get: increase the reference count for a clamp group > * @p: the task which clamp value must be tracked > - * @clamp_id: the clamp index affected by the task > - * @uc_se: the utilization clamp data for the task > - * @clamp_value: the new clamp value for the task > + * @css: the task group which clamp value must be tracked > + * @clamp_id: the clamp index affected by the task (group) > + * @uc_se: the utilization clamp data for the task (group) > + * @clamp_value: the new clamp value for the task (group) > * > * Each time a task changes its utilization clamp value, for a specified > clamp > * index, we need to find an available clamp group which can be used to track > * this new clamp value. The corresponding clamp group index will be used by > * the task to reference count the clamp value on CPUs while enqueued. > * > + * When the cgroup's cpu controller utilization clamping support is enabled, > + * each task group has a set of clamp values which are used to restrict the > + * corresponding task specific clamp values. > + * When a clamp value for a task group is changed, all the (active) tasks > + * belonging to that task group must be update to ensure they are refcounting must be updated > + * the correct CPU's clamp value. > + * > * Return: -ENOSPC if there are no available clamp groups, 0 on success. 
> */ > static inline int uclamp_group_get(struct task_struct *p, > + struct cgroup_subsys_state *css, >int clamp_id, struct uclamp_se *uc_se, >unsigned int clamp_value) > { > @@ -1240,6 +1262,10 @@ static inline int uclamp_group_get(struct task_struct > *p, > uc_map[next_group_id].se_count += 1; > raw_spin_unlock_irqrestore(&uc_map[next_group_id].se_lock, flags); > > + /* Newly created TG don't have tasks assigned */ > + if (css) > + uclamp_group_get_tg(css, clamp_id, next_group_id); > + > /* Update CPU's clamp group refcounts of RUNNABLE task */ > if (p) > uclamp_task_update_active(p, clamp_id, next_group_id); > @@ -1307,7 +1333,7 @@ static inline int alloc_uclamp_sched_group(struct > task_group *tg, > uc_se->value = parent->uclamp[clamp_id].value; > uc_se->group_id = UCLAMP_NONE; > > - if (uclamp_group_get(NULL, clamp_id, uc_se, > + if (uclamp_group_get(NULL, NULL, clamp_id, uc_se, > parent->uclamp[clamp_id].value)) { > ret = 0; > goto out; > @@ -1362,12 +1388,12 @@ static inline int __setschedul
Re: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values
On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi wrote: > The utilization is a well defined property of tasks and CPUs with an > in-kernel representation based on power-of-two values. > The current representation, in the [0..SCHED_CAPACITY_SCALE] range, > allows efficient computations in hot-paths and a sufficient fixed point > arithmetic precision. > However, the utilization values range is still an implementation detail > which is also possibly subject to changes in the future. > > Since we don't want to commit new user-space APIs to any in-kernel > implementation detail, let's add an abstraction layer on top of the APIs > used by util_clamp, i.e. sched_{set,get}attr syscalls and the cgroup's > cpu.util_{min,max} attributes. > > We do that by adding a couple of conversion function which can be used couple of conversion functions > to conveniently transform utilization/capacity values from/to the internal > SCHED_FIXEDPOINT_SCALE representation to/from a more generic percentage > in the standard [0..100] range. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Tejun Heo > Cc: Rafael J. Wysocki > Cc: Paul Turner > Cc: Todd Kjos > Cc: Joel Fernandes > Cc: Steve Muckle > Cc: Juri Lelli > Cc: linux-kernel@vger.kernel.org > Cc: linux...@vger.kernel.org > --- > Documentation/admin-guide/cgroup-v2.rst | 6 +++--- > include/linux/sched.h | 20 > include/uapi/linux/sched/types.h| 14 -- > kernel/sched/core.c | 18 -- > 4 files changed, 43 insertions(+), 15 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst > b/Documentation/admin-guide/cgroup-v2.rst > index 328c011cc105..08b8062e55cd 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -973,7 +973,7 @@ All time durations are in microseconds. > A read-write single value file which exists on non-root cgroups. > The default is "0", i.e. no bandwidth boosting. > > -The minimum utilization in the range [0, 1023]. > +The minimum percentage of utilization in the range [0, 100]. > > This interface allows reading and setting minimum utilization clamp > values similar to the sched_setattr(2). This minimum utilization > @@ -981,9 +981,9 @@ All time durations are in microseconds. > >cpu.util_max > A read-write single value file which exists on non-root cgroups. > -The default is "1023". i.e. no bandwidth clamping > +The default is "100". i.e. no bandwidth clamping > > -The maximum utilization in the range [0, 1023]. > +The maximum percentage of utilization in the range [0, 100]. > > This interface allows reading and setting maximum utilization clamp > values similar to the sched_setattr(2). This maximum utilization > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 5dd76a27ec17..f5970903c187 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -321,6 +321,26 @@ struct sched_info { > # define SCHED_FIXEDPOINT_SHIFT10 > # define SCHED_FIXEDPOINT_SCALE(1L << SCHED_FIXEDPOINT_SHIFT) > > +static inline unsigned int scale_from_percent(unsigned int pct) > +{ > + WARN_ON(pct > 100); > + > + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100); > +} > + > +static inline unsigned int scale_to_percent(unsigned int value) > +{ > + unsigned int rounding = 0; > + > + WARN_ON(value > SCHED_FIXEDPOINT_SCALE); > + > + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */ > + if (likely((value & 0xFF) && ~(value & 0x700))) > + rounding = 1; Hmm. I don't think ~(value & 0x700) will ever yield FALSE... What am I missing? 
> + > + return (rounding + ((100 * value) / SCHED_FIXEDPOINT_SCALE)); > +} > + > struct load_weight { > unsigned long weight; > u32 inv_weight; > diff --git a/include/uapi/linux/sched/types.h > b/include/uapi/linux/sched/types.h > index 7421cd25354d..e2c2acb1c6af 100644 > --- a/include/uapi/linux/sched/types.h > +++ b/include/uapi/linux/sched/types.h > @@ -84,15 +84,17 @@ struct sched_param { > * > * @sched_util_minrepresents the minimum utilization > * @sched_util_maxrepresents the maximum utilization > + * @sched_util_minrepresents the minimum utilization percentage > + * @sched_util_maxrepresents the maximum utilization percentage > * > - * Utilization is a value in the range [0..SCHED_CAPACITY_SCALE] which > - * represents the percentage of CPU time used by a task when running at the > - * maximum frequency on the highest capacity CPU of the system. Thus, for > - * example, a 20% utilization task is a task running for 2ms every 10ms. > + * Utilization is a value in the range [0..100] which represents the > + * percentage of CPU time u
Re: [PATCH v2 07/12] sched/core: uclamp: enforce last task UCLAMP_MAX
On Mon, Jul 23, 2018 at 8:02 AM, Patrick Bellasi wrote: > On 20-Jul 18:23, Suren Baghdasaryan wrote: >> Hi Patrick, > > Hi Sure, > thank! > >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi >> wrote: > > [...] > >> > @@ -977,13 +991,21 @@ static inline void uclamp_cpu_get_id(struct >> > task_struct *p, >> > uc_grp = &rq->uclamp.group[clamp_id][0]; >> > uc_grp[group_id].tasks += 1; >> > >> > + /* Force clamp update on idle exit */ >> > + uc_cpu = &rq->uclamp; >> > + clamp_value = p->uclamp[clamp_id].value; >> > + if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) { >> >> The condition below is not needed because UCLAMP_FLAG_IDLE is set only >> for UCLAMP_MAX clamp_id, therefore the above condition already covers >> the one below. > > Not really, this function is called two times, the first time to > update UCLAMP_MIN and a second time to update UCLAMP_MAX. > > For both clamp_id we want to force update uc_cpu->value[clamp_id], > thus the UCLAMP_FLAG_IDLE flag has to be cleared only the second time. > > Maybe I can had the following comment to better explain the reason of > the check: > > /* > * This function is called for both UCLAMP_MIN (before) and > * UCLAMP_MAX (after). Let's reset the flag only the when > * we know that UCLAMP_MIN has been already updated. > */ > Ah, my bad. I missed the fact that uc_cpu->flags is shared for both UCLAMP_MIN and UCLAMP_MAX. It's fine the way it originally was. Thanks for explanation! >> > + if (clamp_id == UCLAMP_MAX) >> > + uc_cpu->flags &= ~UCLAMP_FLAG_IDLE; >> > + uc_cpu->value[clamp_id] = clamp_value; >> > + return; >> > + } > > [...] > > -- > #include > > Patrick Bellasi
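In other words, on the enqueue path the accounting helper runs once per clamp index, MIN before MAX, and the shared idle flag may only be dropped on the second call. Roughly (an illustrative sketch, not the patch code):

/*
 * Sketch of the ordering discussed above: the helper runs once for
 * UCLAMP_MIN and once for UCLAMP_MAX, while the idle flag is shared, so
 * it is only cleared on the second (UCLAMP_MAX) call.
 */
enum { UCLAMP_MIN_ID, UCLAMP_MAX_ID, UCLAMP_CNT_ID };
#define FLAG_IDLE	0x01

struct cpu_clamp {
	unsigned int flags;
	int value[UCLAMP_CNT_ID];
};

static void cpu_get_id(struct cpu_clamp *uc_cpu, int clamp_id, int clamp_value)
{
	if (uc_cpu->flags & FLAG_IDLE) {
		/* Force the clamp value for both indexes on idle exit... */
		uc_cpu->value[clamp_id] = clamp_value;
		/* ...but reset the shared flag only after MIN was handled */
		if (clamp_id == UCLAMP_MAX_ID)
			uc_cpu->flags &= ~FLAG_IDLE;
		return;
	}
	if (uc_cpu->value[clamp_id] < clamp_value)
		uc_cpu->value[clamp_id] = clamp_value;
}

static void enqueue(struct cpu_clamp *uc_cpu, int min_val, int max_val)
{
	/* enqueue_task() path: MIN is always processed before MAX */
	cpu_get_id(uc_cpu, UCLAMP_MIN_ID, min_val);
	cpu_get_id(uc_cpu, UCLAMP_MAX_ID, max_val);
}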
Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps
On Mon, Jul 23, 2018 at 8:40 AM, Patrick Bellasi wrote: > On 21-Jul 20:05, Suren Baghdasaryan wrote: >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi >> wrote: >> > When a task's util_clamp value is configured via sched_setattr(2), this >> > value has to be properly accounted in the corresponding clamp group >> > every time the task is enqueued and dequeued. When cgroups are also in >> > use, per-task clamp values have to be aggregated to those of the CPU's >> > controller's Task Group (TG) in which the task is currently living. >> > >> > Let's update uclamp_cpu_get() to provide aggregation between the task >> > and the TG clamp values. Every time a task is enqueued, it will be >> > accounted in the clamp_group which defines the smaller clamp between the >> > task specific value and its TG value. >> >> So choosing smallest for both UCLAMP_MIN and UCLAMP_MAX means the >> least boosted value and the most clamped value between syscall and TG >> will be used. > > Right > >> My understanding is that boost means "at least this much" and clamp >> means "at most this much". > > Right > >> So to satisfy both TG and syscall requirements I think you would >> need to choose the largest value for UCLAMP_MIN and the smallest one >> for UCLAMP_MAX, meaning the most boosted and most clamped range. >> Current implementation choses the least boosted value, so >> effectively one of the UCLAMP_MIN requirements (either from TG or >> from syscall) are being ignored... Could you please clarify why >> this choice is made? > > The TG values are always used to specify a _restriction_ on > task-specific values. > > Thus, if you look or example at the CPU mask for a task, you can have > a task with affinity to CPUs 0-1, currently running on a cgroup with > cpuset.cpus=0... then the task can run only on CPU 0 (althought its > affinity includes CPU1 too). > > Same we do here: if a task has util_min=10, but it's running in a > cgroup with cpu.util_min=0, then it will not be boosted. > > IOW, this allows to implement a "nice" policy at task level, where a > task (via syscall) can decide to be less boosted with respect to its > group but never more boosted. The same task can also decide to be more > clamped, but not less clamped then its current group. > The fact that boost means "at least this much" to me seems like we can safely choose higher CPU bandwidth (as long as it's lower than UCLAMP_MAX) but from your description sounds like TG's UCLAMP_MIN means "at most this much boost" and it's not safe to use CPU bandwidth higher than TG's UCLAMP_MIN. So instead of specifying min CPU bandwidth for a task it specifies the max allowed boost. Seems like a discrepancy to me but maybe there are compelling usecases when this behavior is necessary? In that case would be good to spell them out to explain why this choice is made. > [...] > >> > @@ -982,18 +989,30 @@ static inline void uclamp_cpu_get_id(struct >> > task_struct *p, >> > int clamp_value; >> > int group_id; >> > >> > - /* No task specific clamp values: nothing to do */ >> > group_id = p->uclamp[clamp_id].group_id; >> > + clamp_value = p->uclamp[clamp_id].value; >> > +#ifdef CONFIG_UCLAMP_TASK_GROUP >> > + /* Use TG's clamp value to limit task specific values */ >> > + if (group_id == UCLAMP_NONE || >> > + clamp_value >= task_group(p)->uclamp[clamp_id].value) { >> >> Not a big deal but do you need to override if (clamp_value == >> task_group(p)->uclamp[clamp_id].value)? 
Maybe: >> - clamp_value >= task_group(p)->uclamp[clamp_id].value) { >> + clamp_value > task_group(p)->uclamp[clamp_id].value) { > > Good point, yes... the override is not really changing anything here. > Will fix this! > >> > + clamp_value = task_group(p)->uclamp[clamp_id].value; >> > + group_id = task_group(p)->uclamp[clamp_id].group_id; >> > + } >> > +#else >> > + /* No task specific clamp values: nothing to do */ >> > if (group_id == UCLAMP_NONE) >> > return; >> > +#endif >> > >> > /* Reference count the task into its current group_id */ >> > uc_grp = &rq->uclamp.group[clamp_id][0]; >> > uc_grp[group_id].tasks += 1; >> > >> > + /* Track the effective clamp group */ >> > + p->uclamp_group_id[clamp_id] = group_id; >> > + >> > /* Force clamp update on idle exit */ >> > uc_cpu = &rq->uclamp; >> > - clamp_value = p->uclamp[clamp_id].value; >> > if (unlikely(uc_cpu->flags & UCLAMP_FLAG_IDLE)) { >> > if (clamp_id == UCLAMP_MAX) >> > uc_cpu->flags &= ~UCLAMP_FLAG_IDLE; > > [...] > > -- > #include > > Patrick Bellasi
Re: [PATCH v2 10/12] sched/core: uclamp: use TG's clamps to restrict Task's clamps
Hi Patrick. Thanks for the explanation and links. No more questions from me on this one :) On Tue, Jul 24, 2018 at 2:56 AM, Patrick Bellasi wrote: > On 23-Jul 10:11, Suren Baghdasaryan wrote: >> On Mon, Jul 23, 2018 at 8:40 AM, Patrick Bellasi >> wrote: >> > On 21-Jul 20:05, Suren Baghdasaryan wrote: >> >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi > > [...] > >> >> So to satisfy both TG and syscall requirements I think you would >> >> need to choose the largest value for UCLAMP_MIN and the smallest one >> >> for UCLAMP_MAX, meaning the most boosted and most clamped range. >> >> Current implementation choses the least boosted value, so >> >> effectively one of the UCLAMP_MIN requirements (either from TG or >> >> from syscall) are being ignored... Could you please clarify why >> >> this choice is made? >> > >> > The TG values are always used to specify a _restriction_ on >> > task-specific values. >> > >> > Thus, if you look or example at the CPU mask for a task, you can have >> > a task with affinity to CPUs 0-1, currently running on a cgroup with >> > cpuset.cpus=0... then the task can run only on CPU 0 (althought its >> > affinity includes CPU1 too). >> > >> > Same we do here: if a task has util_min=10, but it's running in a >> > cgroup with cpu.util_min=0, then it will not be boosted. >> > >> > IOW, this allows to implement a "nice" policy at task level, where a >> > task (via syscall) can decide to be less boosted with respect to its >> > group but never more boosted. The same task can also decide to be more >> > clamped, but not less clamped then its current group. >> > >> >> The fact that boost means "at least this much" to me seems like we can >> safely choose higher CPU bandwidth (as long as it's lower than >> UCLAMP_MAX) > > I understand your view point, which actually is matching my first > implementation for util_min aggregation: > >https://lore.kernel.org/lkml/20180409165615.2326-5-patrick.bell...@arm.com/ > > >> but from your description sounds like TG's UCLAMP_MIN means "at most >> this much boost" and it's not safe to use CPU bandwidth higher than >> TG's UCLAMP_MIN. > > Indeed, after this discussion with Tejun: > > > https://lore.kernel.org/lkml/20180409222417.gk3126...@devbig577.frc2.facebook.com/ > > I've convinced myself that for the cgroup interface we have to got for > a "restrictive" interface where a parent value must set the upper > bound for all its descendants values. AFAIU, that's one of the basic > principles of the "delegation model" implemented by cgroups and the > common behavior implemented by all controllers. > >> So instead of specifying min CPU bandwidth for a task it specifies >> the max allowed boost. Seems like a discrepancy to me but maybe >> there are compelling usecases when this behavior is necessary? > > I don't think it's strictly related to use-cases, you can always > describe a give use-case in one model or the other. It all depends on > how you configure your hierarchy and where you place your tasks. > > For our Android use cases, we are still happy to say that all tasks of > a CGroup can be boosted up to a certain value and then we can either: > - don't configure tasks: and thus get the CG defined boost > - configure a task: and explicitly give back what we don't need > > This model works quite well with containers, where the parent want to > precisely control how much resources are (eventually) usable by a > given container. > >> In that case would be good to spell them out to explain why this >> choice is made. > > Yes, well... 
if I understand it correctly is really just the > recommended way cgroups must be used to re-partition resources. > > I'll try to better explain this behavior in the changelog for this > patch. > > [...] > > Best, > Patrick > > -- > #include > > Patrick Bellasi
Re: [PATCH v2 12/12] sched/core: uclamp: use percentage clamp values
On Tue, Jul 24, 2018 at 9:43 AM, Patrick Bellasi wrote: > On 21-Jul 21:04, Suren Baghdasaryan wrote: >> On Mon, Jul 16, 2018 at 1:29 AM, Patrick Bellasi >> wrote: > > [...] > >> > +static inline unsigned int scale_from_percent(unsigned int pct) >> > +{ >> > + WARN_ON(pct > 100); >> > + >> > + return ((SCHED_FIXEDPOINT_SCALE * pct) / 100); >> > +} >> > + >> > +static inline unsigned int scale_to_percent(unsigned int value) >> > +{ >> > + unsigned int rounding = 0; >> > + >> > + WARN_ON(value > SCHED_FIXEDPOINT_SCALE); >> > + >> > + /* Compensate rounding errors for: 0, 256, 512, 768, 1024 */ >> > + if (likely((value & 0xFF) && ~(value & 0x700))) >> > + rounding = 1; >> >> Hmm. I don't think ~(value & 0x700) will ever yield FALSE... What am I >> missing? > > So, 0x700 is the topmost 3 bits sets (111 ) which different > configuration corresponds to: > > 001 => 256 > 010 => 512 > 011 => 768 > 100 => 1024 > > Thus, if 0x700 matches then we have one of these values in input and > for these cases we have to add a unit to the percentage value. > > For the case (value == 0) we translate it into 0% thanks to the check > on (value & 0xFF) to ensure rounding = 0. > I think just (value & 0xFF) is enough to get you the right behavior. ~(value & 0x700) is not needed, it's effectively a NoOp which always yields TRUE. For any *value* (value & 0x700) == 0x...00 and ~(value & 0x700) == 0x...FF == TRUE. > Here is a small python snippet I've used to check the conversion of > all the possible percentage values: > > ---8<--- > values = range(0, 101) > for pct in xrange(0, 101): > util = int((1024 * pct) / 100) > rounding = 1 > if not ((util & 0xFF) and ~(util & 0x700)): > print "Fixing util_to_perc({:3d} => {:4d})".format(pct, util) > rounding = 0 > pct2 = (rounding + ((100 * util) / 1024)) > if pct2 in values: > values.remove(pct2) > if pct != pct2: > print "Convertion failed for: {:3d} => {:4d} => > {:3d}".format(pct, util, pct2) > if values: > print "ERROR: not all percentage values converted" > ---8<--- > > -- > #include > > Patrick Bellasi
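To spell the arithmetic out: (value & 0x700) is at most 0x700, so its bitwise negation is never zero and that half of the original test can never fail; keeping only the low-byte check is enough to skip the +1 for the exact multiples of 256 (0, 256, 512, 768, 1024). A small standalone check of that claim (not the patch itself):

/*
 * Standalone check of the rounding discussion above; not the patch code.
 * ~(value & 0x700) is never 0, so the original condition degenerates to
 * (value & 0xFF), which alone gives an exact 0..100% round trip.
 */
#include <assert.h>
#include <stdio.h>

#define SCHED_FIXEDPOINT_SCALE 1024

static unsigned int to_percent(unsigned int value)
{
	/* +1 for every value that is not an exact multiple of 256 */
	unsigned int rounding = (value & 0xFF) ? 1 : 0;

	return rounding + (100 * value) / SCHED_FIXEDPOINT_SCALE;
}

int main(void)
{
	unsigned int pct, util;

	/* Round-trip every percentage through the utilization scale */
	for (pct = 0; pct <= 100; pct++) {
		util = (SCHED_FIXEDPOINT_SCALE * pct) / 100;
		assert(to_percent(util) == pct);
	}
	printf("all 0..100%% values round-trip\n");
	return 0;
}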
Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner wrote: > Right now, psi reports pressure and stall times of already concluded > stall events. For most use cases this is current enough, but certain > highly latency-sensitive applications, like the Android OOM killer, to be more precise, it's Android LMKD (low memory killer daemon) not to be confused with kernel OOM killer. > might want to know about and react to stall states before they have > even concluded (e.g. a prolonged reclaim cycle). > > This patches the procfs/cgroupfs interface such that when the pressure > metrics are read, the current per-cpu states, if any, are taken into > account as well. > > Any ongoing states are concluded, their time snapshotted, and then > restarted. This requires holding the rq lock to avoid corruption. It > could use some form of rq lock ratelimiting or avoidance. > > Requested-by: Suren Baghdasaryan > Not-yet-signed-off-by: Johannes Weiner > --- IMHO this description is a little difficult to understand. In essence, PSI information is being updated periodically every 2secs and without this patch the data can be stale at the time when we read it (because it was last updated up to 2secs ago). To avoid this we update the PSI "total" values when data is being read. > kernel/sched/psi.c | 56 +- > 1 file changed, 46 insertions(+), 10 deletions(-) > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c > index 53e0b7b83e2e..5a6c6057f775 100644 > --- a/kernel/sched/psi.c > +++ b/kernel/sched/psi.c > @@ -190,7 +190,7 @@ static void calc_avgs(unsigned long avg[3], u64 time, int > missed_periods) > } > } > > -static bool psi_update_stats(struct psi_group *group) > +static bool psi_update_stats(struct psi_group *group, bool ondemand) > { > u64 some[NR_PSI_RESOURCES] = { 0, }; > u64 full[NR_PSI_RESOURCES] = { 0, }; > @@ -200,8 +200,6 @@ static bool psi_update_stats(struct psi_group *group) > int cpu; > int r; > > - mutex_lock(&group->stat_lock); > - > /* > * Collect the per-cpu time buckets and average them into a > * single time sample that is normalized to wallclock time. > @@ -218,10 +216,36 @@ static bool psi_update_stats(struct psi_group *group) > for_each_online_cpu(cpu) { > struct psi_group_cpu *groupc = per_cpu_ptr(group->cpus, cpu); > unsigned long nonidle; > + struct rq_flags rf; > + struct rq *rq; > + u64 now; > > - if (!groupc->nonidle_time) > + if (!groupc->nonidle_time && !groupc->nonidle) > continue; > > + /* > +* We come here for two things: 1) periodic per-cpu > +* bucket flushing and averaging and 2) when the user > +* wants to read a pressure file. For flushing and > +* averaging, which is relatively infrequent, we can > +* be lazy and tolerate some raciness with concurrent > +* updates to the per-cpu counters. However, if a user > +* polls the pressure state, we want to give them the > +* most uptodate information we have, including any > +* currently active state which hasn't been timed yet, > +* because in case of an iowait or a reclaim run, that > +* can be significant. 
> +*/ > + if (ondemand) { > + rq = cpu_rq(cpu); > + rq_lock_irq(rq, &rf); > + > + now = cpu_clock(cpu); > + > + groupc->nonidle_time += now - groupc->nonidle_start; > + groupc->nonidle_start = now; > + } > + > nonidle = nsecs_to_jiffies(groupc->nonidle_time); > groupc->nonidle_time = 0; > nonidle_total += nonidle; > @@ -229,13 +253,27 @@ static bool psi_update_stats(struct psi_group *group) > for (r = 0; r < NR_PSI_RESOURCES; r++) { > struct psi_resource *res = &groupc->res[r]; > > + if (ondemand && res->state != PSI_NONE) { > + bool is_full = res->state == PSI_FULL; > + > + res->times[is_full] += now - res->state_start; > + res->state_start = now; > + } > + > some[r] += (res->times[0] + res->times[1]) * nonidle; >
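The idea behind the ondemand path quoted above can be reduced to a few lines; the following is a schematic model of "conclude the ongoing state at read time", not the kernel code itself:

---8<---
/* Schematic of the read-time folding done under the rq lock above:
 * an ongoing stall has its elapsed time added to the total and its
 * start timestamp reset, so readers see current numbers even while
 * the stall is still in progress. Types and names are illustrative. */
struct stall_state {
	int active;                      /* stall currently in progress? */
	unsigned long long start_ns;     /* when the current stall began */
	unsigned long long total_ns;     /* concluded stall time so far */
};

static unsigned long long read_total(struct stall_state *s,
				     unsigned long long now_ns)
{
	if (s->active) {
		s->total_ns += now_ns - s->start_ns;
		s->start_ns = now_ns;	/* restart the ongoing state */
	}
	return s->total_ns;
}
---8<---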
Re: [RFC PATCH 10/10] psi: aggregate ongoing stall events when somebody reads pressure
On Fri, Jul 13, 2018 at 3:49 PM, Johannes Weiner wrote: > On Fri, Jul 13, 2018 at 03:13:07PM -0700, Suren Baghdasaryan wrote: >> On Thu, Jul 12, 2018 at 10:29 AM, Johannes Weiner wrote: >> > might want to know about and react to stall states before they have >> > even concluded (e.g. a prolonged reclaim cycle). >> > >> > This patches the procfs/cgroupfs interface such that when the pressure >> > metrics are read, the current per-cpu states, if any, are taken into >> > account as well. >> > >> > Any ongoing states are concluded, their time snapshotted, and then >> > restarted. This requires holding the rq lock to avoid corruption. It >> > could use some form of rq lock ratelimiting or avoidance. >> > >> > Requested-by: Suren Baghdasaryan >> > Not-yet-signed-off-by: Johannes Weiner >> > --- >> >> IMHO this description is a little difficult to understand. In essence, >> PSI information is being updated periodically every 2secs and without >> this patch the data can be stale at the time when we read it (because >> it was last updated up to 2secs ago). To avoid this we update the PSI >> "total" values when data is being read. > > That fix I actually folded into the main patch. We now always update > the total= field at the time the user reads to include all concluded > events, even if we sampled less than 2s ago. Only the running averages > are still bound to the 2s sampling window. > > What this patch adds on top is for total= to include any *ongoing* > stall events that might be happening on a CPU at the time of reading > from the interface, like a reclaim cycle that hasn't finished yet. Ok, I see now what you mean. So the ondemand flag controls whether *ongoing* stall events are accounted for or not. Nit: maybe rename that flag to better explain its function?
Re: [PATCH] dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
Dear kernel maintainers. I know it was close to holiday season when I send this patch last month, so delay was expected. Could you please take a look at it and provide your feedback? Thanks! On Wed, Dec 6, 2017 at 9:27 AM, Suren Baghdasaryan wrote: > When system is under memory pressure it is observed that dm bufio > shrinker often reclaims only one buffer per scan. This change fixes > the following two issues in dm bufio shrinker that cause this behavior: > > 1. ((nr_to_scan - freed) <= retain_target) condition is used to > terminate slab scan process. This assumes that nr_to_scan is equal > to the LRU size, which might not be correct because do_shrink_slab() > in vmscan.c calculates nr_to_scan using multiple inputs. > As a result when nr_to_scan is less than retain_target (64) the scan > will terminate after the first iteration, effectively reclaiming one > buffer per scan and making scans very inefficient. This hurts vmscan > performance especially because mutex is acquired/released every time > dm_bufio_shrink_scan() is called. > New implementation uses ((LRU size - freed) <= retain_target) > condition for scan termination. LRU size can be safely determined > inside __scan() because this function is called after dm_bufio_lock(). > > 2. do_shrink_slab() uses value returned by dm_bufio_shrink_count() to > determine number of freeable objects in the slab. However dm_bufio > always retains retain_target buffers in its LRU and will terminate > a scan when this mark is reached. Therefore returning the entire LRU size > from dm_bufio_shrink_count() is misleading because that does not > represent the number of freeable objects that slab will reclaim during > a scan. Returning (LRU size - retain_target) better represents the > number of freeable objects in the slab. This way do_shrink_slab() > returns 0 when (LRU size < retain_target) and vmscan will not try to > scan this shrinker avoiding scans that will not reclaim any memory. > > Test: tested using Android device running > /system/extras/alloc-stress that generates memory pressure > and causes intensive shrinker scans > > Signed-off-by: Suren Baghdasaryan > --- > drivers/md/dm-bufio.c | 8 ++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c > index b8ac5917..c546b567f3b5 100644 > --- a/drivers/md/dm-bufio.c > +++ b/drivers/md/dm-bufio.c > @@ -1611,7 +1611,8 @@ static unsigned long __scan(struct dm_bufio_client *c, > unsigned long nr_to_scan, > int l; > struct dm_buffer *b, *tmp; > unsigned long freed = 0; > - unsigned long count = nr_to_scan; > + unsigned long count = c->n_buffers[LIST_CLEAN] + > + c->n_buffers[LIST_DIRTY]; > unsigned long retain_target = get_retain_buffers(c); > > for (l = 0; l < LIST_SIZE; l++) { > @@ -1647,8 +1648,11 @@ static unsigned long > dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > { > struct dm_bufio_client *c = container_of(shrink, struct > dm_bufio_client, shrinker); > + unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) + > + READ_ONCE(c->n_buffers[LIST_DIRTY]); > + unsigned long retain_target = get_retain_buffers(c); > > - return READ_ONCE(c->n_buffers[LIST_CLEAN]) + > READ_ONCE(c->n_buffers[LIST_DIRTY]); > + return (count < retain_target) ? 0 : (count - retain_target); > } > > /* > -- > 2.15.0.531.g2ccb3012c9-goog >
Re: [PATCH 0/7] psi: pressure stall information for CPU, memory, and IO
Hi Johannes, I tried your previous memdelay patches before this new set was posted and results were promising for predicting when Android system is close to OOM. I'm definitely going to try this one after I backport it to 4.9. On Mon, May 7, 2018 at 2:01 PM, Johannes Weiner wrote: > Hi, > > I previously submitted a version of this patch set called "memdelay", > which translated delays from reclaim, swap-in, thrashing page cache > into a pressure percentage of lost walltime. I've since extended this > code to aggregate all delay states tracked by delayacct in order to > have generalized pressure/overcommit levels for CPU, memory, and IO. > > There was feedback from Peter on the previous version that I have > incorporated as much as possible and as it still applies to this code: > > - got rid of the extra lock in the sched callbacks; all task > state changes we care about serialize through rq->lock > > - got rid of ktime_get() inside the sched callbacks and > switched time measuring to rq_clock() > > - got rid of all divisions inside the sched callbacks, > tracking everything natively in ns now > > I also moved this stuff into existing sched/stat.h callbacks, so it > doesn't get in the way in sched/core.c, and of course moved the whole > thing behind CONFIG_PSI since not everyone is going to want it. Would it make sense to split CONFIG_PSI into CONFIG_PSI_CPU, CONFIG_PSI_MEM and CONFIG_PSI_IO since one might need only specific subset of this feature? > > Real-world applications > > Since the last posting, we've begun using the data collected by this > code quite extensively at Facebook, and with several success stories. > > First we used it on systems that frequently locked up in low memory > situations. The reason this happens is that the OOM killer is > triggered by reclaim not being able to make forward progress, but with > fast flash devices there is *always* some clean and uptodate cache to > reclaim; the OOM killer never kicks in, even as tasks wait 80-90% of > the time faulting executables. There is no situation where this ever > makes sense in practice. We wrote a <100 line POC python script to > monitor memory pressure and kill stuff manually, way before such > pathological thrashing. > > We've since extended the python script into a more generic oomd that > we use all over the place, not just to avoid livelocks but also to > guarantee latency and throughput SLAs, since they're usually violated > way before the kernel OOM killer would ever kick in. > > We also use the memory pressure info for loadshedding. Our batch job > infrastructure used to refuse new requests on heuristics based on RSS > and other existing VM metrics in an attempt to avoid OOM kills and > maximize utilization. Since it was still plagued by frequent OOM > kills, we switched it to shed load on psi memory pressure, which has > turned out to be a much better bellwether, and we managed to reduce > OOM kills drastically. Reducing the rate of OOM outages from the > worker pool raised its aggregate productivity, and we were able to > switch that service to smaller machines. > > Lastly, we use cgroups to isolate a machine's main workload from > maintenance crap like package upgrades, logging, configuration, as > well as to prevent multiple workloads on a machine from stepping on > each others' toes. We were not able to do this properly without the > pressure metrics; we would see latency or bandwidth drops, but it > would often be hard to impossible to rootcause it post-mortem. 
We now > log and graph the pressure metrics for all containers in our fleet and > can trivially link service drops to resource pressure after the fact. > > How do you use this? > > A kernel with CONFIG_PSI=y will create a /proc/pressure directory with > 3 files: cpu, memory, and io. If using cgroup2, cgroups will also have > cpu.pressure, memory.pressure and io.pressure files, which simply > calculate pressure at the cgroup level instead of system-wide. > > The cpu file contains one line: > > some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722 > > The averages give the percentage of walltime in which some tasks are > delayed on the runqueue while another task has the CPU. They're recent > averages over 10s, 1m, 5m windows, so you can tell short term trends > from long term ones, similarly to the load average. > > What to make of this number? If CPU utilization is at 100% and CPU > pressure is 0, it means the system is perfectly utilized, with one > runnable thread per CPU and nobody waiting. At two or more runnable > tasks per CPU, the system is 100% overcommitted and the pressure > average will indicate as much. From a utilization perspective this is > a great state of course: no CPU cycles are being wasted, even when 50% > of the threads were to go idle (and most workloads do vary). From the > perspective of the individual job it's not great, however, and they > might do better with more resou
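For reference, reading the interface described above from userspace is straightforward; a minimal sketch, assuming the documented single-line "some" output format, with error handling trimmed:

---8<---
#include <stdio.h>

/* Minimal reader for the "some" line of /proc/pressure/cpu in the
 * format shown above; a real consumer would also read the io and
 * memory files and handle errors properly. */
int main(void)
{
	float avg10, avg60, avg300;
	unsigned long long total;
	FILE *f = fopen("/proc/pressure/cpu", "r");

	if (!f)
		return 1;
	if (fscanf(f, "some avg10=%f avg60=%f avg300=%f total=%llu",
		   &avg10, &avg60, &avg300, &total) == 4)
		printf("cpu some: avg10=%.2f%% total=%llu\n", avg10, total);
	fclose(f);
	return 0;
}
---8<---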
Re: [PATCH 3/6] psi: eliminate lazy clock mode
On Mon, Dec 17, 2018 at 6:58 AM Peter Zijlstra wrote: > > On Fri, Dec 14, 2018 at 09:15:05AM -0800, Suren Baghdasaryan wrote: > > Eliminate the idle mode and keep the worker doing 2s update intervals > > at all times. > > That sounds like a bad deal.. esp. so for battery powered devices like > say Andoird. > > In general the push has been to always idle everything, see NOHZ and > NOHZ_FULL and all the work that's being put into getting rid of any and > all period work. Thanks for the feedback Peter! The removal of idle mode is unfortunate but so far we could not find an elegant solution to handle 3 states (IDLE / REGULAR / POLLING) without additional synchronization inside the hotpath. The issue, as I remember it, was that while scheduling a regular update inside psi_group_change() (IDLE to REGULAR transition) we might override an earlier update being scheduled inside psi_update_work(). I think we can solve that by using mod_delayed_work_on() inside psi_update_work() but I might be missing some other race. I'll discuss this again with Johannes and see if we can synchronize all states using only atomic operations on clock_mode.
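To illustrate the direction hinted at in the last paragraph, here is a rough userspace-style sketch (C11 atomics standing in for the kernel primitives) of what synchronizing the IDLE/REGULAR transitions purely through an atomic clock_mode could look like; this is speculation about a possible design, not code from the series:

---8<---
#include <stdatomic.h>
#include <stdbool.h>

enum clock_mode { CLOCK_IDLE, CLOCK_REGULAR, CLOCK_POLLING };

static _Atomic int clock_mode = CLOCK_IDLE;

/* Hot path: only the IDLE->REGULAR transition pays for a cmpxchg,
 * and only one racing CPU wins the right to schedule the worker. */
static bool hotpath_needs_schedule(void)
{
	int expected = CLOCK_IDLE;

	return atomic_compare_exchange_strong(&clock_mode, &expected,
					      CLOCK_REGULAR);
}

/* Aggregation worker: drop back to IDLE only if no activity was seen;
 * activity arriving afterwards goes through the hot path above and
 * reschedules the worker. */
static bool worker_may_go_idle(bool saw_activity)
{
	int expected = CLOCK_REGULAR;

	if (saw_activity)
		return false;
	return atomic_compare_exchange_strong(&clock_mode, &expected,
					      CLOCK_IDLE);
}
---8<---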
Re: [PATCH 4/6] psi: introduce state_mask to represent stalled psi states
On Mon, Dec 17, 2018 at 7:55 AM Peter Zijlstra wrote: > > On Fri, Dec 14, 2018 at 09:15:06AM -0800, Suren Baghdasaryan wrote: > > The psi monitoring patches will need to determine the same states as > > record_times(). To avoid calculating them twice, maintain a state mask > > that can be consulted cheaply. Do this in a separate patch to keep the > > churn in the main feature patch at a minimum. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > include/linux/psi_types.h | 3 +++ > > kernel/sched/psi.c| 29 +++-- > > 2 files changed, 22 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h > > index 2cf422db5d18..2c6e9b67b7eb 100644 > > --- a/include/linux/psi_types.h > > +++ b/include/linux/psi_types.h > > @@ -53,6 +53,9 @@ struct psi_group_cpu { > > /* States of the tasks belonging to this group */ > > unsigned int tasks[NR_PSI_TASK_COUNTS]; > > > > + /* Aggregate pressure state derived from the tasks */ > > + u32 state_mask; > > + > > /* Period time sampling buckets for each state of interest (ns) */ > > u32 times[NR_PSI_STATES]; > > > > Since we spend so much time counting space in that line, maybe add a > note to the Changlog about how this fits. Will do. > Also, since I just had to re-count, you might want to add explicit > numbers to the psi_res and psi_states enums. Sounds reasonable. > > + if (state_mask & (1 << s)) > > We have the BIT() macro, but I'm honestly not sure that will improve > things. I was mimicking the rest of the code in psi.c that uses this kind of bit masking. Can change if you think that would be better.
Re: [PATCH 6/6] psi: introduce psi monitor
On Mon, Dec 17, 2018 at 8:22 AM Peter Zijlstra wrote: > > On Fri, Dec 14, 2018 at 09:15:08AM -0800, Suren Baghdasaryan wrote: > > +ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res, > > + enum psi_states *state, u32 *threshold_us, u32 *win_sz_us) > > +{ > > + bool some; > > + bool threshold_pct; > > + u32 threshold; > > + u32 win_sz; > > + char *p; > > + > > + p = strsep(&buf, " "); > > + if (p == NULL) > > + return -EINVAL; > > + > > + /* parse type */ > > + if (!strcmp(p, "some")) > > + some = true; > > + else if (!strcmp(p, "full")) > > + some = false; > > + else > > + return -EINVAL; > > + > > + switch (res) { > > + case (PSI_IO): > > + *state = some ? PSI_IO_SOME : PSI_IO_FULL; > > + break; > > + case (PSI_MEM): > > + *state = some ? PSI_MEM_SOME : PSI_MEM_FULL; > > + break; > > + case (PSI_CPU): > > + if (!some) > > + return -EINVAL; > > + *state = PSI_CPU_SOME; > > + break; > > + default: > > + return -EINVAL; > > + } > > + > > + while (isspace(*buf)) > > + buf++; > > + > > + p = strsep(&buf, "%"); > > + if (p == NULL) > > + return -EINVAL; > > + > > + if (buf == NULL) { > > + /* % sign was not found, threshold is specified in us */ > > + buf = p; > > + p = strsep(&buf, " "); > > + if (p == NULL) > > + return -EINVAL; > > + > > + threshold_pct = false; > > + } else > > + threshold_pct = true; > > + > > + /* parse threshold */ > > + if (kstrtouint(p, 0, &threshold)) > > + return -EINVAL; > > + > > + while (isspace(*buf)) > > + buf++; > > + > > + p = strsep(&buf, " "); > > + if (p == NULL) > > + return -EINVAL; > > + > > + /* Parse window size */ > > + if (kstrtouint(p, 0, &win_sz)) > > + return -EINVAL; > > + > > + /* Check window size */ > > + if (win_sz < PSI_TRIG_MIN_WIN_US || win_sz > PSI_TRIG_MAX_WIN_US) > > + return -EINVAL; > > + > > + if (threshold_pct) > > + threshold = (threshold * win_sz) / 100; > > + > > + /* Check threshold */ > > + if (threshold == 0 || threshold > win_sz) > > + return -EINVAL; > > + > > + *threshold_us = threshold; > > + *win_sz_us = win_sz; > > + > > + return 0; > > +} > > How well has this thing been fuzzed? Custom string parser, yay! Honestly, not much. Normal cases and some obvious corner cases. Will check if I can use some fuzzer to get more coverage or will write a script. I'm not thrilled about writing a custom parser, so if there is a better way to handle this please advise. > -- > You received this message because you are subscribed to the Google Groups > "kernel-team" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH 6/6] psi: introduce psi monitor
On Mon, Dec 17, 2018 at 8:37 AM Peter Zijlstra wrote: > > On Fri, Dec 14, 2018 at 09:15:08AM -0800, Suren Baghdasaryan wrote: > > @@ -358,28 +526,23 @@ static void psi_update_work(struct work_struct *work) > > { > > struct delayed_work *dwork; > > struct psi_group *group; > > + u64 next_update; > > > > dwork = to_delayed_work(work); > > group = container_of(dwork, struct psi_group, clock_work); > > > > /* > > + * Periodically fold the per-cpu times and feed samples > > + * into the running averages. > >*/ > > > > + psi_update(group); > > > > + /* Calculate closest update time */ > > + next_update = min(group->polling_next_update, > > + group->avg_next_update); > > + schedule_delayed_work(dwork, min(PSI_FREQ, > > + nsecs_to_jiffies(next_update - sched_clock()) + 1)); > > See, so I don't at _all_ like how there is no idle option.. Copy that. Will see what we can do to bring it back. Thanks! > > }
Re: [PATCH 6/6] psi: introduce psi monitor
Current design supports only whole percentages and if userspace needs more granularity then it has to use usecs. I agree that usecs cover % usecase and "threshold * win / 100" is simple enough for userspace to calculate. I'm fine with changing to usecs only. On Tue, Dec 18, 2018 at 9:30 AM Johannes Weiner wrote: > > On Tue, Dec 18, 2018 at 11:46:22AM +0100, Peter Zijlstra wrote: > > On Mon, Dec 17, 2018 at 05:21:05PM -0800, Suren Baghdasaryan wrote: > > > On Mon, Dec 17, 2018 at 8:22 AM Peter Zijlstra > > > wrote: > > > > > > How well has this thing been fuzzed? Custom string parser, yay! > > > > > > Honestly, not much. Normal cases and some obvious corner cases. Will > > > check if I can use some fuzzer to get more coverage or will write a > > > script. > > > I'm not thrilled about writing a custom parser, so if there is a > > > better way to handle this please advise. > > > > The grammar seems fairly simple, something like: > > > > some-full = "some" | "full" ; > > threshold-abs = integer ; > > threshold-pct = integer, { "%" } ; > > threshold = threshold-abs | threshold-pct ; > > window = integer ; > > trigger = some-full, space, threshold, space, window ; > > > > And that could even be expressed as two scanf formats: > > > > "%4s %u%% %u" , "%4s %u %u" > > > > which then gets your something like: > > > > char type[5]; > > > > if (sscanf(input, "%4s %u%% %u", &type, &pct, &window) == 3) { > > // do pct thing > > } else if (sscanf(intput, "%4s %u %u", &type, &thres, &window) == 3) { > > // do abs thing > > } else return -EFAIL; > > > > if (!strcmp(type, "some")) { > > // some > > } else if (!strcmp(type, "full")) { > > // full > > } else return -EFAIL; > > > > // do more > > We might want to drop the percentage notation. > > While it's somewhat convenient, it's also not unreasonable to ask > userspace to do a simple "threshold * win / 100" themselves, and it > would simplify the interface spec and the parser. > > Sure, psi outputs percentages, but only for fixed window sizes, so > that actually saves us something, whereas this parser here needs to > take a fractional anyway. The output is also in decimal notation, > which is necessary for granularity. And I really don't think we want > to add float parsing on top of this interface spec. > > So neither the convenience nor the symmetry argument are very > compelling IMO. It might be better to just not go there.
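A fleshed-out, userspace-compilable rendering of the usecs-only sscanf parsing sketched above might look like the following; the function name and limits are illustrative (the window bounds follow the documented 500ms..10s range) and this is not the code that was eventually posted:

---8<---
#include <stdio.h>
#include <string.h>

#define WIN_MIN_US  500000	/* 500ms, per the documented window range */
#define WIN_MAX_US  10000000	/* 10s */

/* Parse "<some|full> <threshold_us> <window_us>" into its parts. */
static int trigger_parse(const char *buf, int *full,
			 unsigned int *threshold_us, unsigned int *win_us)
{
	char type[5];

	if (sscanf(buf, "%4s %u %u", type, threshold_us, win_us) != 3)
		return -1;

	if (!strcmp(type, "some"))
		*full = 0;
	else if (!strcmp(type, "full"))
		*full = 1;
	else
		return -1;

	if (*win_us < WIN_MIN_US || *win_us > WIN_MAX_US)
		return -1;
	if (*threshold_us == 0 || *threshold_us > *win_us)
		return -1;

	return 0;
}
---8<---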
Re: [PATCH 6/6] psi: introduce psi monitor
On Tue, Dec 18, 2018 at 11:18 AM Joel Fernandes wrote: > > On Tue, Dec 18, 2018 at 9:58 AM 'Suren Baghdasaryan' via kernel-team > wrote: > > > > Current design supports only whole percentages and if userspace needs > > more granularity then it has to use usecs. > > I agree that usecs cover % usecase and "threshold * win / 100" is > > simple enough for userspace to calculate. I'm fine with changing to > > usecs only. > > Suren, please avoid top-posting to LKML. Sorry, did that by accident. > Also I was going to say the same thing, just usecs only is better. Thanks for the input. > thanks, > > - Joel
[PATCH 5/6] psi: rename psi fields in preparation for psi trigger addition
Renaming psi_group structure member fields used for calculating psi totals and averages for clear distinction between them and trigger-related fields that will be added next. Signed-off-by: Suren Baghdasaryan --- include/linux/psi_types.h | 15 --- kernel/sched/psi.c| 26 ++ 2 files changed, 22 insertions(+), 19 deletions(-) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 2c6e9b67b7eb..11b32b3395a2 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -69,20 +69,21 @@ struct psi_group_cpu { }; struct psi_group { - /* Protects data updated during an aggregation */ - struct mutex stat_lock; + /* Protects data used by the aggregator */ + struct mutex update_lock; /* Per-cpu task state & time tracking */ struct psi_group_cpu __percpu *pcpu; - /* Periodic aggregation state */ - u64 total_prev[NR_PSI_STATES - 1]; - u64 last_update; - u64 next_update; struct delayed_work clock_work; - /* Total stall times and sampled pressure averages */ + /* Total stall times observed */ u64 total[NR_PSI_STATES - 1]; + + /* Running pressure averages */ + u64 avg_total[NR_PSI_STATES - 1]; + u64 avg_last_update; + u64 avg_next_update; unsigned long avg[NR_PSI_STATES - 1][3]; }; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 153c0624976b..694edefdd333 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -172,9 +172,9 @@ static void group_init(struct psi_group *group) for_each_possible_cpu(cpu) seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); - group->next_update = sched_clock() + psi_period; + group->avg_next_update = sched_clock() + psi_period; INIT_DELAYED_WORK(&group->clock_work, psi_update_work); - mutex_init(&group->stat_lock); + mutex_init(&group->update_lock); } void __init psi_init(void) @@ -268,7 +268,7 @@ static void update_stats(struct psi_group *group) int cpu; int s; - mutex_lock(&group->stat_lock); + mutex_lock(&group->update_lock); /* * Collect the per-cpu time buckets and average them into a @@ -309,7 +309,7 @@ static void update_stats(struct psi_group *group) /* avgX= */ now = sched_clock(); - expires = group->next_update; + expires = group->avg_next_update; if (now < expires) goto out; @@ -320,14 +320,14 @@ static void update_stats(struct psi_group *group) * But the deltas we sample out of the per-cpu buckets above * are based on the actual time elapsing between clock ticks. 
*/ - group->next_update = expires + psi_period; - period = now - group->last_update; - group->last_update = now; + group->avg_next_update = expires + psi_period; + period = now - group->avg_last_update; + group->avg_last_update = now; for (s = 0; s < NR_PSI_STATES - 1; s++) { u32 sample; - sample = group->total[s] - group->total_prev[s]; + sample = group->total[s] - group->avg_total[s]; /* * Due to the lockless sampling of the time buckets, * recorded time deltas can slip into the next period, @@ -347,11 +347,11 @@ static void update_stats(struct psi_group *group) */ if (sample > period) sample = period; - group->total_prev[s] += sample; + group->avg_total[s] += sample; calc_avgs(group->avg[s], sample, period); } out: - mutex_unlock(&group->stat_lock); + mutex_unlock(&group->update_lock); } static void psi_update_work(struct work_struct *work) @@ -375,8 +375,10 @@ static void psi_update_work(struct work_struct *work) update_stats(group); now = sched_clock(); - if (group->next_update > now) - delay = nsecs_to_jiffies(group->next_update - now) + 1; + if (group->avg_next_update > now) { + delay = nsecs_to_jiffies( + group->avg_next_update - now) + 1; + } schedule_delayed_work(dwork, delay); } -- 2.20.0.405.gbc1bbc6f85-goog
[PATCH 2/6] kernel: cgroup: add poll file operation
From: Johannes Weiner Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Signed-off-by: Johannes Weiner Signed-off-by: Suren Baghdasaryan --- include/linux/cgroup-defs.h | 4 kernel/cgroup/cgroup.c | 12 2 files changed, 16 insertions(+) diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 5e1694fe035b..6f9ea8601421 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -32,6 +32,7 @@ struct kernfs_node; struct kernfs_ops; struct kernfs_open_file; struct seq_file; +struct poll_table_struct; #define MAX_CGROUP_TYPE_NAMELEN 32 #define MAX_CGROUP_ROOT_NAMELEN 64 @@ -573,6 +574,9 @@ struct cftype { ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); + __poll_t (*poll)(struct kernfs_open_file *of, +struct poll_table_struct *pt); + #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lock_class_key lockdep_key; #endif diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6aaf5dd5383b..ffcd7483b8ee 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3499,6 +3499,16 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf, return ret ?: nbytes; } +static __poll_t cgroup_file_poll(struct kernfs_open_file *of, poll_table *pt) +{ + struct cftype *cft = of->kn->priv; + + if (cft->poll) + return cft->poll(of, pt); + + return kernfs_generic_poll(of, pt); +} + static void *cgroup_seqfile_start(struct seq_file *seq, loff_t *ppos) { return seq_cft(seq)->seq_start(seq, ppos); @@ -3537,6 +3547,7 @@ static struct kernfs_ops cgroup_kf_single_ops = { .open = cgroup_file_open, .release= cgroup_file_release, .write = cgroup_file_write, + .poll = cgroup_file_poll, .seq_show = cgroup_seqfile_show, }; @@ -3545,6 +3556,7 @@ static struct kernfs_ops cgroup_kf_ops = { .open = cgroup_file_open, .release= cgroup_file_release, .write = cgroup_file_write, + .poll = cgroup_file_poll, .seq_start = cgroup_seqfile_start, .seq_next = cgroup_seqfile_next, .seq_stop = cgroup_seqfile_stop, -- 2.20.0.405.gbc1bbc6f85-goog
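For context on how the new callback would be consumed, here is a sketch of a controller-side cftype using it; the file name and handlers are placeholders of my own, the real consumer being the cgroup pressure files added later in the series:

---8<---
/* Kernel-style fragment only; surrounding controller code is assumed. */
static int example_pressure_show(struct seq_file *sf, void *v)
{
	seq_puts(sf, "example output\n");	/* placeholder */
	return 0;
}

static __poll_t example_pressure_poll(struct kernfs_open_file *of,
				      struct poll_table_struct *pt)
{
	/* A real implementation would consult per-fd trigger state here
	 * and only fall back to the default behavior otherwise. */
	return kernfs_generic_poll(of, pt);
}

static struct cftype example_files[] = {
	{
		.name = "example.pressure",
		.seq_show = example_pressure_show,
		.poll = example_pressure_poll,
	},
	{ }	/* terminating entry */
};
---8<---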
[PATCH 1/6] fs: kernfs: add poll file operation
From: Johannes Weiner Kernfs has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Signed-off-by: Johannes Weiner Signed-off-by: Suren Baghdasaryan --- fs/kernfs/file.c | 31 --- include/linux/kernfs.h | 6 ++ 2 files changed, 26 insertions(+), 11 deletions(-) diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index dbf5bc250bfd..2d8b91f4475d 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -832,26 +832,35 @@ void kernfs_drain_open_files(struct kernfs_node *kn) * to see if it supports poll (Neither 'poll' nor 'select' return * an appropriate error code). When in doubt, set a suitable timeout value. */ +__poll_t kernfs_generic_poll(struct kernfs_open_file *of, poll_table *wait) +{ + struct kernfs_node *kn = kernfs_dentry_node(of->file->f_path.dentry); + struct kernfs_open_node *on = kn->attr.open; + + poll_wait(of->file, &on->poll, wait); + + if (of->event != atomic_read(&on->event)) + return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI; + + return DEFAULT_POLLMASK; +} + static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait) { struct kernfs_open_file *of = kernfs_of(filp); struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry); - struct kernfs_open_node *on = kn->attr.open; + __poll_t ret; if (!kernfs_get_active(kn)) - goto trigger; + return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI; - poll_wait(filp, &on->poll, wait); + if (kn->attr.ops->poll) + ret = kn->attr.ops->poll(of, wait); + else + ret = kernfs_generic_poll(of, wait); kernfs_put_active(kn); - - if (of->event != atomic_read(&on->event)) - goto trigger; - - return DEFAULT_POLLMASK; - - trigger: - return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI; + return ret; } static void kernfs_notify_workfn(struct work_struct *work) diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h index 5b36b1287a5a..0cac1207bb00 100644 --- a/include/linux/kernfs.h +++ b/include/linux/kernfs.h @@ -25,6 +25,7 @@ struct seq_file; struct vm_area_struct; struct super_block; struct file_system_type; +struct poll_table_struct; struct kernfs_open_node; struct kernfs_iattrs; @@ -261,6 +262,9 @@ struct kernfs_ops { ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes, loff_t off); + __poll_t (*poll)(struct kernfs_open_file *of, +struct poll_table_struct *pt); + int (*mmap)(struct kernfs_open_file *of, struct vm_area_struct *vma); #ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -350,6 +354,8 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name, int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent, const char *new_name, const void *new_ns); int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr); +__poll_t kernfs_generic_poll(struct kernfs_open_file *of, +struct poll_table_struct *pt); void kernfs_notify(struct kernfs_node *kn); const void *kernfs_super_ns(struct super_block *sb); -- 2.20.0.405.gbc1bbc6f85-goog
[PATCH 3/6] psi: eliminate lazy clock mode
From: Johannes Weiner psi currently stops its periodic 2s aggregation runs when there has not been any task activity, and wakes it back up later from the scheduler when the system returns from the idle state. The coordination between the aggregation worker and the scheduler is minimal: the scheduler has to nudge the worker if it's not running, and the worker will reschedule itself periodically until it detects no more activity. The polling patches will complicate this, because they introduce another aggregation mode for high-frequency polling that also eventually times out if the worker sees no more activity of interest. That means the scheduler portion would have to coordinate three state transitions - idle to regular, regular to polling, idle to polling - with the worker's timeouts and self-rescheduling. The additional overhead from this is undesirable in the scheduler hotpath. Eliminate the idle mode and keep the worker doing 2s update intervals at all times. This eliminates worker coordination from the scheduler completely. The polling patches will then add it back to switch between regular mode and high-frequency polling mode. Signed-off-by: Johannes Weiner Signed-off-by: Suren Baghdasaryan --- kernel/sched/psi.c | 55 +++--- 1 file changed, 22 insertions(+), 33 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index fe24de3fbc93..d2b9c9a1a62f 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -248,18 +248,10 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) } } -static void calc_avgs(unsigned long avg[3], int missed_periods, - u64 time, u64 period) +static void calc_avgs(unsigned long avg[3], u64 time, u64 period) { unsigned long pct; - /* Fill in zeroes for periods of no activity */ - if (missed_periods) { - avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods); - avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods); - avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods); - } - /* Sample the most recent active period */ pct = div_u64(time * 100, period); pct *= FIXED_1; @@ -268,10 +260,9 @@ static void calc_avgs(unsigned long avg[3], int missed_periods, avg[2] = calc_load(avg[2], EXP_300s, pct); } -static bool update_stats(struct psi_group *group) +static void update_stats(struct psi_group *group) { u64 deltas[NR_PSI_STATES - 1] = { 0, }; - unsigned long missed_periods = 0; unsigned long nonidle_total = 0; u64 now, expires, period; int cpu; @@ -321,8 +312,6 @@ static bool update_stats(struct psi_group *group) expires = group->next_update; if (now < expires) goto out; - if (now - expires > psi_period) - missed_periods = div_u64(now - expires, psi_period); /* * The periodic clock tick can get delayed for various @@ -331,8 +320,8 @@ static bool update_stats(struct psi_group *group) * But the deltas we sample out of the per-cpu buckets above * are based on the actual time elapsing between clock ticks. 
*/ - group->next_update = expires + ((1 + missed_periods) * psi_period); - period = now - (group->last_update + (missed_periods * psi_period)); + group->next_update = expires + psi_period; + period = now - group->last_update; group->last_update = now; for (s = 0; s < NR_PSI_STATES - 1; s++) { @@ -359,18 +348,18 @@ static bool update_stats(struct psi_group *group) if (sample > period) sample = period; group->total_prev[s] += sample; - calc_avgs(group->avg[s], missed_periods, sample, period); + calc_avgs(group->avg[s], sample, period); } out: mutex_unlock(&group->stat_lock); - return nonidle_total; } static void psi_update_work(struct work_struct *work) { struct delayed_work *dwork; struct psi_group *group; - bool nonidle; + unsigned long delay = 0; + u64 now; dwork = to_delayed_work(work); group = container_of(dwork, struct psi_group, clock_work); @@ -383,17 +372,12 @@ static void psi_update_work(struct work_struct *work) * go - see calc_avgs() and missed_periods. */ - nonidle = update_stats(group); - - if (nonidle) { - unsigned long delay = 0; - u64 now; + update_stats(group); - now = sched_clock(); - if (group->next_update > now) - delay = nsecs_to_jiffies(group->next_update - now) + 1; - schedule_delayed_work(dwork, delay); - } + now = sched_clock(); + if (group->next_update > now)
[PATCH 6/6] psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure detection mechanism configurable by users. It allows users to monitor psi metrics growth and trigger events whenever a metric raises above user-defined threshold within user-defined time window. Time window is expressed in usecs and threshold can be expressed in usecs or percentages of the tracking window. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when system enters stall state for the monitored psi metric and deactivate upon exit from the stall state. While system is in the stall state psi signal growth is monitored at a rate of 10 times per tracking window. Min window size is 500ms, therefore the min monitoring interval is 50ms. Max window size is 10s with monitoring interval of 1s. When activated psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window. Signed-off-by: Suren Baghdasaryan --- Documentation/accounting/psi.txt | 105 +++ include/linux/psi.h | 10 + include/linux/psi_types.h| 72 + kernel/cgroup/cgroup.c | 107 ++- kernel/sched/psi.c | 510 +-- 5 files changed, 774 insertions(+), 30 deletions(-) diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt index b8ca28b60215..b006cc84ad44 100644 --- a/Documentation/accounting/psi.txt +++ b/Documentation/accounting/psi.txt @@ -63,6 +63,108 @@ tracked and exported as well, to allow detection of latency spikes which wouldn't necessarily make a dent in the time averages, or to average trends over custom time frames. +Monitoring for pressure thresholds +== + +Users can register triggers and use poll() to be woken up when resource +pressure exceeds certain thresholds. + +A trigger describes the maximum cumulative stall time over a specific +time window, e.g. 100ms of total stall time within any 500ms window to +generate a wakeup event. + +To register a trigger user has to open psi interface file under +/proc/pressure/ representing the resource to be monitored and write the +desired threshold and time window. The open file descriptor should be +used to wait for trigger events using select(), poll() or epoll(). +The following format is used: + + + +For example writing "some 15% 100" or "some 15 100" into +/proc/pressure/memory would add 15% (150ms) threshold for partial memory +stall measured within 1sec time window. Writing "full 5% 100" or +"full 5 100" into /proc/pressure/io would add 5% (50ms) threshold +for full io stall measured within 1sec time window. + +Triggers can be set on more than one psi metric and more than one trigger +for the same psi metric can be specified. However for each trigger a separate +file descriptor is required to be able to poll it separately from others, +therefore for each trigger a separate open() syscall should be made even +when opening the same psi interface file. + +Monitors activate only when system enters stall state for the monitored +psi metric and deactivates upon exit from the stall state. While system is +in the stall state psi signal growth is monitored at a rate of 10 times per +tracking window. + +The kernel accepts window sizes ranging from 500ms to 10s, therefore min +monitoring update interval is 50ms and max is 1s. 
+ +When activated, psi monitor stays active for at least the duration of one +tracking window to avoid repeated activations/deactivations when system is +bouncing in and out of the stall state. + +Notifications to the userspace are rate-limited to one per tracking window. + +The trigger will de-register when the file descriptor used to define the +trigger is closed. + +Userspace monitor usage example +=== + +#include +#include +#include +#include +#include +#include + +/* + * Monitor memory partial stall with 1s tracking window size + * and 15% (150ms) threshold. + */ +int main() { + const char trig[] = "some 15% 100"; + struct pollfd fds; + int n; + + fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK); + if (fds.fd < 0) { + printf("/proc/pressure/memory open error: %s\n", + strerror(errno)); + return 1; + } + fds.events = POLLPRI; + + if (write(fds.fd, trig, strlen(trig) + 1) < 0) { + printf("/proc/pressure/memory write error: %s\n", + strerror(errno)); + return 1; + } + + printf("waiting for events...\n"); + while (1) { + n = poll(&fds, 1, -1); + if (n <
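The #include lines in the example above lost their header names somewhere in transit, and the polling loop is cut off; below is a self-contained version of the same monitor, with the headers it needs restored and the tail of the loop reconstructed from the behavior the surrounding text describes (POLLPRI on a trigger firing), so treat the error handling details as an approximation:

---8<---
/* Same monitor loop as in the documentation example above, with the
 * header names restored. Trigger string and file path are unchanged. */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 15% 100";
	struct pollfd fds;
	int n;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		printf("/proc/pressure/memory open error: %s\n",
			strerror(errno));
		return 1;
	}
	fds.events = POLLPRI;

	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		printf("/proc/pressure/memory write error: %s\n",
			strerror(errno));
		return 1;
	}

	printf("waiting for events...\n");
	while (1) {
		n = poll(&fds, 1, -1);
		if (n < 0) {
			printf("poll error: %s\n", strerror(errno));
			return 1;
		}
		if (fds.revents & POLLERR) {
			printf("event source is gone\n");
			return 0;
		}
		if (fds.revents & POLLPRI)
			printf("event triggered!\n");
	}
	return 0;
}
---8<---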
[PATCH 4/6] psi: introduce state_mask to represent stalled psi states
The psi monitoring patches will need to determine the same states as record_times(). To avoid calculating them twice, maintain a state mask that can be consulted cheaply. Do this in a separate patch to keep the churn in the main feature patch at a minimum. Signed-off-by: Suren Baghdasaryan --- include/linux/psi_types.h | 3 +++ kernel/sched/psi.c| 29 +++-- 2 files changed, 22 insertions(+), 10 deletions(-) diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 2cf422db5d18..2c6e9b67b7eb 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -53,6 +53,9 @@ struct psi_group_cpu { /* States of the tasks belonging to this group */ unsigned int tasks[NR_PSI_TASK_COUNTS]; + /* Aggregate pressure state derived from the tasks */ + u32 state_mask; + /* Period time sampling buckets for each state of interest (ns) */ u32 times[NR_PSI_STATES]; diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index d2b9c9a1a62f..153c0624976b 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -212,17 +212,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state) static void get_recent_times(struct psi_group *group, int cpu, u32 *times) { struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); - unsigned int tasks[NR_PSI_TASK_COUNTS]; u64 now, state_start; + enum psi_states s; unsigned int seq; - int s; + u32 state_mask; /* Snapshot a coherent view of the CPU state */ do { seq = read_seqcount_begin(&groupc->seq); now = cpu_clock(cpu); memcpy(times, groupc->times, sizeof(groupc->times)); - memcpy(tasks, groupc->tasks, sizeof(groupc->tasks)); + state_mask = groupc->state_mask; state_start = groupc->state_start; } while (read_seqcount_retry(&groupc->seq, seq)); @@ -238,7 +238,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times) * (u32) and our reported pressure close to what's * actually happening. 
*/ - if (test_state(tasks, s)) + if (state_mask & (1 << s)) times[s] += now - state_start; delta = times[s] - groupc->times_prev[s]; @@ -390,15 +390,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, delta = now - groupc->state_start; groupc->state_start = now; - if (test_state(groupc->tasks, PSI_IO_SOME)) { + if (groupc->state_mask & (1 << PSI_IO_SOME)) { groupc->times[PSI_IO_SOME] += delta; - if (test_state(groupc->tasks, PSI_IO_FULL)) + if (groupc->state_mask & (1 << PSI_IO_FULL)) groupc->times[PSI_IO_FULL] += delta; } - if (test_state(groupc->tasks, PSI_MEM_SOME)) { + if (groupc->state_mask & (1 << PSI_MEM_SOME)) { groupc->times[PSI_MEM_SOME] += delta; - if (test_state(groupc->tasks, PSI_MEM_FULL)) + if (groupc->state_mask & (1 << PSI_MEM_FULL)) groupc->times[PSI_MEM_FULL] += delta; else if (memstall_tick) { u32 sample; @@ -419,10 +419,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, } } - if (test_state(groupc->tasks, PSI_CPU_SOME)) + if (groupc->state_mask & (1 << PSI_CPU_SOME)) groupc->times[PSI_CPU_SOME] += delta; - if (test_state(groupc->tasks, PSI_NONIDLE)) + if (groupc->state_mask & (1 << PSI_NONIDLE)) groupc->times[PSI_NONIDLE] += delta; } @@ -431,6 +431,8 @@ static void psi_group_change(struct psi_group *group, int cpu, { struct psi_group_cpu *groupc; unsigned int t, m; + enum psi_states s; + u32 state_mask = 0; groupc = per_cpu_ptr(group->pcpu, cpu); @@ -463,6 +465,13 @@ static void psi_group_change(struct psi_group *group, int cpu, if (set & (1 << t)) groupc->tasks[t]++; + /* Calculate state mask representing active states */ + for (s = 0; s < NR_PSI_STATES; s++) { + if (test_state(groupc->tasks, s)) + state_mask |= (1 << s); + } + groupc->state_mask = state_mask; + write_seqcount_end(&groupc->seq); } -- 2.20.0.405.gbc1bbc6f85-goog
[PATCH 0/6] psi: pressure stall monitors
Android is adopting psi to detect and remedy memory pressure that results in stuttering and decreased responsiveness on mobile devices. Psi gives us the stall information, but because we're dealing with latencies in the millisecond range, periodically reading the pressure files to detect stalls in a timely fashion is not feasible. Psi also doesn't aggregate its averages at a high-enough frequency right now. This patch series extends the psi interface such that users can configure sensitive latency thresholds and use poll() and friends to be notified when these are breached. As high-frequency aggregation is costly, it implements an aggregation method that is optimized for fast, short-interval averaging, and makes the aggregation frequency adaptive, such that high-frequency updates only happen while monitored stall events are actively occurring. With these patches applied, Android can monitor for, and ward off, mounting memory shortages before they cause problems for the user. For example, using memory stall monitors in userspace low memory killer daemon (lmkd) we can detect mounting pressure and kill less important processes before device becomes visibly sluggish. In our memory stress testing psi memory monitors produce roughly 10x less false positives compared to vmpressure signals. Having ability to specify multiple triggers for the same psi metric allows other parts of Android framework to monitor memory state of the device and act accordingly. The new interface is straight-forward. The user opens one of the pressure files for writing and writes a trigger description into the file descriptor that defines the stall state - some or full, and the maximum stall time over a given window of time. E.g.: /* Signal when stall time exceeds 100ms of a 1s window */ char trigger[] = "full 10 100" fd = open("/proc/pressure/memory") write(fd, trigger, sizeof(trigger)) while (poll() >= 0) { ... }; close(fd); When the monitored stall state is entered, psi adapts its aggregation frequency according to what the configured time window requires in order to emit event signals in a timely fashion. Once the stalling subsides, aggregation reverts back to normal. The trigger is associated with the open file descriptor. To stop monitoring, the user only needs to close the file descriptor and the trigger is discarded. Patches 1-5 prepare the psi code for polling support. Patch 6 implements the adaptive polling logic, the pressure growth detection optimized for short intervals, and hooks up write() and poll() on the pressure files. The patches were developed in collaboration with Johannes Weiner. The patches are based on 4.20-rc6. Johannes Weiner (3): fs: kernfs: add poll file operation kernel: cgroup: add poll file operation psi: eliminate lazy clock mode Suren Baghdasaryan (3): psi: introduce state_mask to represent stalled psi states psi: rename psi fields in preparation for psi trigger addition psi: introduce psi monitor Documentation/accounting/psi.txt | 105 ++ fs/kernfs/file.c | 31 +- include/linux/cgroup-defs.h | 4 + include/linux/kernfs.h | 6 + include/linux/psi.h | 10 + include/linux/psi_types.h| 90 - kernel/cgroup/cgroup.c | 119 ++- kernel/sched/psi.c | 586 +++ 8 files changed, 865 insertions(+), 86 deletions(-) -- 2.20.0.405.gbc1bbc6f85-goog
Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
Hi Daniel, On Sun, Sep 16, 2018 at 10:22 PM, Daniel Drake wrote: > Hi Suren > > On Fri, Sep 7, 2018 at 11:58 PM, Suren Baghdasaryan wrote: >> Thanks for the new patchset! Backported to 4.9 and retested on ARMv8 8 >> code system running Android. Signals behave as expected reacting to >> memory pressure, no jumps in "total" counters that would indicate an >> overflow/underflow issues. Nicely done! > > Can you share your Linux v4.9 psi backport somewhere? > Absolutely. Let me figure out what's the best way to do share that and make sure they apply cleanly on official 4.9 (I was using vendor's tree for testing). Will need a day or so to get this done. In case you need them sooner, there were several "prerequisite" patches that I had to backport to make PSI backporting easier/possible. Following is the list as shown by "git log --oneline": PSI patches: ef94c067f360 psi: cgroup support 60081a7aeb0b psi: pressure stall information for CPU, memory, and IO acd2a16497e9 sched: introduce this_rq_lock_irq() f30268c29309 sched: sched.h: make rq locking and clock functions available in stats.h a2fd1c94b743 sched: loadavg: make calc_load_n() public 32a74dec4967 sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD 8e3991dd1a73 delayacct: track delays from thrashing cache pages 4ae940e7e6ff mm: workingset: tell cache transitions from workingset thrashing e9ccd63399e0 mm: workingset: don't drop refault information prematurely Prerequisites: b5a58c778c54 workqueue: make workqueue available early during boot ae5f39ee13b5 sched/core: Add wrappers for lockdep_(un)pin_lock() 7276f98a72c1 sched/headers, delayacct: Move the 'struct task_delay_info' definition from to 287318d13688 mm: add PageWaiters indicating tasks are waiting for a page bit edfa64560aaa sched/headers: Remove from f6b6ba853959 sched/headers: Move loadavg related definitions from to 395b0a9f7aae sched/headers: Prepare for new header dependencies before moving code to PSI patches needed some adjustments but nothing really major. > Thanks > Daniel Thanks, Suren.
Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4
On Mon, Sep 17, 2018 at 6:29 AM, peter enderborg wrote: > Will it be part of the backport to 4.9 google android or is it for test only? Currently I'm testing these patches in tandem with PSI monitor that I'm developing and test results look good. If things go well and we start using PSI for Android I will try to upstream the backport. If upstream rejects it we will have to merge it into Android common kernel repo as a last resort. Hope this answers your question. > I guess that this patch is to big for the LTS tree. > > On 09/07/2018 05:58 PM, Suren Baghdasaryan wrote: >> Thanks for the new patchset! Backported to 4.9 and retested on ARMv8 8 >> code system running Android. Signals behave as expected reacting to >> memory pressure, no jumps in "total" counters that would indicate an >> overflow/underflow issues. Nicely done! >> >> Tested-by: Suren Baghdasaryan >> >> On Fri, Sep 7, 2018 at 8:09 AM, Johannes Weiner wrote: >>> On Fri, Sep 07, 2018 at 01:04:07PM +0200, Peter Zijlstra wrote: >>>> So yeah, grudingly acked. Did you want me to pick this up through the >>>> scheduler tree since most of this lives there? >>> Thanks for the ack. >>> >>> As for routing it, I'll leave that decision to you and Andrew. It >>> touches stuff all over, so it could result in quite a few conflicts >>> between trees (although I don't expect any of them to be non-trivial). > > Thanks, Suren.
Re: [PATCH RFC v3 12/13] mm: add SLAB_TYPESAFE_BY_RCU to files_cache
On Mon, Aug 12, 2024 at 11:07 PM Mateusz Guzik wrote: > > On Mon, Aug 12, 2024 at 09:29:16PM -0700, Andrii Nakryiko wrote: > > Add RCU protection for file struct's backing memory by adding > > SLAB_TYPESAFE_BY_RCU flag to files_cachep. This will allow to locklessly > > access struct file's fields under RCU lock protection without having to > > take much more expensive and contended locks. > > > > This is going to be used for lockless uprobe look up in the next patch. > > > > Suggested-by: Matthew Wilcox > > Signed-off-by: Andrii Nakryiko > > --- > > kernel/fork.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 76ebafb956a6..91ecc32a491c 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -3157,8 +3157,8 @@ void __init proc_caches_init(void) > > NULL); > > files_cachep = kmem_cache_create("files_cache", > > sizeof(struct files_struct), 0, > > - SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > > - NULL); > > + SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| > > + SLAB_ACCOUNT, NULL); > > fs_cachep = kmem_cache_create("fs_cache", > > sizeof(struct fs_struct), 0, > > SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, > > Did you mean to add it to the cache backing 'struct file' allocations? > > That cache is created in fs/file_table.c and already has the flag: > filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, > SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN | > SLAB_PANIC | SLAB_ACCOUNT, NULL); Oh, I completely missed the SLAB_TYPESAFE_BY_RCU for this cache, and here I was telling Andrii that it's RCU unsafe to access vma->vm_file... Mea culpa. > > The cache you are modifying in this patch contains the fd array et al > and is of no consequence to "uprobes: add speculative lockless VMA to > inode resolution". > > iow this patch needs to be dropped I believe you are correct.
Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik wrote: > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote: > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection, > > attempting uprobe look up speculatively. > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to > > validate that mm_struct stays intact for entire duration of this > > speculation. If not, we fall back to mmap_lock-protected lookup. > > > > This allows to avoid contention on mmap_lock in absolutely majority of > > cases, nicely improving uprobe/uretprobe scalability. > > > > Here I have to admit to being mostly ignorant about the mm, so bear with > me. :> > > I note the result of find_active_uprobe_speculative is immediately stale > in face of modifications. > > The thing I'm after is that the mmap_lock_speculation business adds > overhead on archs where a release fence is not a de facto nop and I > don't believe the commit message justifies it. Definitely a bummer to > add merely it for uprobes. If there are bigger plans concerning it > that's a different story of course. > > With this in mind I have to ask if instead you could perhaps get away > with the already present per-vma sequence counter? per-vma sequence counter does not implement acquire/release logic, it relies on vma->vm_lock for synchronization. So if we want to use it, we would have to add additional memory barriers here. This is likely possible but as I mentioned before we would need to ensure the pagefault path does not regress. OTOH mm->mm_lock_seq already halfway there (it implements acquire/release logic), we just had to ensure mmap_write_lock() increments mm->mm_lock_seq. So, from the release fence overhead POV I think whether we use mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence here.
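To make the acquire/release point above concrete: the semantics being asked for are essentially those of a sequence-count read section, that is, sample a counter, do the lockless reads, then confirm no writer ran, otherwise fall back to the lock. Below is an illustration using the generic seqcount API only; the helpers under discussion work against mm->mm_lock_seq, which mmap_write_lock() would bump.

#include <linux/seqlock.h>

static seqcount_t demo_seq = SEQCNT_ZERO(demo_seq); /* stand-in for mm->mm_lock_seq semantics */

/* One-shot speculative read; returns false if a writer ran concurrently. */
static bool read_state_speculatively(const int *shared, int *out)
{
	unsigned int seq;
	int val;

	seq = read_seqcount_begin(&demo_seq);	/* orders the reads below after the sample */
	val = READ_ONCE(*shared);		/* lockless, possibly stale read */
	if (read_seqcount_retry(&demo_seq, seq))
		return false;			/* caller falls back to the locked path */

	*out = val;
	return true;
}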
Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko wrote: > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik wrote: > > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote: > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik wrote: > > > > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote: > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection, > > > > > attempting uprobe look up speculatively. > > > > > > > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to > > > > > validate that mm_struct stays intact for entire duration of this > > > > > speculation. If not, we fall back to mmap_lock-protected lookup. > > > > > > > > > > This allows to avoid contention on mmap_lock in absolutely majority of > > > > > cases, nicely improving uprobe/uretprobe scalability. > > > > > > > > > > > > > Here I have to admit to being mostly ignorant about the mm, so bear with > > > > me. :> > > > > > > > > I note the result of find_active_uprobe_speculative is immediately stale > > > > in face of modifications. > > > > > > > > The thing I'm after is that the mmap_lock_speculation business adds > > > > overhead on archs where a release fence is not a de facto nop and I > > > > don't believe the commit message justifies it. Definitely a bummer to > > > > add merely it for uprobes. If there are bigger plans concerning it > > > > that's a different story of course. > > > > > > > > With this in mind I have to ask if instead you could perhaps get away > > > > with the already present per-vma sequence counter? > > > > > > per-vma sequence counter does not implement acquire/release logic, it > > > relies on vma->vm_lock for synchronization. So if we want to use it, > > > we would have to add additional memory barriers here. This is likely > > > possible but as I mentioned before we would need to ensure the > > > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway > > > there (it implements acquire/release logic), we just had to ensure > > > mmap_write_lock() increments mm->mm_lock_seq. > > > > > > So, from the release fence overhead POV I think whether we use > > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence > > > here. > > > > > > > Per my previous e-mail I'm not particularly familiar with mm internals, > > so I'm going to handwave a little bit with my $0,03 concerning multicore > > in general and if you disagree with it that's your business. For the > > time being I have no interest in digging into any of this. > > > > Before I do, to prevent this thread from being a total waste, here are > > some remarks concerning the patch with the assumption that the core idea > > lands. > > > > From the commit message: > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection, > > > attempting uprobe look up speculatively. > > > > Just in case I'll note a nit that this paragraph will need to be removed > > since the patch adding the flag is getting dropped. > > Yep, of course, I'll update all that for the next revision (I'll wait > for non-RFC patches to land first before reposting). > > > > > A non-nit which may or may not end up mattering is that the flag (which > > *is* set on the filep slab cache) makes things more difficult to > > validate. 
Normal RCU usage guarantees that the object itself wont be > > freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU > > flag weakens it significantly -- the thing at hand will always be a > > 'struct file', but it may get reallocated to *another* file from under > > you. Whether this aspect plays a role here I don't know. > > Yes, that's ok and is accounted for. We care about that memory not > going even from under us (I'm not even sure if it matters that it is > still a struct file, tbh; I think that shouldn't matter as we are > prepared to deal with completely garbage values read from struct > file). Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that vma->vm
Re: [PATCH RFC v3 13/13] uprobes: add speculative lockless VMA to inode resolution
On Thu, Aug 15, 2024 at 11:58 AM Jann Horn wrote: > > +brauner for "struct file" lifetime > > On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan wrote: > > On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko > > wrote: > > > > > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik wrote: > > > > > > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote: > > > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik > > > > > wrote: > > > > > > > > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote: > > > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely > > > > > > > access > > > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() > > > > > > > protection, > > > > > > > attempting uprobe look up speculatively. > > Stupid question: Is this uprobe stuff actually such a hot codepath > that it makes sense to optimize it to be faster than the page fault > path? > > (Sidenote: I find it kinda interesting that this is sort of going back > in the direction of the old Speculative Page Faults design.) > > > > > > > > We rely on newly added mmap_lock_speculation_{start,end}() > > > > > > > helpers to > > > > > > > validate that mm_struct stays intact for entire duration of this > > > > > > > speculation. If not, we fall back to mmap_lock-protected lookup. > > > > > > > > > > > > > > This allows to avoid contention on mmap_lock in absolutely > > > > > > > majority of > > > > > > > cases, nicely improving uprobe/uretprobe scalability. > > > > > > > > > > > > > > > > > > > Here I have to admit to being mostly ignorant about the mm, so bear > > > > > > with > > > > > > me. :> > > > > > > > > > > > > I note the result of find_active_uprobe_speculative is immediately > > > > > > stale > > > > > > in face of modifications. > > > > > > > > > > > > The thing I'm after is that the mmap_lock_speculation business adds > > > > > > overhead on archs where a release fence is not a de facto nop and I > > > > > > don't believe the commit message justifies it. Definitely a bummer > > > > > > to > > > > > > add merely it for uprobes. If there are bigger plans concerning it > > > > > > that's a different story of course. > > > > > > > > > > > > With this in mind I have to ask if instead you could perhaps get > > > > > > away > > > > > > with the already present per-vma sequence counter? > > > > > > > > > > per-vma sequence counter does not implement acquire/release logic, it > > > > > relies on vma->vm_lock for synchronization. So if we want to use it, > > > > > we would have to add additional memory barriers here. This is likely > > > > > possible but as I mentioned before we would need to ensure the > > > > > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway > > > > > there (it implements acquire/release logic), we just had to ensure > > > > > mmap_write_lock() increments mm->mm_lock_seq. > > > > > > > > > > So, from the release fence overhead POV I think whether we use > > > > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence > > > > > here. > > > > > > > > > > > > > Per my previous e-mail I'm not particularly familiar with mm internals, > > > > so I'm going to handwave a little bit with my $0,03 concerning multicore > > > > in general and if you disagree with it that's your business. For the > > > > time being I have no interest in digging into any of this. 
> > > > > > > > Before I do, to prevent this thread from being a total waste, here are > > > > some remarks concerning the patch with the assumption that the core idea > > > > lands. > > > > > > > > From the commit message: > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protect
Re: [RFC] memory reserve for userspace oom-killer
Hi Folks, On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin wrote: > > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote: > > Proposal: Provide memory guarantees to userspace oom-killer. > > > > Background: > > > > Issues with kernel oom-killer: > > 1. Very conservative and prefer to reclaim. Applications can suffer > > for a long time. > > 2. Borrows the context of the allocator which can be resource limited > > (low sched priority or limited CPU quota). > > 3. Serialized by global lock. > > 4. Very simplistic oom victim selection policy. > > > > These issues are resolved through userspace oom-killer by: > > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to > > early detect suffering. > > 2. Independent process context which can be given dedicated CPU quota > > and high scheduling priority. > > 3. Can be more aggressive as required. > > 4. Can implement sophisticated business logic/policies. > > > > Android's LMKD and Facebook's oomd are the prime examples of userspace > > oom-killers. One of the biggest challenges for userspace oom-killers > > is to potentially function under intense memory pressure and are prone > > to getting stuck in memory reclaim themselves. Current userspace > > oom-killers aim to avoid this situation by preallocating user memory > > and protecting themselves from global reclaim by either mlocking or > > memory.min. However a new allocation from userspace oom-killer can > > still get stuck in the reclaim and policy rich oom-killer do trigger > > new allocations through syscalls or even heap. > > > > Our attempt of userspace oom-killer faces similar challenges. > > Particularly at the tail on the very highly utilized machines we have > > observed userspace oom-killer spectacularly failing in many possible > > ways in the direct reclaim. We have seen oom-killer stuck in direct > > reclaim throttling, stuck in reclaim and allocations from interrupts > > keep stealing reclaimed memory. We have even observed systems where > > all the processes were stuck in throttle_direct_reclaim() and only > > kswapd was running and the interrupts kept stealing the memory > > reclaimed by kswapd. > > > > To reliably solve this problem, we need to give guaranteed memory to > > the userspace oom-killer. At the moment we are contemplating between > > the following options and I would like to get some feedback. > > > > 1. prctl(PF_MEMALLOC) > > > > The idea is to give userspace oom-killer (just one thread which is > > finding the appropriate victims and will be sending SIGKILLs) access > > to MEMALLOC reserves. Most of the time the preallocation, mlock and > > memory.min will be good enough but for rare occasions, when the > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will > > protect it from reclaim and let the allocation dip into the memory > > reserves. > > > > The misuse of this feature would be risky but it can be limited to > > privileged applications. Userspace oom-killer is the only appropriate > > user of this feature. This option is simple to implement. > > Hello Shakeel! > > If ordinary PAGE_SIZE and smaller kernel allocations start to fail, > the system is already in a relatively bad shape. Arguably the userspace > OOM killer should kick in earlier, it's already a bit too late. I tend to agree here. This is how we are trying to avoid issues with such severe memory shortages - by tuning the killer a bit more aggressively. But a more reliable mechanism would definitely be an improvement. 
> Allowing to use reserves just pushes this even further, so we're risking > the kernel stability for no good reason. > > But I agree that throttling the oom daemon in direct reclaim makes no sense. > I wonder if we can introduce a per-task flag which will exclude the task from > throttling, but instead all (large) allocations will just fail under a > significant memory pressure more easily. In this case if there is a > significant > memory shortage the oom daemon will not be fully functional (will get -ENOMEM > for an attempt to read some stats, for example), but still will be able to > kill > some processes and make the forward progress. This sounds like a good idea to me. > But maybe it can be done in userspace too: by splitting the daemon into > a core- and extended part and avoid doing anything behind bare minimum > in the core part. > > > > > 2. Mempool > > > > The idea is to preallocate mempool with a given amount of memory for > > userspace oom-killer. Preferably this will be per-thread and > > oom-killer can preallocate mempool for its specific threads. The core > > page allocator can check before going to the reclaim path if the task > > has private access to the mempool and return page from it if yes. > > > > This option would be more complicated than the previous option as the > > lifecycle of the page from the mempool would be more sophisticated. > > Additionally the current mempool does not handle higher order pages > > and we might need to ex
Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page
Hi Michael, On Sat, Feb 13, 2021 at 2:04 PM Michael Kerrisk (man-pages) wrote: > > Hello Suren, > > On 2/2/21 11:12 PM, Suren Baghdasaryan wrote: > > Hi Michael, > > > > On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages) > > wrote: > >> > >> Hello Suren (and Minchan and Michal) > >> > >> Thank you for the revisions! > >> > >> I've applied this patch, and done a few light edits. > > > > Thanks! > > > >> > >> However, I have a questions about undocumented pieces in *madvise(2)*, > >> as well as one other question. See below. > >> > >> On 2/2/21 6:30 AM, Suren Baghdasaryan wrote: > >>> Initial version of process_madvise(2) manual page. Initial text was > >>> extracted from [1], amended after fix [2] and more details added using > >>> man pages of madvise(2) and process_vm_read(2) as examples. It also > >>> includes the changes to required permission proposed in [3]. > >>> > >>> [1] https://lore.kernel.org/patchwork/patch/1297933/ > >>> [2] https://lkml.org/lkml/2020/12/8/1282 > >>> [3] > >>> https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311 > >>> > >>> Signed-off-by: Suren Baghdasaryan > >>> Reviewed-by: Michal Hocko > >>> --- > >>> changes in v2: > >>> - Changed description of MADV_COLD per Michal Hocko's suggestion > >>> - Applied fixes suggested by Michael Kerrisk > >>> changes in v3: > >>> - Added Michal's Reviewed-by > >>> - Applied additional fixes suggested by Michael Kerrisk > >>> > >>> NAME > >>> process_madvise - give advice about use of memory to a process > >>> > >>> SYNOPSIS > >>> #include > >>> > >>> ssize_t process_madvise(int pidfd, > >>>const struct iovec *iovec, > >>>unsigned long vlen, > >>>int advice, > >>>unsigned int flags); > >>> > >>> DESCRIPTION > >>> The process_madvise() system call is used to give advice or directions > >>> to the kernel about the address ranges of another process or the > >>> calling > >>> process. It provides the advice to the address ranges described by > >>> iovec > >>> and vlen. The goal of such advice is to improve system or application > >>> performance. > >>> > >>> The pidfd argument is a PID file descriptor (see pidfd_open(2)) that > >>> specifies the process to which the advice is to be applied. > >>> > >>> The pointer iovec points to an array of iovec structures, defined in > >>> as: > >>> > >>> struct iovec { > >>> void *iov_base;/* Starting address */ > >>> size_t iov_len; /* Number of bytes to transfer */ > >>> }; > >>> > >>> The iovec structure describes address ranges beginning at iov_base > >>> address > >>> and with the size of iov_len bytes. > >>> > >>> The vlen represents the number of elements in the iovec structure. > >>> > >>> The advice argument is one of the values listed below. > >>> > >>> Linux-specific advice values > >>> The following Linux-specific advice values have no counterparts in the > >>> POSIX-specified posix_madvise(3), and may or may not have counterparts > >>> in the madvise(2) interface available on other implementations. > >>> > >>> MADV_COLD (since Linux 5.4.1) > >> > >> I just noticed these version numbers now, and thought: they can't be > >> right (because the system call appeared only in v5.11). So I removed > >> them. But, of course in another sense the version numbers are (nearly) > >> right, since these advice values were added for madvise(2) in Linux 5.4. > >> However, they are not documented in the madvise(2) manual page. 
Is it > >> correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same > >> meaning in madvise(2) (but just for the calling process, of course)? > > > > Correct. They should be added in the madvise(2) man page as well IMHO. > > So, I decided to move the d
Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"
On Wed, Apr 7, 2021 at 9:07 AM Linus Torvalds wrote: > > On Wed, Apr 7, 2021 at 6:22 AM Vlastimil Babka wrote: > > > > 1) Ignore the issue (outside of Android at least). The security model of > > zygote > > is unusual. Where else a parent of fork() doesn't trust the child, which is > > the > > same binary? > > Agreed. I think this is basically an android-only issue (with > _possibly_ some impact on crazy "pin-and-fork" loads), and doesn't > necessarily merit a backport at all. > > If Android people insist on using very old kernels, knowing that they > do things that are questionable with those old kernels, at some point > it's just _their_ problem. We don't really insist on using old kernels but rather we are stuck with them for some time. Trying my hand at backporting the patchsets Peter mentioned proved this to be far from easy with many dependencies. Let me look into Vlastimil's suggestion to backport only 17839856fd58 and it sounds like 5.4 already followed that path. Thanks for all the information! Suren. > > Linus
Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"
On Wed, Apr 7, 2021 at 12:23 PM Linus Torvalds wrote: > > On Wed, Apr 7, 2021 at 11:47 AM Mikulas Patocka wrote: > > > > So, we fixed it, but we don't know why. > > > > Peter Xu's patchset that fixed it is here: > > https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/ > > Yeah, that's the part that ends up being really painful to backport > (with all the subsequent fixes too), so the 4.14 people would prefer > to avoid it. > > But I think that if it's a "requires dax pmem and ptrace on top", it > may simply be a non-issue for those users. Although who knows - maybe > that ends up being a real issue on Android.. A lot to digest, so I need to do some reading now. Thanks everyone! > > Linus
Re: [PATCH v8 03/16] sched/core: uclamp: Enforce last task's UCLAMP_MAX
Hi Patrick, On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi wrote: > > When a task sleeps it removes its max utilization clamp from its CPU. > However, the blocked utilization on that CPU can be higher than the max > clamp value enforced while the task was running. This allows undesired > CPU frequency increases while a CPU is idle, for example, when another > CPU on the same frequency domain triggers a frequency update, since > schedutil can now see the full not clamped blocked utilization of the > idle CPU. > > Fix this by using > uclamp_rq_dec_id(p, rq, UCLAMP_MAX) > uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value) > to detect when a CPU has no more RUNNABLE clamped tasks and to flag this > condition. > If I understand the intent correctly, you are trying to exclude idle CPUs from affecting calculations of rq UCLAMP_MAX value. If that is true I think description can be simplified a bit :) In particular it took me some time to understand what "blocked utilization" means, however if it's a widely accepted term then feel free to ignore my input. > Don't track any minimum utilization clamps since an idle CPU never > requires a minimum frequency. The decay of the blocked utilization is > good enough to reduce the CPU frequency. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > > -- > Changes in v8: > Message-ID: <20190314170619.rt6yhelj3y6dzypu@e110439-lin> > - moved flag reset into uclamp_rq_inc() > --- > kernel/sched/core.c | 45 > kernel/sched/sched.h | 2 ++ > 2 files changed, 43 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 6e1beae5f348..046f61d33f00 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -754,8 +754,35 @@ static inline unsigned int uclamp_none(int clamp_id) > return SCHED_CAPACITY_SCALE; > } > > +static inline unsigned int > +uclamp_idle_value(struct rq *rq, unsigned int clamp_id, unsigned int > clamp_value) > +{ > + /* > +* Avoid blocked utilization pushing up the frequency when we go > +* idle (which drops the max-clamp) by retaining the last known > +* max-clamp. > +*/ > + if (clamp_id == UCLAMP_MAX) { > + rq->uclamp_flags |= UCLAMP_FLAG_IDLE; > + return clamp_value; > + } > + > + return uclamp_none(UCLAMP_MIN); > +} > + > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id, > +unsigned int clamp_value) > +{ > + /* Reset max-clamp retention only on idle exit */ > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE)) > + return; > + > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value); > +} > + > static inline > -unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id) > +unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id, > +unsigned int clamp_value) IMHO the name of uclamp_rq_max_value() is a bit misleading because: 1. It does not imply that it has to be called only when there are no more runnable tasks on a CPU. This is currently the case because it's called only from uclamp_rq_dec_id() and only when bucket->tasks==0 but nothing in the name of this function indicates that it can't be called from other places. 2. It does not imply that it marks rq UCLAMP_FLAG_IDLE. 
> { > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > int bucket_id = UCLAMP_BUCKETS - 1; > @@ -771,7 +798,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned > int clamp_id) > } > > /* No tasks -- default clamp values */ > - return uclamp_none(clamp_id); > + return uclamp_idle_value(rq, clamp_id, clamp_value); > } > > /* > @@ -794,6 +821,8 @@ static inline void uclamp_rq_inc_id(struct task_struct > *p, struct rq *rq, > bucket = &uc_rq->bucket[uc_se->bucket_id]; > bucket->tasks++; > > + uclamp_idle_reset(rq, clamp_id, uc_se->value); > + > /* > * Local max aggregation: rq buckets always track the max > * "requested" clamp value of its RUNNABLE tasks. > @@ -820,6 +849,7 @@ static inline void uclamp_rq_dec_id(struct task_struct > *p, struct rq *rq, > struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; > struct uclamp_se *uc_se = &p->uclamp[clamp_id]; > struct uclamp_bucket *bucket; > + unsigned int bkt_clamp; > unsigned int rq_clamp; > > bucket = &uc_rq->bucket[uc_se->bucket_id]; > @@ -848,7 +878,8 @@ static inline void uclamp_rq_dec_id(struct task_struct > *p, struct rq *rq, > * there are anymore RUNNABLE tasks refcounting it. > */ > bucket->value = uclamp_bucket_base_value(bucket->value); > - WRITE_ONCE(uc_rq->value, uclamp_rq_max_value(rq, clamp_id)); > + bkt_clamp = uclamp_rq_max_value(rq
Re: [PATCH v8 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi wrote: > > The SCHED_DEADLINE scheduling class provides an advanced and formal > model to define tasks requirements that can translate into proper > decisions for both task placements and frequencies selections. Other > classes have a more simplified model based on the POSIX concept of > priorities. > > Such a simple priority based model however does not allow to exploit > most advanced features of the Linux scheduler like, for example, driving > frequencies selection via the schedutil cpufreq governor. However, also > for non SCHED_DEADLINE tasks, it's still interesting to define tasks > properties to support scheduler decisions. > > Utilization clamping exposes to user-space a new set of per-task > attributes the scheduler can use as hints about the expected/required > utilization for a task. This allows to implement a "proactive" per-task > frequency control policy, a more advanced policy than the current one > based just on "passive" measured task utilization. For example, it's > possible to boost interactive tasks (e.g. to get better performance) or > cap background tasks (e.g. to be more energy/thermal efficient). > > Introduce a new API to set utilization clamping values for a specified > task by extending sched_setattr(), a syscall which already allows to > define task specific properties for different scheduling classes. A new > pair of attributes allows to specify a minimum and maximum utilization > the scheduler can consider for a task. > > Do that by validating the required clamp values before and then applying > the required changes using _the_ same pattern already in use for > __setscheduler(). This ensures that the task is re-enqueued with the new > clamp values. > > Do not allow to change sched class specific params and non class > specific params (i.e. clamp values) at the same time. This keeps things > simple and still works for the most common cases since we are usually > interested in just one of the two actions. Sorry, I can't find where you are checking to eliminate the possibility of simultaneous changes to both sched class specific params and non class specific params... Am I too tired or they are indeed missing? > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > > --- > Changes in v8: > Others: > - using p->uclamp_req to track clamp values "requested" from userspace > --- > include/linux/sched.h| 9 > include/uapi/linux/sched.h | 12 - > include/uapi/linux/sched/types.h | 66 > kernel/sched/core.c | 87 +++- > 4 files changed, 162 insertions(+), 12 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index d8491954e2e1..c2b81a84985b 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -585,6 +585,7 @@ struct sched_dl_entity { > * @value: clamp value "assigned" to a se > * @bucket_id: bucket index corresponding to the "assigned" value > * @active:the se is currently refcounted in a rq's bucket > + * @user_defined: the requested clamp value comes from user-space > * > * The bucket_id is the index of the clamp bucket matching the clamp value > * which is pre-computed and stored to avoid expensive integer divisions from > @@ -594,11 +595,19 @@ struct sched_dl_entity { > * which can be different from the clamp value "requested" from user-space. > * This allows to know a task is refcounted in the rq's bucket corresponding > * to the "effective" bucket_id. 
> + * > + * The user_defined bit is set whenever a task has got a task-specific clamp > + * value requested from userspace, i.e. the system defaults apply to this > task > + * just as a restriction. This allows to relax default clamps when a less > + * restrictive task-specific value has been requested, thus allowing to > + * implement a "nice" semantic. For example, a task running with a 20% > + * default boost can still drop its own boosting to 0%. > */ > struct uclamp_se { > unsigned int value : bits_per(SCHED_CAPACITY_SCALE); > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS); > unsigned int active : 1; > + unsigned int user_defined : 1; > }; > #endif /* CONFIG_UCLAMP_TASK */ > > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > index 075c610adf45..d2c65617a4a4 100644 > --- a/include/uapi/linux/sched.h > +++ b/include/uapi/linux/sched.h > @@ -53,10 +53,20 @@ > #define SCHED_FLAG_RECLAIM 0x02 > #define SCHED_FLAG_DL_OVERRUN 0x04 > #define SCHED_FLAG_KEEP_POLICY 0x08 > +#define SCHED_FLAG_KEEP_PARAMS 0x10 > +#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20 > +#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40 > + > +#define SCHED_FLAG_KEEP_ALL(SCHED_FLAG_KEEP_POLICY | \ > +SCHED_FLAG_KEEP_PARAMS) > + > +#d
Re: [PATCH v8 08/16] sched/core: uclamp: Set default clamps for RT tasks
On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi wrote: > > By default FAIR tasks start without clamps, i.e. neither boosted nor > capped, and they run at the best frequency matching their utilization > demand. This default behavior does not fit RT tasks which instead are > expected to run at the maximum available frequency, if not otherwise > required by explicitly capping them. > > Enforce the correct behavior for RT tasks by setting util_min to max > whenever: > > 1. the task is switched to the RT class and it does not already have a > user-defined clamp value assigned. > > 2. an RT task is forked from a parent with RESET_ON_FORK set. > > NOTE: utilization clamp values are cross scheduling class attributes and > thus they are never changed/reset once a value has been explicitly > defined from user-space. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > --- > kernel/sched/core.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index bdebdabe9bc4..71c9dd6487b1 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -1042,6 +1042,28 @@ static int uclamp_validate(struct task_struct *p, > static void __setscheduler_uclamp(struct task_struct *p, > const struct sched_attr *attr) > { > + unsigned int clamp_id; > + > + /* > +* On scheduling class change, reset to default clamps for tasks > +* without a task-specific value. > +*/ > + for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { > + struct uclamp_se *uc_se = &p->uclamp_req[clamp_id]; > + unsigned int clamp_value = uclamp_none(clamp_id); > + > + /* Keep using defined clamps across class changes */ > + if (uc_se->user_defined) > + continue; > + > + /* By default, RT tasks always get 100% boost */ > + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) > + clamp_value = uclamp_none(UCLAMP_MAX); > + > + uc_se->bucket_id = uclamp_bucket_id(clamp_value); > + uc_se->value = clamp_value; Is it possible for p->uclamp_req[UCLAMP_MAX].value to be less than uclamp_none(UCLAMP_MAX) for this RT task? If that's a possibility then I think we will end up with a case of p->uclamp_req[UCLAMP_MIN].value > p->uclamp_req[UCLAMP_MAX].value after these assignments are done. > + } > + > if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP))) > return; > > @@ -1077,6 +1099,10 @@ static void uclamp_fork(struct task_struct *p) > for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) { > unsigned int clamp_value = uclamp_none(clamp_id); > > + /* By default, RT tasks always get 100% boost */ > + if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) > + clamp_value = uclamp_none(UCLAMP_MAX); > + > p->uclamp_req[clamp_id].user_defined = false; > p->uclamp_req[clamp_id].value = clamp_value; > p->uclamp_req[clamp_id].bucket_id = > uclamp_bucket_id(clamp_value); > -- > 2.20.1 >
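For completeness, the "user-defined" clamps referenced in this patch are the per-task values set via the extended sched_setattr() from patch 06/16 above. A userspace sketch, assuming the uapi layout and the flag values shown in the quoted series (glibc has no wrapper, so a raw syscall is used):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the sched_attr layout as extended by this series (assumption). */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

#define SCHED_FLAG_KEEP_POLICY		0x08
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	/* Keep policy/params, only define the task-specific clamps. */
	attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
			   SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
	attr.sched_util_min = 256;	/* boost: at least ~25% of capacity */
	attr.sched_util_max = 768;	/* cap: at most ~75% of capacity */

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}
	return 0;
}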
Re: [PATCH v8 12/16] sched/core: uclamp: Extend CPU's cgroup controller
On Tue, Apr 2, 2019 at 3:43 AM Patrick Bellasi wrote: > > The cgroup CPU bandwidth controller allows to assign a specified > (maximum) bandwidth to the tasks of a group. However this bandwidth is > defined and enforced only on a temporal base, without considering the > actual frequency a CPU is running on. Thus, the amount of computation > completed by a task within an allocated bandwidth can be very different > depending on the actual frequency the CPU is running that task. > The amount of computation can be affected also by the specific CPU a > task is running on, especially when running on asymmetric capacity > systems like Arm's big.LITTLE. > > With the availability of schedutil, the scheduler is now able > to drive frequency selections based on actual task utilization. > Moreover, the utilization clamping support provides a mechanism to > bias the frequency selection operated by schedutil depending on > constraints assigned to the tasks currently RUNNABLE on a CPU. > > Giving the mechanisms described above, it is now possible to extend the > cpu controller to specify the minimum (or maximum) utilization which > should be considered for tasks RUNNABLE on a cpu. > This makes it possible to better defined the actual computational > power assigned to task groups, thus improving the cgroup CPU bandwidth > controller which is currently based just on time constraints. > > Extend the CPU controller with a couple of new attributes util.{min,max} > which allows to enforce utilization boosting and capping for all the > tasks in a group. Specifically: > > - util.min: defines the minimum utilization which should be considered > i.e. the RUNNABLE tasks of this group will run at least at a > minimum frequency which corresponds to the util.min > utilization > > - util.max: defines the maximum utilization which should be considered > i.e. the RUNNABLE tasks of this group will run up to a > maximum frequency which corresponds to the util.max > utilization > > These attributes: > > a) are available only for non-root nodes, both on default and legacy >hierarchies, while system wide clamps are defined by a generic >interface which does not depends on cgroups. This system wide >interface enforces constraints on tasks in the root node. > > b) enforce effective constraints at each level of the hierarchy which >are a restriction of the group requests considering its parent's >effective constraints. Root group effective constraints are defined >by the system wide interface. >This mechanism allows each (non-root) level of the hierarchy to: >- request whatever clamp values it would like to get >- effectively get only up to the maximum amount allowed by its parent > > c) have higher priority than task-specific clamps, defined via >sched_setattr(), thus allowing to control and restrict task requests > > Add two new attributes to the cpu controller to collect "requested" > clamp values. Allow that at each non-root level of the hierarchy. > Validate local consistency by enforcing util.min < util.max. > Keep it simple by do not caring now about "effective" values computation > and propagation along the hierarchy. 
> > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Tejun Heo > > -- > Changes in v8: > Message-ID: <20190214154817.gn50...@devbig004.ftw2.facebook.com> > - update changelog description for points b), c) and following paragraph > --- > Documentation/admin-guide/cgroup-v2.rst | 27 + > init/Kconfig| 22 > kernel/sched/core.c | 142 +++- > kernel/sched/sched.h| 6 + > 4 files changed, 196 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst > b/Documentation/admin-guide/cgroup-v2.rst > index 7bf3f129c68b..47710a77f4fa 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -909,6 +909,12 @@ controller implements weight and absolute bandwidth > limit models for > normal scheduling policy and absolute bandwidth allocation model for > realtime scheduling policy. > > +Cycles distribution is based, by default, on a temporal base and it > +does not account for the frequency at which tasks are executed. > +The (optional) utilization clamping support allows to enforce a minimum > +bandwidth, which should always be provided by a CPU, and a maximum bandwidth, > +which should never be exceeded by a CPU. > + > WARNING: cgroup2 doesn't yet support control of realtime processes and > the cpu controller can only be enabled when all RT processes are in > the root cgroup. Be aware that system management software may already > @@ -974,6 +980,27 @@ All time durations are in microseconds. > Shows pressure stall information for CPU. See > Documentation/accounting/psi.txt for d
Re: [PATCH v8 04/16] sched/core: uclamp: Add system default clamps
On Tue, Apr 2, 2019 at 3:42 AM Patrick Bellasi wrote: > > Tasks without a user-defined clamp value are considered not clamped > and by default their utilization can have any value in the > [0..SCHED_CAPACITY_SCALE] range. > > Tasks with a user-defined clamp value are allowed to request any value > in that range, and the required clamp is unconditionally enforced. > However, a "System Management Software" could be interested in limiting > the range of clamp values allowed for all tasks. > > Add a privileged interface to define a system default configuration via: > > /proc/sys/kernel/sched_uclamp_util_{min,max} > > which works as an unconditional clamp range restriction for all tasks. > > With the default configuration, the full SCHED_CAPACITY_SCALE range of > values is allowed for each clamp index. Otherwise, the task-specific > clamp is capped by the corresponding system default value. > > Do that by tracking, for each task, the "effective" clamp value and > bucket the task has been refcounted in at enqueue time. This > allows to lazy aggregate "requested" and "system default" values at > enqueue time and simplifies refcounting updates at dequeue time. > > The cached bucket ids are used to avoid (relatively) more expensive > integer divisions every time a task is enqueued. > > An active flag is used to report when the "effective" value is valid and > thus the task is actually refcounted in the corresponding rq's bucket. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > > --- > Changes in v8: > Message-ID: <20190313201010.gu2...@worktop.programming.kicks-ass.net> > - add "requested" values uclamp_se instance beside the existing >"effective" values instance > - rename uclamp_effective_{get,assign}() into uclamp_eff_{get,set}() > - make uclamp_eff_get() return the new "effective" values by copy > Message-ID: <20190318125844.ajhjpaqlcgxn7qkq@e110439-lin> > - run uclamp_fork() code independently from the class being supported. >Resetting active flag is not harmful and following patches will add >other code which still needs to be executed independently from class >support. > Message-ID: <20190313201342.gv2...@worktop.programming.kicks-ass.net> > - add sysctl_sched_uclamp_handler()'s internal mutex to serialize >concurrent usages > --- > include/linux/sched.h| 10 +++ > include/linux/sched/sysctl.h | 11 +++ > kernel/sched/core.c | 131 ++- > kernel/sysctl.c | 16 + > 4 files changed, 167 insertions(+), 1 deletion(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 0c0dd7aac8e9..d8491954e2e1 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -584,14 +584,21 @@ struct sched_dl_entity { > * Utilization clamp for a scheduling entity > * @value: clamp value "assigned" to a se > * @bucket_id: bucket index corresponding to the "assigned" value > + * @active:the se is currently refcounted in a rq's bucket > * > * The bucket_id is the index of the clamp bucket matching the clamp value > * which is pre-computed and stored to avoid expensive integer divisions from > * the fast path. > + * > + * The active bit is set whenever a task has got an "effective" value > assigned, > + * which can be different from the clamp value "requested" from user-space. > + * This allows to know a task is refcounted in the rq's bucket corresponding > + * to the "effective" bucket_id. 
> */ > struct uclamp_se { > unsigned int value : bits_per(SCHED_CAPACITY_SCALE); > unsigned int bucket_id : bits_per(UCLAMP_BUCKETS); > + unsigned int active : 1; > }; > #endif /* CONFIG_UCLAMP_TASK */ > > @@ -676,6 +683,9 @@ struct task_struct { > struct sched_dl_entity dl; > > #ifdef CONFIG_UCLAMP_TASK > + /* Clamp values requested for a scheduling entity */ > + struct uclamp_seuclamp_req[UCLAMP_CNT]; > + /* Effective clamp values used for a scheduling entity */ > struct uclamp_seuclamp[UCLAMP_CNT]; > #endif > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h > index 99ce6d728df7..d4f6215ee03f 100644 > --- a/include/linux/sched/sysctl.h > +++ b/include/linux/sched/sysctl.h > @@ -56,6 +56,11 @@ int sched_proc_update_handler(struct ctl_table *table, int > write, > extern unsigned int sysctl_sched_rt_period; > extern int sysctl_sched_rt_runtime; > > +#ifdef CONFIG_UCLAMP_TASK > +extern unsigned int sysctl_sched_uclamp_util_min; > +extern unsigned int sysctl_sched_uclamp_util_max; > +#endif > + > #ifdef CONFIG_CFS_BANDWIDTH > extern unsigned int sysctl_sched_cfs_bandwidth_slice; > #endif > @@ -75,6 +80,12 @@ extern int sched_rt_handler(struct ctl_table *table, int > write, > void __user *buffer, size_t *lenp, > loff_t *ppos); > > +#ifdef CONFIG_UCLAMP_T
Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione wrote: > > On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko wrote: > > > > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users. > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > every time a process is going to die? > > > > Well, you are tearing down an address space which might be still in use > > because the task not fully dead yeat. So there are two downsides AFAICS. > > Core dumping which will not see the reaped memory so the resulting > > Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you > try to start a core dump after reaping begins, too bad: you could have > raced with process death anyway. > > > coredump might be incomplete. And unexpected #PF/gup on the reaped > > memory will result in SIGBUS. > > It's a dying process. Why even bother returning from the fault > handler? Just treat that situation as a thread exit. There's no need > to make this observable to userspace at all. I've spent some more time to investigate possible effects of reaping on coredumps and asked Oleg Nesterov who worked on patchsets that prioritize SIGKILLs over coredump activity (https://lkml.org/lkml/2013/2/17/118). Current do_coredump implementation seems to handle the case of SIGKILL interruption by bailing out whenever dump_interrupted() returns true and that would be the case with pending SIGKILL. So in the case of race when coredump happens first and SIGKILL comes next interrupting the coredump seems to result in no change in behavior and reaping memory proactively seems to have no side effects. An opposite race when SIGKILL gets posted and then coredump happens seems impossible because do_coredump won't be called from get_signal due to signal_group_exit check (get_signal checks signal_group_exit while holding sighand->siglock and complete_signal sets SIGNAL_GROUP_EXIT while holding the same lock). There is a path from __seccomp_filter calling do_coredump while processing SECCOMP_RET_KILL_xxx but even then it should bail out when coredump_wait()->zap_threads(current) checks signal_group_exit(). (Thanks Oleg for clarifying this for me). If we are really concerned about possible increase in failed coredumps because of the proactive reaping I could make it conditional on whether coredumping mm is possible by placing this feature behind !get_dumpable(mm) condition. Another possibility is to check RLIMIT_CORE to decide if coredumps are possible (although if pipe is used for coredump that limit seems to be ignored, so that check would have to take this into consideration). On the issue of SIGBUS happening when accessed memory was already reaped, my understanding that SIGBUS being a synchronous signal will still have to be fetched using dequeue_synchronous_signal from get_signal but not before signal_group_exit is checked. So again if SIGKILL is pending I think SIGBUS will be ignored (please correct me if that's not correct). One additional question I would like to clarify is whether per-node reapers like Roman suggested would make a big difference (All CPUs that I've seen used for Android are single-node ones, so looking for more feedback here). 
If it's important then reaping victim's memory by the killer is probably not an option.
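For readers unfamiliar with the interface being extended, a hedged userspace sketch of what the RFC proposes. pidfd_send_signal() and the SIGKILL path are mainline; SS_EXPEDITE exists only in this RFC (its numeric value below is a placeholder), and at this point a pidfd is obtained by opening the /proc/<pid> directory.

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif

#define SS_EXPEDITE 1	/* placeholder: use whatever the RFC's uapi header defines */

int main(int argc, char **argv)
{
	char path[64];
	int pidfd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%d", atoi(argv[1]));
	pidfd = open(path, O_DIRECTORY | O_CLOEXEC);
	if (pidfd < 0) {
		perror("open /proc/<pid>");
		return 1;
	}
	/* Privileged: SIGKILL plus expedited reclaim of the victim's memory. */
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, SS_EXPEDITE)) {
		perror("pidfd_send_signal");
		return 1;
	}
	return 0;
}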
Re: [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
On Thu, Apr 25, 2019 at 2:13 PM Tetsuo Handa wrote: > > On 2019/04/11 10:43, Suren Baghdasaryan wrote: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 3a2484884cfd..6449710c8a06 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc) > > return !!oc->chosen; > > } > > > > +bool expedite_reclaim(struct task_struct *task) > > +{ > > + bool res = false; > > + > > + task_lock(task); > > + if (task_will_free_mem(task)) { > > mark_oom_victim() needs to be called under oom_lock mutex after > checking that oom_killer_disabled == false. Since you are trying > to trigger this function from signal handler, you might want to > defer until e.g. WQ context. Thanks for the tip! I'll take this into account in the new design. Just thinking out loud... AFAIU oom_lock is there to protect against multiple concurrent out_of_memory calls from different contexts and prevent overly-aggressive process killing. For my purposes when reaping memory of a killed process we don't have this concern (we did not initiate the killing, SIGKILL was explicitly requested). I'll probably need some synchronization there but not for purposes of preventing multiple concurrent reapers. In any case, thank you for the feedback! > > > + mark_oom_victim(task); > > + wake_oom_reaper(task); > > + res = true; > > + } > > + task_unlock(task); > > + > > + return res; > > +} > > + > > /* > > * The pagefault handler calls here because it is out of memory, so kill a > > * memory-hogging task. If oom_lock is held by somebody else, a parallel > > oom > >
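A hedged sketch of the direction Tetsuo suggests (my reading, not an actual follow-up patch): defer the work to workqueue context so mark_oom_victim() can run under oom_lock after the oom_killer_disabled check, instead of doing it from the signal path. This would sit in mm/oom_kill.c next to the expedite_reclaim() shown above.

#include <linux/oom.h>
#include <linux/sched/task.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct expedite_work {
	struct work_struct work;
	struct task_struct *task;	/* reference held by the queuer */
};

static void expedite_reclaim_workfn(struct work_struct *work)
{
	struct expedite_work *ew = container_of(work, struct expedite_work, work);
	struct task_struct *task = ew->task;

	mutex_lock(&oom_lock);
	if (!oom_killer_disabled) {
		task_lock(task);
		if (task_will_free_mem(task)) {
			mark_oom_victim(task);
			wake_oom_reaper(task);
		}
		task_unlock(task);
	}
	mutex_unlock(&oom_lock);

	put_task_struct(task);
	kfree(ew);
}

/* Called from the kill path instead of reaping synchronously. */
static bool expedite_reclaim_async(struct task_struct *task)
{
	struct expedite_work *ew = kzalloc(sizeof(*ew), GFP_ATOMIC);

	if (!ew)
		return false;

	get_task_struct(task);
	ew->task = task;
	INIT_WORK(&ew->work, expedite_reclaim_workfn);
	schedule_work(&ew->work);
	return true;
}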
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 1:05 AM Michal Hocko wrote: > > On Mon 11-03-19 15:15:35, Suren Baghdasaryan wrote: > > On Mon, Mar 11, 2019 at 1:46 PM Sultan Alsawaf > > wrote: > > > > > > On Mon, Mar 11, 2019 at 01:10:36PM -0700, Suren Baghdasaryan wrote: > > > > The idea seems interesting although I need to think about this a bit > > > > more. Killing processes based on failed page allocation might backfire > > > > during transient spikes in memory usage. > > > > > > This issue could be alleviated if tasks could be killed and have their > > > pages > > > reaped faster. Currently, Linux takes a _very_ long time to free a task's > > > memory > > > after an initial privileged SIGKILL is sent to a task, even with the > > > task's > > > priority being set to the highest possible (so unwanted scheduler > > > preemption > > > starving dying tasks of CPU time is not the issue at play here). I've > > > frequently measured the difference in time between when a SIGKILL is sent > > > for a > > > task and when free_task() is called for that task to be hundreds of > > > milliseconds, which is incredibly long. AFAIK, this is a problem that LMKD > > > suffers from as well, and perhaps any OOM killer implementation in Linux, > > > since > > > you cannot evaluate effect you've had on memory pressure by killing a > > > process > > > for at least several tens of milliseconds. > > > > Yeah, killing speed is a well-known problem which we are considering > > in LMKD. For example the recent LMKD change to assign process being > > killed to a cpuset cgroup containing big cores cuts the kill time > > considerably. This is not ideal and we are thinking about better ways > > to expedite the cleanup process. > > If you design is relies on the speed of killing then it is fundamentally > flawed AFAICT. You cannot assume anything about how quickly a task dies. > It might be blocked in an uninterruptible sleep or performin an > operation which takes some time. Sure, oom_reaper might help here but > still. That's what I was considering. This is not a silver bullet but increased speed would not hurt. > The only way to control the OOM behavior pro-actively is to throttle > allocation speed. We have memcg high limit for that purpose. Along with > PSI, I can imagine a reasonably working user space early oom > notifications and reasonable acting upon that. That makes sense and we are working in this direction. > -- > Michal Hocko > SUSE Labs Thanks, Suren.
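To illustrate the memcg-based throttling Michal refers to: a sketch that sets memory.high on a cgroup and then reads that group's memory.pressure file (per-cgroup PSI comes from the "psi: cgroup support" patch listed earlier). The cgroup path is hypothetical and assumes a cgroup2 hierarchy mounted at /sys/fs/cgroup.

#include <stdio.h>

#define CG "/sys/fs/cgroup/app"	/* hypothetical group */

static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char line[256];
	FILE *f;

	/* Throttle the group's allocations well before the global OOM point. */
	if (write_file(CG "/memory.high", "268435456\n"))	/* 256 MiB */
		perror("memory.high");

	/* Watch the group's own stall numbers for an early warning. */
	f = fopen(CG "/memory.pressure", "r");
	if (!f) {
		perror("memory.pressure");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}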
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Tue, Mar 12, 2019 at 9:58 AM Michal Hocko wrote: > > On Tue 12-03-19 09:37:41, Sultan Alsawaf wrote: > > I have not had a chance to look at PSI yet, but > > unless a PSI-enabled solution allows allocations to reach the same point as > > when > > the OOM killer is invoked (which is contradictory to what it sets out to > > do), LMK's job is to relieve memory pressure before we reach the boiling point at which OOM killer has to be invoked. If we wait that long it will definitely affect user experience. There might be usecases when you might not care about this but on interactive systems like Android it is important. > > then it cannot take advantage of all of the alternative memory-reclaim means > > employed in the slowpath, and will result in killing a process before it is > > _really_ necessary. I guess it's a matter of defining when is it _really_ necessary to kill. In Android case that should be when the user starts suffering from the delays caused by memory contention and that delay is exactly what PSI is measuring. > One more note. The above is true, but you can also hit one of the > thrashing reclaim behaviors and reclaim last few pages again and again > with the whole system really sluggish. That is what PSI is trying to > help with. > -- > Michal Hocko > SUSE Labs
Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting
On Wed, Mar 13, 2019 at 8:15 AM Patrick Bellasi wrote: > > On 12-Mar 13:52, Dietmar Eggemann wrote: > > On 2/8/19 11:05 AM, Patrick Bellasi wrote: > > > > [...] > > > > > +config UCLAMP_BUCKETS_COUNT > > > + int "Number of supported utilization clamp buckets" > > > + range 5 20 > > > + default 5 > > > + depends on UCLAMP_TASK > > > + help > > > + Defines the number of clamp buckets to use. The range of each bucket > > > + will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the > > > + number of clamp buckets the finer their granularity and the higher > > > + the precision of clamping aggregation and tracking at run-time. > > > + > > > + For example, with the default configuration we will have 5 clamp > > > + buckets tracking 20% utilization each. A 25% boosted tasks will be > > > + refcounted in the [20..39]% bucket and will set the bucket clamp > > > + effective value to 25%. > > > + If a second 30% boosted task should be co-scheduled on the same CPU, > > > + that task will be refcounted in the same bucket of the first task > > > and > > > + it will boost the bucket clamp effective value to 30%. > > > + The clamp effective value of a bucket is reset to its nominal value > > > + (20% in the example above) when there are anymore tasks refcounted > > > in > > > > this sounds weird. > > Why ? Should probably be "when there are no more tasks refcounted" > > > > [...] > > > > > +static inline unsigned int uclamp_bucket_value(unsigned int clamp_value) > > > +{ > > > + return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value); > > > +} > > > > Soemthing like uclamp_bucket_nominal_value() should be clearer. > > Maybe... can update it in v8 > uclamp_bucket_base_value is a little shorter, just to consider :) > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > > > +{ > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > + unsigned int max_value = uclamp_none(clamp_id); > > > + unsigned int bucket_id; > > > > unsigned int bucket_id = UCLAMP_BUCKETS; > > > > > + > > > + /* > > > +* Both min and max clamps are MAX aggregated, thus the topmost > > > +* bucket with some tasks defines the rq's clamp value. > > > +*/ > > > + bucket_id = UCLAMP_BUCKETS; > > > > to get rid of this line? > > I put it on a different line as a justfication for the loop variable > initialization described in the comment above. > > > > > > + do { > > > + --bucket_id; > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > > if (!bucket[bucket_id].tasks) > > Right... that's some leftover from the last refactoring! > > [...] > > > > + * within each bucket the exact "requested" clamp value whenever all > > > tasks > > > + * RUNNABLE in that bucket require the same clamp. > > > + */ > > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > > + unsigned int clamp_id) > > > +{ > > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > > > Wouldn't it be easier to have a pointer to the task's and rq's uclamp > > structure as well to the bucket? > > > > - unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + struct uclamp_se *uc_se = &p->uclamp[clamp_id]; > > + struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id]; > > + struct uclamp_bucket *bucket = &uc_rq->bucket[uc_se->bucket_id]; > > I think I went back/forth a couple of times in using pointer or the > extended version, which both have pros and cons. 
> > I personally prefer the pointers as you suggest but I've got the > impression in the past that since everybody cleared "basic C trainings" > it's not so difficult to read the code above too. > > > The code in uclamp_rq_inc_id() and uclamp_rq_dec_id() for example becomes > > much more readable. > > Agree... let's try to switch once again in v8 and see ;) > > > [...] > > > > > struct sched_class { > > > const struct sched_class *next; > > > +#ifdef CONFIG_UCLAMP_TASK > > > + int uclamp_enabled; > > > +#endif > > > + > > > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int > > > flags); > > > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int > > > flags); > > > - void (*yield_task) (struct rq *rq); > > > - bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool > > > preempt); > > > void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int > > > flags); > > > @@ -1685,7 +1734,6 @@ struct sched_class { > > > void (*set_curr_task)(struct rq *rq); > > > void (*task_tick)(struct rq *rq, struct task_struct *p, int queued); > > > void (*task_fork)(struct task_struct *p); > > > - void (*task_dead)(struct task_struct *p); > > > /* > > > * The switched_from() call is allowed to drop rq->lock, therefore we > > > @@ -1702,12 +1750,17 @@
Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting
On Wed, Mar 13, 2019 at 12:46 PM Peter Zijlstra wrote: > > On Wed, Mar 13, 2019 at 03:23:59PM +, Patrick Bellasi wrote: > > On 13-Mar 15:09, Peter Zijlstra wrote: > > > On Fri, Feb 08, 2019 at 10:05:40AM +, Patrick Bellasi wrote: > > > > > +static inline void uclamp_rq_update(struct rq *rq, unsigned int > > > > clamp_id) > > > > +{ > > > > + struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > > > > + unsigned int max_value = uclamp_none(clamp_id); > > > > > > That's 1024 for uclamp_max > > > > > > > + unsigned int bucket_id; > > > > + > > > > + /* > > > > + * Both min and max clamps are MAX aggregated, thus the topmost > > > > + * bucket with some tasks defines the rq's clamp value. > > > > + */ > > > > + bucket_id = UCLAMP_BUCKETS; > > > > + do { > > > > + --bucket_id; > > > > + if (!rq->uclamp[clamp_id].bucket[bucket_id].tasks) > > > > + continue; > > > > + max_value = bucket[bucket_id].value; > > > > > > but this will then _lower_ it. That's not a MAX aggregate. > > > > For uclamp_max we want max_value=1024 when there are no active tasks, > > which means: no max clamp enforced on CFS/RT "idle" cpus. > > > > If instead there are active RT/CFS tasks then we want the clamp value > > of the max group, which means: MAX aggregate active clamps. > > > > That's what the code above does and the comment says. > > That's (obviously) not how I read it... maybe something like: > > static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id) > { > struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket; > int i; > > /* > * Since both min and max clamps are max aggregated, find the > * top most bucket with tasks in. > */ > for (i = UCLMAP_BUCKETS-1; i>=0; i--) { > if (!bucket[i].tasks) > continue; > return bucket[i].value; > } > > /* No tasks -- default clamp values */ > return uclamp_none(clamp_id); > } > > would make it clearer? This way it's also more readable/obvious when it's used inside uclamp_rq_dec_id, assuming uclamp_rq_update is renamed into smth like get_max_rq_uclamp.
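To illustrate that renaming idea, a kernel-style sketch (hypothetical names, not code from the series) of a helper that returns the max aggregated value plus a thin wrapper that keeps the existing call sites working:

---8<---
/* Hypothetical names; a sketch of the suggestion above, not the patch. */
static inline unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id)
{
	struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
	int i;

	/*
	 * Both min and max clamps are MAX aggregated: the topmost bucket
	 * with RUNNABLE tasks defines the rq's clamp value.
	 */
	for (i = UCLAMP_BUCKETS - 1; i >= 0; i--) {
		if (!bucket[i].tasks)
			continue;
		return bucket[i].value;
	}

	/* No tasks -- default clamp value */
	return uclamp_none(clamp_id);
}

static inline void uclamp_rq_update(struct rq *rq, unsigned int clamp_id)
{
	WRITE_ONCE(rq->uclamp[clamp_id].value, uclamp_rq_max_value(rq, clamp_id));
}
---8<---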
Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting
On Wed, Mar 13, 2019 at 6:52 AM Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:40AM +, Patrick Bellasi wrote: > > +/* > > + * When a task is enqueued on a rq, the clamp bucket currently defined by > > the > > + * task's uclamp::bucket_id is reference counted on that rq. This also > > + * immediately updates the rq's clamp value if required. > > + * > > + * Since tasks know their specific value requested from user-space, we > > track > > + * within each bucket the maximum value for tasks refcounted in that > > bucket. > > + * This provide a further aggregation (local clamping) which allows to > > track > > + * within each bucket the exact "requested" clamp value whenever all tasks > > + * RUNNABLE in that bucket require the same clamp. > > + */ > > +static inline void uclamp_rq_inc_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp, tsk_clamp; > > + > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks++; > > + > > + /* > > + * Local clamping: rq's buckets always track the max "requested" > > + * clamp value from all RUNNABLE tasks in that bucket. > > + */ > > + tsk_clamp = p->uclamp[clamp_id].value; > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > + rq->uclamp[clamp_id].bucket[bucket_id].value = max(bkt_clamp, > > tsk_clamp); > > So, if I read this correct: > > - here we track a max value in a bucket, > > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > + WRITE_ONCE(rq->uclamp[clamp_id].value, max(rq_clamp, tsk_clamp)); > > +} > > + > > +/* > > + * When a task is dequeued from a rq, the clamp bucket reference counted by > > + * the task is released. If this is the last task reference counting the > > rq's > > + * max active clamp value, then the rq's clamp value is updated. > > + * Both the tasks reference counter and the rq's cached clamp values are > > + * expected to be always valid, if we detect they are not we skip the > > updates, > > + * enforce a consistent state and warn. > > + */ > > +static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq, > > + unsigned int clamp_id) > > +{ > > + unsigned int bucket_id = p->uclamp[clamp_id].bucket_id; > > + unsigned int rq_clamp, bkt_clamp; > > + > > + SCHED_WARN_ON(!rq->uclamp[clamp_id].bucket[bucket_id].tasks); > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > + rq->uclamp[clamp_id].bucket[bucket_id].tasks--; > > + > > + /* > > + * Keep "local clamping" simple and accept to (possibly) overboost > > + * still RUNNABLE tasks in the same bucket. > > + */ > > + if (likely(rq->uclamp[clamp_id].bucket[bucket_id].tasks)) > > + return; > > (Oh man, I hope that generates semi sane code; long live CSE passes I > suppose) > > But we never decrement that bkt_clamp value on dequeue. > > > + bkt_clamp = rq->uclamp[clamp_id].bucket[bucket_id].value; > > + > > + /* The rq's clamp value is expected to always track the max */ > > + rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value); > > + SCHED_WARN_ON(bkt_clamp > rq_clamp); > > + if (bkt_clamp >= rq_clamp) { > > head hurts, this reads ==, how can this ever not be so? > > > + /* > > + * Reset rq's clamp bucket value to its nominal value whenever > > + * there are anymore RUNNABLE tasks refcounting it. > > -ENOPARSE > > > + */ > > + rq->uclamp[clamp_id].bucket[bucket_id].value = > > + uclamp_bucket_value(rq_clamp); > > But basically you decrement the bucket value to the nominal value. 
> > > + uclamp_rq_update(rq, clamp_id); > > + } > > +} > > Given all that, what is to stop the bucket value to climbing to > uclamp_bucket_value(+1)-1 and staying there (provided there's someone > runnable)? > > Why are we doing this... ? I agree with Peter, this part of the patch was the hardest to read. SCHED_WARN_ON line makes sense to me. The condition that follows and the following comment are a little baffling. Condition seems to indicate that the code that follows should be executed only if we are in the top-most occupied bucket (the bucket which has tasks and has the highest uclamp value). So this bucket just lost its last task and we should update rq->uclamp[clamp_id].value. However that's not exactly what the code does... It also resets rq->uclamp[clamp_id].bucket[bucket_id].value. So if I understand correctly, unless the bucket that just lost its last task is the top-most one its value will not be reset to nominal value. That looks like a bug to me. Am I missing something? Side note: some more explanation would be very helpful.
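For concreteness, here is a sketch of the reordering described above: unconditionally drop an emptied bucket back to its nominal value, and only rescan the buckets when that bucket was the one defining the rq-wide clamp. This is an illustration of the concern, not a tested fix; it reuses the helpers from the patch (uclamp_bucket_value(), uclamp_rq_update(), SCHED_WARN_ON()).

---8<---
static inline void uclamp_rq_dec_id(struct task_struct *p, struct rq *rq,
				    unsigned int clamp_id)
{
	unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;
	struct uclamp_bucket *bucket = &rq->uclamp[clamp_id].bucket[bucket_id];
	unsigned int rq_clamp, bkt_clamp;

	SCHED_WARN_ON(!bucket->tasks);
	if (likely(bucket->tasks))
		bucket->tasks--;

	/* Other RUNNABLE tasks still refcount this bucket: nothing to do. */
	if (bucket->tasks)
		return;

	/*
	 * Bucket is now empty: drop its (possibly boosted) "requested" value
	 * back to the nominal bucket value, regardless of which bucket it is.
	 */
	bkt_clamp = bucket->value;
	bucket->value = uclamp_bucket_value(bkt_clamp);

	/* Only if this bucket was defining the rq clamp, find the new max. */
	rq_clamp = READ_ONCE(rq->uclamp[clamp_id].value);
	if (bkt_clamp >= rq_clamp)
		uclamp_rq_update(rq, clamp_id);
}
---8<---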
Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting
On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi wrote: > > Utilization clamping allows to clamp the CPU's utilization within a > [util_min, util_max] range, depending on the set of RUNNABLE tasks on > that CPU. Each task references two "clamp buckets" defining its minimum > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp > bucket is active if there is at least one RUNNABLE tasks enqueued on > that CPU and refcounting that bucket. > > When a task is {en,de}queued {on,from} a rq, the set of active clamp > buckets on that CPU can change. Since each clamp bucket enforces a > different utilization clamp value, when the set of active clamp buckets > changes, a new "aggregated" clamp value is computed for that CPU. > > Clamp values are always MAX aggregated for both util_min and util_max. > This ensures that no tasks can affect the performance of other > co-scheduled tasks which are more boosted (i.e. with higher util_min > clamp) or less capped (i.e. with higher util_max clamp). > > Each task has a: >task_struct::uclamp[clamp_id]::bucket_id > to track the "bucket index" of the CPU's clamp bucket it refcounts while > enqueued, for each clamp index (clamp_id). > > Each CPU's rq has a: >rq::uclamp[clamp_id]::bucket[bucket_id].tasks > to track how many tasks, currently RUNNABLE on that CPU, refcount each > clamp bucket (bucket_id) of a clamp index (clamp_id). > > Each CPU's rq has also a: >rq::uclamp[clamp_id]::bucket[bucket_id].value > to track the clamp value of each clamp bucket (bucket_id) of a clamp > index (clamp_id). > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > to find a new MAX aggregated clamp value for a clamp_id. This operation > is required only when we dequeue the last task of a clamp bucket > tracking the current MAX aggregated clamp value. In these cases, the CPU > is either entering IDLE or going to schedule a less boosted or more > clamped task. > The expected number of different clamp values, configured at build time, > is small enough to fit the full unordered array into a single cache > line. I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]" here. uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up to 20. sizeof(long)*20 is already more than 64 bytes. What am I missing? > Add the basic data structures required to refcount, in each CPU's rq, > the number of RUNNABLE tasks for each clamp bucket. Add also the max > aggregation required to update the rq's clamp value at each > enqueue/dequeue event. > > Use a simple linear mapping of clamp values into clamp buckets. > Pre-compute and cache bucket_id to avoid integer divisions at > enqueue/dequeue time. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > > --- > Changes in v7: > Message-ID: <20190123191007.gg17...@hirez.programming.kicks-ass.net> > - removed buckets mapping code > - use a simpler linear mapping of clamp values into buckets > Message-ID: <20190124161443.lv2pw5fsspyelckq@e110439-lin> > - move this patch at the beginning of the series, >in the attempt to make the overall series easier to digest by moving >at the very beginning the core bits and main data structures > Others: > - update the mapping logic to use exactly and only >UCLAMP_BUCKETS_COUNT buckets, i.e. 
no more "special" bucket > - update uclamp_rq_update() to do top-bottom max search > --- > include/linux/log2.h | 37 > include/linux/sched.h | 39 > include/linux/sched/topology.h | 6 -- > init/Kconfig | 53 +++ > kernel/sched/core.c| 165 + > kernel/sched/sched.h | 59 +++- > 6 files changed, 350 insertions(+), 9 deletions(-) > > diff --git a/include/linux/log2.h b/include/linux/log2.h > index 2af7f77866d0..e2db25734532 100644 > --- a/include/linux/log2.h > +++ b/include/linux/log2.h > @@ -224,4 +224,41 @@ int __order_base_2(unsigned long n) > ilog2((n) - 1) + 1) : \ > __order_base_2(n) \ > ) > + > +static inline __attribute__((const)) > +int __bits_per(unsigned long n) > +{ > + if (n < 2) > + return 1; > + if (is_power_of_2(n)) > + return order_base_2(n) + 1; > + return order_base_2(n); > +} > + > +/** > + * bits_per - calculate the number of bits required for the argument > + * @n: parameter > + * > + * This is constant-capable and can be used for compile time > + * initiaizations, e.g bitfields. > + * > + * The first few values calculated by this routine: > + * bf(0) = 1 > + * bf(1) = 1 > + * bf(2) = 2 > + * bf(3) = 2 > + * bf(4) = 3 > + * ... and so on. > + */ > +#define bits_per(n)\ > +( \ > + __builtin_constant_p(n) ? ( \ > + ((n) == 0 || (n) == 1) ? 1 : ( \ > + ((n) & (n - 1)) =
Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX
On Wed, Mar 13, 2019 at 9:16 AM Patrick Bellasi wrote: > > On 13-Mar 15:12, Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:41AM +, Patrick Bellasi wrote: > > > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int > > > clamp_id, > > > +unsigned int clamp_value) > > > +{ > > > + /* Reset max-clamp retention only on idle exit */ > > > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE)) > > > + return; > > > + > > > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value); > > > + > > > + /* > > > +* This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX > > > +* (after). The idle flag is reset only the second time, when we know > > > +* that UCLAMP_MIN has been already updated. > > > > Why do we care? That is, what is this comment trying to tell us. > > Right, the code is clear enough, I'll remove this comment. It would probably be even clearer if rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE were done from inside uclamp_rq_inc(), after uclamp_rq_inc_id() has been called for both clamps. > > > > > +*/ > > > + if (clamp_id == UCLAMP_MAX) > > > + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE; > > > +} > > -- > #include > > Patrick Bellasi
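A sketch of that alternative, assuming the uclamp_rq_inc()/uclamp_rq_inc_id() split from this series: clear the flag once per enqueue, after both clamp indexes have been refcounted, instead of keying it off clamp_id == UCLAMP_MAX inside the reset helper. Hypothetical code, shown only to make the suggestion concrete.

---8<---
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
	unsigned int clamp_id;

	for (clamp_id = 0; clamp_id < UCLAMP_CNT; clamp_id++)
		uclamp_rq_inc_id(p, rq, clamp_id);

	/* Reset max-clamp retention only on idle exit */
	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
		rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
}
---8<---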
Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting
On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi wrote: > > On 13-Mar 14:32, Suren Baghdasaryan wrote: > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi > > wrote: > > > > > > Utilization clamping allows to clamp the CPU's utilization within a > > > [util_min, util_max] range, depending on the set of RUNNABLE tasks on > > > that CPU. Each task references two "clamp buckets" defining its minimum > > > and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp > > > bucket is active if there is at least one RUNNABLE tasks enqueued on > > > that CPU and refcounting that bucket. > > > > > > When a task is {en,de}queued {on,from} a rq, the set of active clamp > > > buckets on that CPU can change. Since each clamp bucket enforces a > > > different utilization clamp value, when the set of active clamp buckets > > > changes, a new "aggregated" clamp value is computed for that CPU. > > > > > > Clamp values are always MAX aggregated for both util_min and util_max. > > > This ensures that no tasks can affect the performance of other > > > co-scheduled tasks which are more boosted (i.e. with higher util_min > > > clamp) or less capped (i.e. with higher util_max clamp). > > > > > > Each task has a: > > >task_struct::uclamp[clamp_id]::bucket_id > > > to track the "bucket index" of the CPU's clamp bucket it refcounts while > > > enqueued, for each clamp index (clamp_id). > > > > > > Each CPU's rq has a: > > >rq::uclamp[clamp_id]::bucket[bucket_id].tasks > > > to track how many tasks, currently RUNNABLE on that CPU, refcount each > > > clamp bucket (bucket_id) of a clamp index (clamp_id). > > > > > > Each CPU's rq has also a: > > >rq::uclamp[clamp_id]::bucket[bucket_id].value > > > to track the clamp value of each clamp bucket (bucket_id) of a clamp > > > index (clamp_id). > > > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > > > to find a new MAX aggregated clamp value for a clamp_id. This operation > > > is required only when we dequeue the last task of a clamp bucket > > > tracking the current MAX aggregated clamp value. In these cases, the CPU > > > is either entering IDLE or going to schedule a less boosted or more > > > clamped task. > > > The expected number of different clamp values, configured at build time, > > > is small enough to fit the full unordered array into a single cache > > > line. > > > > I assume you are talking about "struct uclamp_rq uclamp[UCLAMP_CNT]" > > here. > > No, I'm talking about the rq::uclamp::bucket[clamp_id][], which is an > array of: > >struct uclamp_bucket { > unsigned long value : bits_per(SCHED_CAPACITY_SCALE); > unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE); >}; > > defined as part of: > >struct uclamp_rq { > unsigned int value; > struct uclamp_bucket bucket[UCLAMP_BUCKETS]; >}; > > > So, it's an array of UCLAMP_BUCKETS (value, tasks) pairs. > > > uclamp_rq size depends on UCLAMP_BUCKETS configurable to be up > > to 20. sizeof(long)*20 is already more than 64 bytes. What am I > > missing? > > Right, the comment above refers to the default configuration, which is > 5 buckets. 
With that configuration we have: > > > $> pahole kernel/sched/core.o > > ---8<--- >struct uclamp_bucket { >long unsigned int value:11; /* 0:53 8 */ >long unsigned int tasks:53; /* 0: 0 8 */ > >/* size: 8, cachelines: 1, members: 2 */ >/* last cacheline: 8 bytes */ >}; > >struct uclamp_rq { >unsigned int value;/* 0 4 */ > >/* XXX 4 bytes hole, try to pack */ > >struct uclamp_bucket bucket[5];/* 840 */ > >/* size: 48, cachelines: 1, members: 2 */ >/* sum members: 44, holes: 1, sum holes: 4 */ >/* last cacheline: 48 bytes */ >}; > >struct rq { >// ... >/* --- cacheline 2 boundary (128 bytes) --- */ >struct uclamp_rq uclamp[2];/* 12896 */ >/* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */ >// ... >}; > ---8<--- > > Where you see the array fits int
Re: [PATCH v7 12/15] sched/core: uclamp: Propagate parent clamps
On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi wrote: > > In order to properly support hierarchical resources control, the cgroup > delegation model requires that attribute writes from a child group never > fail but still are (potentially) constrained based on parent's assigned > resources. This requires to properly propagate and aggregate parent > attributes down to its descendants. > > Let's implement this mechanism by adding a new "effective" clamp value > for each task group. The effective clamp value is defined as the smaller > value between the clamp value of a group and the effective clamp value > of its parent. This is the actual clamp value enforced on tasks in a > task group. In patch 10 in this series you mentioned "b) do not enforce any constraints and/or dependencies between the parent and its child nodes" This patch seems to change that behavior. If so, should it be documented? > Since it can be interesting for userspace, e.g. system management > software, to know exactly what the currently propagated/enforced > configuration is, the effective clamp values are exposed to user-space > by means of a new pair of read-only attributes > cpu.util.{min,max}.effective. > > Signed-off-by: Patrick Bellasi > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Tejun Heo > > --- > Changes in v7: > Others: > - ensure clamp values are not tunable at root cgroup level > --- > Documentation/admin-guide/cgroup-v2.rst | 19 > kernel/sched/core.c | 118 +++- > 2 files changed, 133 insertions(+), 4 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst > b/Documentation/admin-guide/cgroup-v2.rst > index 47710a77f4fa..7aad2435e961 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -990,6 +990,16 @@ All time durations are in microseconds. > values similar to the sched_setattr(2). This minimum utilization > value is used to clamp the task specific minimum utilization clamp. > > + cpu.util.min.effective > +A read-only single value file which exists on non-root cgroups and > +reports minimum utilization clamp value currently enforced on a task > +group. > + > +The actual minimum utilization in the range [0, 1024]. > + > +This value can be lower then cpu.util.min in case a parent cgroup > +allows only smaller minimum utilization values. > + >cpu.util.max > A read-write single value file which exists on non-root cgroups. > The default is "1024". i.e. no utilization capping > @@ -1000,6 +1010,15 @@ All time durations are in microseconds. > values similar to the sched_setattr(2). This maximum utilization > value is used to clamp the task specific maximum utilization clamp. > > + cpu.util.max.effective > +A read-only single value file which exists on non-root cgroups and > +reports maximum utilization clamp value currently enforced on a task > +group. > + > +The actual maximum utilization in the range [0, 1024]. > + > +This value can be lower then cpu.util.max in case a parent cgroup > +is enforcing a more restrictive clamping on max utilization. 
> > > Memory > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 122ab069ade5..1e54517acd58 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -720,6 +720,18 @@ static void set_load_weight(struct task_struct *p, bool > update_load) > } > > #ifdef CONFIG_UCLAMP_TASK > +/* > + * Serializes updates of utilization clamp values > + * > + * The (slow-path) user-space triggers utilization clamp value updates which > + * can require updates on (fast-path) scheduler's data structures used to > + * support enqueue/dequeue operations. > + * While the per-CPU rq lock protects fast-path update operations, user-space > + * requests are serialized using a mutex to reduce the risk of conflicting > + * updates or API abuses. > + */ > +static DEFINE_MUTEX(uclamp_mutex); > + > /* Max allowed minimum utilization */ > unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE; > > @@ -1127,6 +1139,8 @@ static void __init init_uclamp(void) > unsigned int value; > int cpu; > > + mutex_init(&uclamp_mutex); > + > for_each_possible_cpu(cpu) { > memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq)); > cpu_rq(cpu)->uclamp_flags = 0; > @@ -6758,6 +6772,10 @@ static inline int alloc_uclamp_sched_group(struct > task_group *tg, > parent->uclamp[clamp_id].value; > tg->uclamp[clamp_id].bucket_id = > parent->uclamp[clamp_id].bucket_id; > + tg->uclamp[clamp_id].effective.value = > + parent->uclamp[clamp_id].effective.value; > + tg->uclamp[clamp_id].effective.bucket_id = > + parent->uclamp[clamp_id].effec
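A minimal sketch of the propagation rule described above, i.e. effective = min(group's own clamp, parent's effective clamp). The helper name is hypothetical; the actual patch also has to walk the whole subtree and refresh the CPU-side data, which is omitted here.

---8<---
static void tg_update_effective_clamp(struct task_group *tg, unsigned int clamp_id)
{
	struct task_group *parent = tg->parent;
	unsigned int value = tg->uclamp[clamp_id].value;

	/* A child can never get more than its parent's effective value. */
	if (parent)
		value = min(value, parent->uclamp[clamp_id].effective.value);

	tg->uclamp[clamp_id].effective.value = value;

	/* A real implementation would now propagate to tg's descendants. */
}
---8<---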
Re: [PATCH v7 01/15] sched/core: uclamp: Add CPU's clamp buckets refcounting
On Thu, Mar 14, 2019 at 8:41 AM Patrick Bellasi wrote: > > On 14-Mar 08:29, Suren Baghdasaryan wrote: > > On Thu, Mar 14, 2019 at 7:46 AM Patrick Bellasi > > wrote: > > > On 13-Mar 14:32, Suren Baghdasaryan wrote: > > > > On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi > > > > wrote: > > [...] > > > > > The rq::uclamp::bucket[clamp_id][] array is scanned every time we need > > > > > to find a new MAX aggregated clamp value for a clamp_id. This > > > > > operation > > > > > is required only when we dequeue the last task of a clamp bucket > > > > > tracking the current MAX aggregated clamp value. In these cases, the > > > > > CPU > > > > > is either entering IDLE or going to schedule a less boosted or more > > > > > clamped task. > > The following: > > > > > > The expected number of different clamp values, configured at build > > > > > time, > > > > > is small enough to fit the full unordered array into a single cache > > > > > line. > > will read: > > The expected number of different clamp values, configured at build time, > is small enough to fit the full unordered array into a single cache > line for the default UCLAMP_BUCKETS configuration of 7 buckets. I think keeping the default at 5 is good. As you mentioned, it's a nice round number, and keeping it at the minimum also hints that this is not a free resource: the more buckets you use, the more you pay. The documentation might say "to fit the full unordered array into a single cache line for configurations of up to 7 buckets". > [...] > > > Got it. From reading the documentation at the beginning my impression > > was that whatever value I choose within allowed 5-20 range it would > > still fit in a cache line. To disambiguate it might be worth > > mentioning that this is true for the default value or for values up to > > 7. Thanks! > > Right, I hope the above proposed change helps to clarify that. > > -- > #include > > Patrick Bellasi
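A quick user-space back-of-the-envelope check of the cache-line argument, assuming 8 bytes per packed (value, tasks) bucket and 8 bytes for the rq-wide value including padding, as in the pahole output quoted earlier:

---8<---
#include <stdio.h>

int main(void)
{
	for (int buckets = 5; buckets <= 20; buckets++) {
		/* rq-wide value (padded to 8) plus one 8-byte pair per bucket */
		int bytes = 8 + 8 * buckets;

		printf("%2d buckets -> %3d bytes (%s one 64B line)\n",
		       buckets, bytes, bytes <= 64 ? "fits" : "exceeds");
	}
	return 0;
}
---8<---

Under these assumptions 5 buckets take 48 bytes and 7 buckets exactly 64, which is where the "up to 7 buckets" wording comes from.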
Re: [RFC] simple_lmk: Introduce Simple Low Memory Killer for Android
On Thu, Mar 14, 2019 at 9:37 PM Daniel Colascione wrote: > > On Thu, Mar 14, 2019 at 8:16 PM Steven Rostedt wrote: > > > > On Thu, 14 Mar 2019 13:49:11 -0700 > > Sultan Alsawaf wrote: > > > > > Perhaps I'm missing something, but if you want to know when a process has > > > died > > > after sending a SIGKILL to it, then why not just make the SIGKILL > > > optionally > > > block until the process has died completely? It'd be rather trivial to > > > just > > > store a pointer to an onstack completion inside the victim process' > > > task_struct, > > > and then complete it in free_task(). > > > > How would you implement such a method in userspace? kill() doesn't take > > any parameters but the pid of the process you want to send a signal to, > > and the signal to send. This would require a new system call, and be > > quite a bit of work. > > That's what the pidfd work is for. Please read the original threads > about the motivation and design of that facility. > > > If you can solve this with an ebpf program, I > > strongly suggest you do that instead. > > Regarding process death notification: I will absolutely not support > putting aBPF and perf trace events on the critical path of core system > memory management functionality. Tracing and monitoring facilities are > great for learning about the system, but they were never intended to > be load-bearing. The proposed eBPF process-monitoring approach is just > a variant of the netlink proposal we discussed previously on the pidfd > threads; it has all of its drawbacks. We really need a core system > call --- really, we've needed robust process management since the > creation of unix --- and I'm glad that we're finally getting it. > Adding new system calls is not expensive; going to great lengths to > avoid adding one is like calling a helicopter to avoid crossing the > street. I don't think we should present an abuse of the debugging and > performance monitoring infrastructure as an alternative to a robust > and desperately-needed bit of core functionality that's neither hard > to add nor complex to implement nor expensive to use. > > Regarding the proposal for a new kernel-side lmkd: when possible, the > kernel should provide mechanism, not policy. Putting the low memory > killer back into the kernel after we've spent significant effort > making it possible for userspace to do that job. Compared to kernel > code, more easily understood, more easily debuggable, more easily > updated, and much safer. If we *can* move something out of the kernel, > we should. This patch moves us in exactly the wrong direction. Yes, we > need *something* that sits synchronously astride the page allocation > path and does *something* to stop a busy beaver allocator that eats > all the available memory before lmkd, even mlocked and realtime, can > respond. The OOM killer is adequate for this very rare case. > > With respect to kill timing: Tim is right about the need for two > levels of policy: first, a high-level process prioritization and > memory-demand balancing scheme (which is what OOM score adjustment > code in ActivityManager amounts to); and second, a low-level > process-killing methodology that maximizes sustainable memory reclaim > and minimizes unwanted side effects while killing those processes that > should be dead. Both of these policies belong in userspace --- because > they *can* be in userspace --- and userspace needs only a few tools, > most of which already exist, to do a perfectly adequate job. > > We do want killed processes to die promptly. 
That's why I support > boosting a process's priority somehow when lmkd is about to kill it. > The precise way in which we do that --- involving not only actual > priority, but scheduler knobs, cgroup assignment, core affinity, and > so on --- is a complex topic best left to userspace. lmkd already has > all the knobs it needs to implement whatever priority boosting policy > it wants. > > Hell, once we add a pidfd_wait --- which I plan to work on, assuming > nobody beats me to it, after pidfd_send_signal lands --- you can > imagine a general-purpose priority inheritance mechanism expediting > process death when a high-priority process waits on a pidfd_wait > handle for a condemned process. You know you're on the right track > design-wise when you start seeing this kind of elegant constructive > interference between seemingly-unrelated features. What we don't need > is some kind of blocking SIGKILL alternative or backdoor event > delivery system. When talking about pidfd_wait functionality do you mean something like this: https://lore.kernel.org/patchwork/patch/345098/ ? I missed the discussion about it, could you please point me to it? > We definitely don't want to have to wait for a process's parent to > reap it. Instead, we want to wait for it to become a zombie. That's > why I designed my original exithand patch to fire death notification > upon transition to the zombie state, not upon process table removal, > an
[PATCH v4 1/1] psi: introduce psi monitor
Psi monitor aims to provide a low-latency short-term pressure detection mechanism configurable by users. It allows users to monitor psi metrics growth and trigger events whenever a metric raises above user-defined threshold within user-defined time window. Time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when system enters stall state for the monitored psi metric and deactivate upon exit from the stall state. While system is in the stall state psi signal growth is monitored at a rate of 10 times per tracking window. Min window size is 500ms, therefore the min monitoring interval is 50ms. Max window size is 10s with monitoring interval of 1s. When activated psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window. Signed-off-by: Suren Baghdasaryan Signed-off-by: Johannes Weiner --- This is respin of: https://lwn.net/ml/linux-kernel/20190124211518.244221-1-surenb%40google.com/ First 4 patches in the series are in linux-next: 1. fs: kernfs: add poll file operation https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=6a78cef7ad8a1734477a1352dd04a97f1dc58a70 2. kernel: cgroup: add poll file operation https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c88177361203be291a49956b6c9d5ec164ea24b2 3. psi: introduce state_mask to represent stalled psi states https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=9d8a0c4a7f1c197de9c12bd53ef45fb6d273374e 4. psi: rename psi fields in preparation for psi trigger addition https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=0ef9bb049a4db519152a8664088f7ce34bbee5ac This patch can be cleanly applied either over linux-next tree (tag: next-20190201) or over linux-stable v5.0-rc4 after applying abovementioned 4 patches. 
Changes in v4: - Resolved conflict with "psi: fix aggregation idle shut-off" patch, as per Andrew Morton - Replaced smp_mb__after_atomic() with smp_mb() for proper ordering, as per Peter - Moved now=sched_clock() in psi_update_work() after mutex acquisition, as per Peter - Expanded comments to explain why smp_mb() is needed in psi_update_work, as per Peter - Fixed g->polling operation order in the diagram above psi_update_work(), as per Johannes - Merged psi_trigger_parse() into psi_trigger_create(), as per Johannes - Replaced list_del_init with list_del in psi_trigger_destroy(), as per Minchan - Replaced return value in get_recent_times and collect_percpu_times to return-by-parameter, as per Minchan - Renamed window_init into window_reset and reused it, as per Minchan - Replaced kzalloc with kmalloc, as per Minchan - Added explanation in psi.txt for min/max window size choices, as per Minchan - Misc variable name cleanups, as per Minchan and Johannes Documentation/accounting/psi.txt | 107 ++ include/linux/psi.h | 8 + include/linux/psi_types.h| 59 kernel/cgroup/cgroup.c | 95 +- kernel/sched/psi.c | 559 +-- 5 files changed, 794 insertions(+), 34 deletions(-) diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt index b8ca28b60215..4fb40fe94828 100644 --- a/Documentation/accounting/psi.txt +++ b/Documentation/accounting/psi.txt @@ -63,6 +63,110 @@ tracked and exported as well, to allow detection of latency spikes which wouldn't necessarily make a dent in the time averages, or to average trends over custom time frames. +Monitoring for pressure thresholds +== + +Users can register triggers and use poll() to be woken up when resource +pressure exceeds certain thresholds. + +A trigger describes the maximum cumulative stall time over a specific +time window, e.g. 100ms of total stall time within any 500ms window to +generate a wakeup event. + +To register a trigger user has to open psi interface file under +/proc/pressure/ representing the resource to be monitored and write the +desired threshold and time window. The open file descriptor should be +used to wait for trigger events using select(), poll() or epoll(). +The following format is used: + +<some|full> <stall amount in us> <time window in us> + +For example writing "some 150000 1000000" into /proc/pressure/memory +would add 150ms threshold for partial memory stall measured within +1sec time window. Writing "full 50000 1000000" into /proc/pressure/io +would add 50ms threshold for full io stall measured within 1sec time window. + +Triggers can be set on more than one psi metric and more than one trigger +for the same psi metric can be specified. However for each trigger a separate +file descriptor is required to be able to poll it separately from others, +ther
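A minimal user-space sketch of the interface described above: register a trigger for 150ms of partial memory stall within a 1s window and block in poll() waiting for events. Error handling is kept to a minimum, and the trigger string follows the format documented in this patch.

---8<---
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;

	fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		perror("open /proc/pressure/memory");
		return 1;
	}
	/* Register the trigger by writing it into the psi interface file. */
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		perror("write trigger");
		return 1;
	}
	fds.events = POLLPRI;

	while (1) {
		if (poll(&fds, 1, -1) < 0) {
			perror("poll");
			return 1;
		}
		if (fds.revents & POLLERR) {
			fprintf(stderr, "monitor went away\n");
			return 1;
		}
		if (fds.revents & POLLPRI)
			printf("memory pressure event\n");
	}
	return 0;
}
---8<---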
Re: [PATCH] psi: fix aggregation idle shut-off
Hi Andrew, On Mon, Jan 28, 2019 at 3:06 PM Andrew Morton wrote: > > On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner wrote: > > > psi has provisions to shut off the periodic aggregation worker when > > there is a period of no task activity - and thus no data that needs > > aggregating. However, while developing psi monitoring, Suren noticed > > that the aggregation clock currently won't stay shut off for good. > > > > Debugging this revealed a flaw in the idle design: an aggregation run > > will see no task activity and decide to go to sleep; shortly > > thereafter, the kworker thread that executed the aggregation will go > > idle and cause a scheduling change, during which the psi callback will > > kick the !pending worker again. This will ping-pong forever, and is > > equivalent to having no shut-off logic at all (but with more code!) > > > > Fix this by exempting aggregation workers from psi's clock waking > > logic when the state change is them going to sleep. To do this, tag > > workers with the last work function they executed, and if in psi we > > see a worker going to sleep after aggregating psi data, we will not > > reschedule the aggregation work item. > > > > What if the worker is also executing other items before or after? > > > > Any psi state times that were incurred by work items preceding the > > aggregation work will have been collected from the per-cpu buckets > > during the aggregation itself. If there are work items following the > > aggregation work, the worker's last_func tag will be overwritten and > > the aggregator will be kept alive to process this genuine new activity. > > > > If the aggregation work is the last thing the worker does, and we > > decide to go idle, the brief period of non-idle time incurred between > > the aggregation run and the kworker's dequeue will be stranded in the > > per-cpu buckets until the clock is woken by later activity. But that > > should not be a problem. The buckets can hold 4s worth of time, and > > future activity will wake the clock with a 2s delay, giving us 2s > > worth of data we can leave behind when disabling aggregation. If it > > takes a worker more than two seconds to go idle after it finishes its > > last work item, we likely have bigger problems in the system, and > > won't notice one sample that was averaged with a bogus per-CPU weight. > > > > --- a/kernel/sched/psi.c > > +++ b/kernel/sched/psi.c > > @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, > > int cpu, > > groupc->tasks[t]++; > > > > write_seqcount_end(&groupc->seq); > > - > > - if (!delayed_work_pending(&group->clock_work)) > > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > > } > > > > static struct psi_group *iterate_groups(struct task_struct *task, void > > **iter) > > This breaks Suren's "psi: introduce psi monitor": > > --- kernel/sched/psi.c~psi-introduce-psi-monitor > +++ kernel/sched/psi.c > @@ -752,8 +1012,25 @@ static void psi_group_change(struct psi_ > > write_seqcount_end(&groupc->seq); > > - if (!delayed_work_pending(&group->clock_work)) > - schedule_delayed_work(&group->clock_work, PSI_FREQ); > + /* > +* Polling flag resets to 0 at the max rate of once per update window > +* (at least 500ms interval). smp_wmb is required after group->polling > +* 0-to-1 transition to order groupc->times and group->polling writes > +* because stall detection logic in the slowpath relies on > groupc->times > +* changing before group->polling. Explicit smp_wmb is missing because > +* cmpxchg() implies smp_mb. 
> +*/ > + if ((state_mask & group->trigger_mask) && > + atomic_cmpxchg(&group->polling, 0, 1) == 0) { > + /* > +* Start polling immediately even if the work is already > +* scheduled > +*/ > + mod_delayed_work(system_wq, &group->clock_work, 1); > + } else { > + if (!delayed_work_pending(&group->clock_work)) > + schedule_delayed_work(&group->clock_work, PSI_FREQ); > + } > } > > and I'm too lazy to go in and figure out how to fix it. > > If we're sure about "psi: fix aggregation idle shut-off" (and I am not) > then can I ask for a redo of "psi: introduce psi monitor"? I resolved the conflict with "psi: introduce psi monitor" patch and posted v4 at https://lore.kernel.org/lkml/20190206023446.177362-1-sur...@google.com, however please be advised that it also includes additional cleanup changes yet to be reviewed. The first 4 patches in this series are already in linux-next, so this one should apply cleanly there. Please let me know if it creates any other issues. Thanks, Suren.
[RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
Add new SS_EXPEDITE flag to be used when sending SIGKILL via pidfd_send_signal() syscall to allow expedited memory reclaim of the victim process. The usage of this flag is currently limited to SIGKILL signal and only to privileged users. Signed-off-by: Suren Baghdasaryan --- include/linux/sched/signal.h | 3 ++- include/linux/signal.h | 11 ++- ipc/mqueue.c | 2 +- kernel/signal.c | 37 kernel/time/itimer.c | 2 +- 5 files changed, 43 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index e412c092c1e8..8a227633a058 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); extern void force_sigsegv(int sig, struct task_struct *p); extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *); extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp); -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid); +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, + bool expedite); extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *, const struct cred *); extern int kill_pgrp(struct pid *pid, int sig, int priv); diff --git a/include/linux/signal.h b/include/linux/signal.h index 9702016734b1..34b7852aa4a0 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); } while (0); #ifdef CONFIG_PROC_FS + +/* + * SS_FLAGS values used in pidfd_send_signal: + * + * SS_EXPEDITE indicates desire to expedite the operation. + */ +#define SS_EXPEDITE0x0001 + struct seq_file; extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); -#endif + +#endif /* CONFIG_PROC_FS */ #endif /* _LINUX_SIGNAL_H */ diff --git a/ipc/mqueue.c b/ipc/mqueue.c index aea30530c472..27c66296e08e 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) rcu_read_unlock(); kill_pid_info(info->notify.sigev_signo, - &sig_i, info->notify_owner); + &sig_i, info->notify_owner, false); break; case SIGEV_THREAD: set_cookie(info->notify_cookie, NOTIFY_WOKENUP); diff --git a/kernel/signal.c b/kernel/signal.c index f98448cf2def..02ed4332d17c 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -43,6 +43,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp) return success ? 0 : retval; } -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, + bool expedite) { int error = -ESRCH; struct task_struct *p; @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) for (;;) { rcu_read_lock(); p = pid_task(pid, PIDTYPE_PID); - if (p) + if (p) { error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); + + /* +* Ignore expedite_reclaim return value, it is best +* effort only. 
+*/ + if (!error && expedite) + expedite_reclaim(p); + } + rcu_read_unlock(); if (likely(!p || error != -ESRCH)) return error; @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid) { int error; rcu_read_lock(); - error = kill_pid_info(sig, info, find_vpid(pid)); + error = kill_pid_info(sig, info, find_vpid(pid), false); rcu_read_unlock(); return error; } @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) if (pid > 0) { rcu_read_lock(); - ret = kill_pid_info(sig, info, find_vpid(pid)); + ret = kill_pid_info(sig, info, find_vpid(pid), false); rcu_read_unlock(); return ret; } @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp); int kill_pid(struct pid *pid, int sig, int priv) { - return kill_pid_info(sig, __si_special(priv), pid); + return kill_pid_info(sig, __si_special(priv), pid, false); } EXPORT_
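For completeness, a user-space sketch of how a privileged killer daemon could invoke the proposed flag. SS_EXPEDITE and its 0x0001 value exist only in this RFC, the pidfd is a /proc/<pid> directory fd as pidfd_send_signal() required at the time, and __NR_pidfd_send_signal is assumed to be 424.

---8<---
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424	/* assumed syscall number */
#endif
#define SS_EXPEDITE 0x0001		/* from this RFC, not a mainline flag */

int main(int argc, char **argv)
{
	char path[64];
	int pidfd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s", argv[1]);

	/* /proc/<pid> directory fd acts as the pidfd. */
	pidfd = open(path, O_DIRECTORY | O_CLOEXEC);
	if (pidfd < 0) {
		perror("open pidfd");
		return 1;
	}
	if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, SS_EXPEDITE)) {
		perror("pidfd_send_signal");
		return 1;
	}
	printf("sent expedited SIGKILL\n");
	return 0;
}
---8<---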
[RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
Create an API to allow users outside of oom_kill.c to mark a victim and wake up oom_reaper thread for expedited memory reclaim of the process being killed. Signed-off-by: Suren Baghdasaryan --- include/linux/oom.h | 1 + mm/oom_kill.c | 15 +++ 2 files changed, 16 insertions(+) diff --git a/include/linux/oom.h b/include/linux/oom.h index d07992009265..6c043c7518c1 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -112,6 +112,7 @@ extern unsigned long oom_badness(struct task_struct *p, unsigned long totalpages); extern bool out_of_memory(struct oom_control *oc); +extern bool expedite_reclaim(struct task_struct *task); extern void exit_oom_victim(void); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3a2484884cfd..6449710c8a06 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc) return !!oc->chosen; } +bool expedite_reclaim(struct task_struct *task) +{ + bool res = false; + + task_lock(task); + if (task_will_free_mem(task)) { + mark_oom_victim(task); + wake_oom_reaper(task); + res = true; + } + task_unlock(task); + + return res; +} + /* * The pagefault handler calls here because it is out of memory, so kill a * memory-hogging task. If oom_lock is held by somebody else, a parallel oom -- 2.21.0.392.gf8f6787159e-goog
[RFC 0/2] opportunistic memory reclaim of a killed process
The time to kill a process and free its memory can be critical when a process is killed to prevent memory shortages from affecting system responsiveness. In the case of Android, where processes can be restarted easily, killing a less important background process is preferred to delaying or throttling an interactive foreground process. At the same time, unnecessary kills should be avoided as they cause delays when the killed process is needed again. This requires a balanced decision from the system software about how long a kill can be postponed in the hope that memory usage will decrease without such drastic measures.

As killing a process and reclaiming its memory is not an instant operation, a margin of free memory has to be maintained to prevent system performance deterioration while the killed process's memory is being reclaimed. The size of this margin depends on the minimum reclaim rate needed to cover the worst-case scenario, and this minimum rate should be deterministic. Note that on asymmetric architectures like ARM big.LITTLE the reclaim rate can vary dramatically depending on which core it’s performed at (see test results). It is a common scenario for a non-essential victim process to be restricted to a less performant or throttled CPU for power-saving purposes, which makes the worst-case reclaim rate scenario very probable. Cases when the victim’s memory reclaim is delayed further, because the process is blocked in an uninterruptible sleep or is performing a time-consuming operation, make the reclaim time even more unpredictable.

Increasing the memory reclaim rate and making it more deterministic would allow for a smaller free memory margin and would lead to more opportunities to avoid killing a process. Note that while other strategies like throttling memory allocations are viable and can be employed for non-essential processes, they would affect user experience if applied to an interactive process.

The proposed solution uses the existing oom_reaper thread to increase the memory reclaim rate of a killed process and to make this rate more deterministic. The proposed solution is by no means considered the best; it was chosen because it was simple to implement and allowed for test data collection. The downside of this solution is that it requires an additional “expedite” hint for something which has to be fast in all cases. It would be great to find a way that does not require additional hints. Other possible approaches include:
- Implementing a dedicated syscall to perform opportunistic reclaim in the context of the process waiting for the victim’s death. A natural boost bonus occurs if the waiting process has high or RT priority and is not limited by cpuset cgroup in its CPU choices.
- Implementing a mechanism that performs opportunistic reclaim unconditionally whenever it is possible (similar to the checks in task_will_free_mem()).
- Implementing opportunistic reclaim that uses the shrinker interface, PSI or other memory pressure indications as a hint to engage.

Test details: Tests are performed on a Qualcomm® Snapdragon™ 845 8-core ARM big.LITTLE system with 4 little cores (0.3-1.6GHz) and 4 big cores (0.8-2.5GHz) running Android. Memory reclaim speed was measured using signal/signal_generate, kmem/rss_stat and sched/sched_process_exit traces.
Test results:

powersave governor, min freq
                 normal kills     expedited kills
little            856 MB/sec       3236 MB/sec
big              5084 MB/sec       6144 MB/sec

performance governor, max freq
                 normal kills     expedited kills
little           5602 MB/sec       8144 MB/sec
big             14656 MB/sec      12398 MB/sec

schedutil governor (default)
                 normal kills     expedited kills
little           2386 MB/sec       3908 MB/sec
big              7282 MB/sec       6820-16386 MB/sec
=====================================================
min reclaim speed:
                  856 MB/sec       3236 MB/sec

The patches are based on 5.1-rc1

Suren Baghdasaryan (2):
  mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
  signal: extend pidfd_send_signal() to allow expedited process killing

 include/linux/oom.h          |  1 +
 include/linux/sched/signal.h |  3 ++-
 include/linux/signal.h       | 11 ++-
 ipc/mqueue.c                 |  2 +-
 kernel/signal.c              | 37 
 kernel/time/itimer.c         |  2 +-
 mm/oom_kill.c                | 15 +++
 7 files changed, 59 insertions(+), 12 deletions(-)

-- 
2.21.0.392.gf8f6787159e-goog
Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
Thanks for the feedback! Just to be clear, this implementation is used in this RFC as a reference to explain the intent. To be honest I don't think it will be adopted as is even if the idea survives scrutiny. On Thu, Apr 11, 2019 at 3:30 AM Christian Brauner wrote: > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > victim process. The usage of this flag is currently limited to SIGKILL > > signal and only to privileged users. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > include/linux/sched/signal.h | 3 ++- > > include/linux/signal.h | 11 ++- > > ipc/mqueue.c | 2 +- > > kernel/signal.c | 37 > > kernel/time/itimer.c | 2 +- > > 5 files changed, 43 insertions(+), 12 deletions(-) > > > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > > index e412c092c1e8..8a227633a058 100644 > > --- a/include/linux/sched/signal.h > > +++ b/include/linux/sched/signal.h > > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, > > struct task_struct *); > > extern void force_sigsegv(int sig, struct task_struct *p); > > extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct > > *); > > extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct > > pid *pgrp); > > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid > > *pid); > > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid > > *pid, > > + bool expedite); > > extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid > > *, > > const struct cred *); > > extern int kill_pgrp(struct pid *pid, int sig, int priv); > > diff --git a/include/linux/signal.h b/include/linux/signal.h > > index 9702016734b1..34b7852aa4a0 100644 > > --- a/include/linux/signal.h > > +++ b/include/linux/signal.h > > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); > > } while (0); > > > > #ifdef CONFIG_PROC_FS > > + > > +/* > > + * SS_FLAGS values used in pidfd_send_signal: > > + * > > + * SS_EXPEDITE indicates desire to expedite the operation. > > + */ > > +#define SS_EXPEDITE 0x0001 > > Does this make sense as an SS_* flag? > How does this relate to the signal stack? It doesn't, so I agree that the name should be changed. PIDFD_SIGNAL_EXPEDITE_MM_RECLAIM would seem appropriate. > Is there any intention to ever use this flag with stack_t? > > New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be > PIDFD_SIGNAL_THREAD.) > And since this is exposed to userspace in contrast to the mm internal > naming it should be something more easily understandable like > PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something. 
> > > + > > struct seq_file; > > extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); > > -#endif > > + > > +#endif /* CONFIG_PROC_FS */ > > > > #endif /* _LINUX_SIGNAL_H */ > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > > index aea30530c472..27c66296e08e 100644 > > --- a/ipc/mqueue.c > > +++ b/ipc/mqueue.c > > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) > > rcu_read_unlock(); > > > > kill_pid_info(info->notify.sigev_signo, > > - &sig_i, info->notify_owner); > > + &sig_i, info->notify_owner, false); > > break; > > case SIGEV_THREAD: > > set_cookie(info->notify_cookie, NOTIFY_WOKENUP); > > diff --git a/kernel/signal.c b/kernel/signal.c > > index f98448cf2def..02ed4332d17c 100644 > > --- a/kernel/signal.c > > +++ b/kernel/signal.c > > @@ -43,6 +43,7 @@ > > #include > > #include > > #include > > +#include > > > > #define CREATE_TRACE_POINTS > > #include > > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo > > *info, struct pid *pgrp) > > return success ? 0 : retval; > > } > > > > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > +int kill_pid_info(int sig, struct kernel_siginfo *info
Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
On Thu, Apr 11, 2019 at 8:18 AM Suren Baghdasaryan wrote: > > Thanks for the feedback! > Just to be clear, this implementation is used in this RFC as a > reference to explain the intent. To be honest I don't think it will be > adopted as is even if the idea survives scrutiny. > > On Thu, Apr 11, 2019 at 3:30 AM Christian Brauner > wrote: > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > victim process. The usage of this flag is currently limited to SIGKILL > > > signal and only to privileged users. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > include/linux/sched/signal.h | 3 ++- > > > include/linux/signal.h | 11 ++- > > > ipc/mqueue.c | 2 +- > > > kernel/signal.c | 37 > > > kernel/time/itimer.c | 2 +- > > > 5 files changed, 43 insertions(+), 12 deletions(-) > > > > > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > > > index e412c092c1e8..8a227633a058 100644 > > > --- a/include/linux/sched/signal.h > > > +++ b/include/linux/sched/signal.h > > > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo > > > *, struct task_struct *); > > > extern void force_sigsegv(int sig, struct task_struct *p); > > > extern int force_sig_info(int, struct kernel_siginfo *, struct > > > task_struct *); > > > extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct > > > pid *pgrp); > > > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct > > > pid *pid); > > > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct > > > pid *pid, > > > + bool expedite); > > > extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct > > > pid *, > > > const struct cred *); > > > extern int kill_pgrp(struct pid *pid, int sig, int priv); > > > diff --git a/include/linux/signal.h b/include/linux/signal.h > > > index 9702016734b1..34b7852aa4a0 100644 > > > --- a/include/linux/signal.h > > > +++ b/include/linux/signal.h > > > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); > > > } while (0); > > > > > > #ifdef CONFIG_PROC_FS > > > + > > > +/* > > > + * SS_FLAGS values used in pidfd_send_signal: > > > + * > > > + * SS_EXPEDITE indicates desire to expedite the operation. > > > + */ > > > +#define SS_EXPEDITE 0x0001 > > > > Does this make sense as an SS_* flag? > > How does this relate to the signal stack? > > It doesn't, so I agree that the name should be changed. > PIDFD_SIGNAL_EXPEDITE_MM_RECLAIM would seem appropriate. > > > Is there any intention to ever use this flag with stack_t? > > > > New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be > > PIDFD_SIGNAL_THREAD.) > > And since this is exposed to userspace in contrast to the mm internal > > naming it should be something more easily understandable like > > PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something. 
> > > > > + > > > struct seq_file; > > > extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); > > > -#endif > > > + > > > +#endif /* CONFIG_PROC_FS */ > > > > > > #endif /* _LINUX_SIGNAL_H */ > > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > > > index aea30530c472..27c66296e08e 100644 > > > --- a/ipc/mqueue.c > > > +++ b/ipc/mqueue.c > > > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info > > > *info) > > > rcu_read_unlock(); > > > > > > kill_pid_info(info->notify.sigev_signo, > > > - &sig_i, info->notify_owner); > > > + &sig_i, info->notify_owner, false); > > > break; > > > case SIGEV_THREAD: > > > set_cookie(info->notify_cookie, NOTIFY_WOKENUP); > > > diff --git a/kernel/signal.c b/kernel/signal.c > > > index f98448cf2def..02ed4332d17c 100644 > > > --- a/kernel
Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation
On Wed, Feb 10, 2021 at 5:06 AM Daniel Vetter wrote: > > On Tue, Feb 09, 2021 at 12:16:51PM -0800, Suren Baghdasaryan wrote: > > On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter wrote: > > > > > > On Tue, Feb 9, 2021 at 6:46 PM Christian König > > > wrote: > > > > > > > > > > > > > > > > Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan: > > > > > On Tue, Feb 9, 2021 at 4:57 AM Christian König > > > > > wrote: > > > > >> Am 09.02.21 um 13:11 schrieb Christian König: > > > > >>> [SNIP] > > > > >>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct page > > > > >>>>>> *page) > > > > >>>>>> +{ > > > > >>>>>> + spin_lock(&pool->lock); > > > > >>>>>> + list_add_tail(&page->lru, &pool->items); > > > > >>>>>> + pool->count++; > > > > >>>>>> + atomic_long_add(1 << pool->order, &total_pages); > > > > >>>>>> + spin_unlock(&pool->lock); > > > > >>>>>> + > > > > >>>>>> + mod_node_page_state(page_pgdat(page), > > > > >>>>>> NR_KERNEL_MISC_RECLAIMABLE, > > > > >>>>>> + 1 << pool->order); > > > > >>>>> Hui what? What should that be good for? > > > > >>>> This is a carryover from the ION page pool implementation: > > > > >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28&data=04%7C01%7Cchristian.koenig%40amd.com%7Cdff8edcd4d147a5b08d8cd20cff2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484888114923580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=9%2BIBC0tezSV6Ci4S3kWfW%2BQvJm4mdunn3dF6C0kyfCw%3D&reserved=0 > > > > >>>> > > > > >>>> > > > > >>>> My sense is it helps with the vmstat/meminfo accounting so folks > > > > >>>> can > > > > >>>> see the cached pages are shrinkable/freeable. This maybe falls > > > > >>>> under > > > > >>>> other dmabuf accounting/stats discussions, so I'm happy to remove > > > > >>>> it > > > > >>>> for now, or let the drivers using the shared page pool logic handle > > > > >>>> the accounting themselves? > > > > >> Intentionally separated the discussion for that here. > > > > >> > > > > >> As far as I can see this is just bluntly incorrect. > > > > >> > > > > >> Either the page is reclaimable or it is part of our pool and freeable > > > > >> through the shrinker, but never ever both. > > > > > IIRC the original motivation for counting ION pooled pages as > > > > > reclaimable was to include them into /proc/meminfo's MemAvailable > > > > > calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable > > > > > non-slab kernel pages" seems like a good place to account for them but > > > > > I might be wrong. > > > > > > > > Yeah, that's what I see here as well. But exactly that is utterly > > > > nonsense. > > > > > > > > Those pages are not "free" in the sense that get_free_page could return > > > > them directly. > > > > > > Well on Android that is kinda true, because Android has it's > > > oom-killer (way back it was just a shrinker callback, not sure how it > > > works now), which just shot down all the background apps. So at least > > > some of that (everything used by background apps) is indeed > > > reclaimable on Android. > > > > > > But that doesn't hold on Linux in general, so we can't really do this > > > for common code. > > > > > > Also I had a long meeting with Suren, John and other googles > > > yesterday, and the aim is now to try and support all the Android gpu > > > memory accounting needs with cgroups. 
That should work, and it will > > > allow Android to handle all the Androi
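For reference, the accounting being debated in this thread pairs each pool insertion with a global counter and a per-node NR_KERNEL_MISC_RECLAIMABLE update. Below is a sketch of the add path from the quoted diff plus an inferred, hypothetical remove path that undoes the same bookkeeping; the open question above is whether the vmstat update belongs here at all.

/* Sketch: add path as in the quoted diff, remove path hypothetical. */
#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>
#include <linux/vmstat.h>

static atomic_long_t total_pages;

struct drm_page_pool {
	spinlock_t lock;
	struct list_head items;
	unsigned long count;
	unsigned int order;
};

void drm_page_pool_add(struct drm_page_pool *pool, struct page *page)
{
	spin_lock(&pool->lock);
	list_add_tail(&page->lru, &pool->items);
	pool->count++;
	atomic_long_add(1 << pool->order, &total_pages);
	spin_unlock(&pool->lock);

	/* The contested accounting: advertises pooled pages as reclaimable,
	 * which is what pulls them into MemAvailable. */
	mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
			    1 << pool->order);
}

/* Hypothetical counterpart: whatever shrinker or free path takes pages
 * back out of the pool has to reverse the same counters. */
struct page *drm_page_pool_remove(struct drm_page_pool *pool)
{
	struct page *page;

	spin_lock(&pool->lock);
	page = list_first_entry_or_null(&pool->items, struct page, lru);
	if (page) {
		list_del(&page->lru);
		pool->count--;
		atomic_long_sub(1 << pool->order, &total_pages);
	}
	spin_unlock(&pool->lock);

	if (page)
		mod_node_page_state(page_pgdat(page),
				    NR_KERNEL_MISC_RECLAIMABLE,
				    -(1 << pool->order));
	return page;
}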
Re: [PATCH] dma-buf: system_heap: do not warn for costly allocation
The code looks fine to me. Description needs a bit polishing :) On Wed, Feb 10, 2021 at 8:26 AM Minchan Kim wrote: > > Linux VM is not hard to support PAGE_ALLOC_COSTLY_ODER allocation > so normally expects driver passes __GFP_NOWARN in that case > if they has fallback options. > > system_heap in dmabuf is the case so do not flood into demsg > with the warning for recording more precious information logs. > (below is ION warning example I got but dmabuf system heap is > nothing different). Suggestion: Dmabuf system_heap allocation logic starts with the highest necessary allocation order before falling back to lower orders. The requested order can be higher than PAGE_ALLOC_COSTLY_ODER and failures to allocate will flood dmesg with warnings. Such high-order allocations are not unexpected and are handled by the system_heap's allocation fallback mechanism. Prevent these warnings when allocating higher than PAGE_ALLOC_COSTLY_ODER pages using __GFP_NOWARN flag. Below is ION warning example I got but dmabuf system heap is nothing different: > > [ 1233.911533][ T460] warn_alloc: 11 callbacks suppressed > [ 1233.911539][ T460] allocator@2.0-s: page allocation failure: order:4, > mode:0x140dc2(GFP_HIGHUSER|__GFP_COMP|__GFP_ZERO), > nodemask=(null),cpuset=/,mems_allowed=0 > [ 1233.926235][ T460] Call trace: > [ 1233.929370][ T460] dump_backtrace+0x0/0x1d8 > [ 1233.933704][ T460] show_stack+0x18/0x24 > [ 1233.937701][ T460] dump_stack+0xc0/0x140 > [ 1233.941783][ T460] warn_alloc+0xf4/0x148 > [ 1233.945862][ T460] __alloc_pages_slowpath+0x9fc/0xa10 > [ 1233.951101][ T460] __alloc_pages_nodemask+0x278/0x2c0 > [ 1233.956285][ T460] ion_page_pool_alloc+0xd8/0x100 > [ 1233.961144][ T460] ion_system_heap_allocate+0xbc/0x2f0 > [ 1233.966440][ T460] ion_buffer_create+0x68/0x274 > [ 1233.971130][ T460] ion_buffer_alloc+0x8c/0x110 > [ 1233.975733][ T460] ion_dmabuf_alloc+0x44/0xe8 > [ 1233.980248][ T460] ion_ioctl+0x100/0x320 > [ 1233.984332][ T460] __arm64_sys_ioctl+0x90/0xc8 > [ 1233.988934][ T460] el0_svc_common+0x9c/0x168 > [ 1233.993360][ T460] do_el0_svc+0x1c/0x28 > [ 1233.997358][ T460] el0_sync_handler+0xd8/0x250 > [ 1234.001989][ T460] el0_sync+0x148/0x180 > > Signed-off-by: Minchan Kim > --- > drivers/dma-buf/heaps/system_heap.c | 9 +++-- > 1 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/dma-buf/heaps/system_heap.c > b/drivers/dma-buf/heaps/system_heap.c > index 29e49ac17251..33c25a5e06f9 100644 > --- a/drivers/dma-buf/heaps/system_heap.c > +++ b/drivers/dma-buf/heaps/system_heap.c > @@ -40,7 +40,7 @@ struct dma_heap_attachment { > bool mapped; > }; > > -#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \ > +#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO \ > | __GFP_NORETRY) & ~__GFP_RECLAIM) \ > | __GFP_COMP) > #define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP) > @@ -315,6 +315,7 @@ static struct page *alloc_largest_available(unsigned long > size, > unsigned int max_order) > { > struct page *page; > + unsigned long gfp_flags; > int i; > > for (i = 0; i < NUM_ORDERS; i++) { > @@ -323,7 +324,11 @@ static struct page *alloc_largest_available(unsigned > long size, > if (max_order < orders[i]) > continue; > > - page = alloc_pages(order_flags[i], orders[i]); > + gfp_flags = order_flags[i]; > + if (orders[i] > PAGE_ALLOC_COSTLY_ORDER) > + gfp_flags |= __GFP_NOWARN; > + > + page = alloc_pages(gfp_flags, orders[i]); > if (!page) > continue; > return page; > -- > 2.30.0.478.g8a0d178c01-goog >
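Condensed, the fallback logic the description refers to ends up looking like this after the patch; the orders[], NUM_ORDERS and order_flags[] values below are illustrative rather than copied from the driver.

/* Condensed sketch of the allocation fallback after the patch above. */
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>

#define HIGH_ORDER_GFP  (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NORETRY) \
			  & ~__GFP_RECLAIM) | __GFP_COMP)
#define LOW_ORDER_GFP   (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)

static const unsigned int orders[] = {8, 4, 0};		/* example values */
static gfp_t order_flags[] = {HIGH_ORDER_GFP, LOW_ORDER_GFP, LOW_ORDER_GFP};
#define NUM_ORDERS ARRAY_SIZE(orders)

static struct page *alloc_largest_available(unsigned long size,
					    unsigned int max_order)
{
	struct page *page;
	gfp_t gfp_flags;
	int i;

	for (i = 0; i < NUM_ORDERS; i++) {
		if (size < (PAGE_SIZE << orders[i]))
			continue;
		if (max_order < orders[i])
			continue;

		gfp_flags = order_flags[i];
		/* Failure at a costly order is expected and handled by
		 * falling back to a smaller order, so skip the dmesg warning. */
		if (orders[i] > PAGE_ALLOC_COSTLY_ORDER)
			gfp_flags |= __GFP_NOWARN;

		page = alloc_pages(gfp_flags, orders[i]);
		if (!page)
			continue;
		return page;
	}
	return NULL;
}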
Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation
On Wed, Feb 10, 2021 at 9:21 AM Daniel Vetter wrote: > > On Wed, Feb 10, 2021 at 5:39 PM Suren Baghdasaryan wrote: > > > > On Wed, Feb 10, 2021 at 5:06 AM Daniel Vetter wrote: > > > > > > On Tue, Feb 09, 2021 at 12:16:51PM -0800, Suren Baghdasaryan wrote: > > > > On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter wrote: > > > > > > > > > > On Tue, Feb 9, 2021 at 6:46 PM Christian König > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan: > > > > > > > On Tue, Feb 9, 2021 at 4:57 AM Christian König > > > > > > > wrote: > > > > > > >> Am 09.02.21 um 13:11 schrieb Christian König: > > > > > > >>> [SNIP] > > > > > > >>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct > > > > > > >>>>>> page *page) > > > > > > >>>>>> +{ > > > > > > >>>>>> + spin_lock(&pool->lock); > > > > > > >>>>>> + list_add_tail(&page->lru, &pool->items); > > > > > > >>>>>> + pool->count++; > > > > > > >>>>>> + atomic_long_add(1 << pool->order, &total_pages); > > > > > > >>>>>> + spin_unlock(&pool->lock); > > > > > > >>>>>> + > > > > > > >>>>>> + mod_node_page_state(page_pgdat(page), > > > > > > >>>>>> NR_KERNEL_MISC_RECLAIMABLE, > > > > > > >>>>>> + 1 << pool->order); > > > > > > >>>>> Hui what? What should that be good for? > > > > > > >>>> This is a carryover from the ION page pool implementation: > > > > > > >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28&data=04%7C01%7Cchristian.koenig%40amd.com%7Cdff8edcd4d147a5b08d8cd20cff2%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637484888114923580%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=9%2BIBC0tezSV6Ci4S3kWfW%2BQvJm4mdunn3dF6C0kyfCw%3D&reserved=0 > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> My sense is it helps with the vmstat/meminfo accounting so > > > > > > >>>> folks can > > > > > > >>>> see the cached pages are shrinkable/freeable. This maybe falls > > > > > > >>>> under > > > > > > >>>> other dmabuf accounting/stats discussions, so I'm happy to > > > > > > >>>> remove it > > > > > > >>>> for now, or let the drivers using the shared page pool logic > > > > > > >>>> handle > > > > > > >>>> the accounting themselves? > > > > > > >> Intentionally separated the discussion for that here. > > > > > > >> > > > > > > >> As far as I can see this is just bluntly incorrect. > > > > > > >> > > > > > > >> Either the page is reclaimable or it is part of our pool and > > > > > > >> freeable > > > > > > >> through the shrinker, but never ever both. > > > > > > > IIRC the original motivation for counting ION pooled pages as > > > > > > > reclaimable was to include them into /proc/meminfo's MemAvailable > > > > > > > calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable > > > > > > > non-slab kernel pages" seems like a good place to account for > > > > > > > them but > > > > > > > I might be wrong. > > > > > > > > > > > > Yeah, that's what I see here as well. But exactly that is utterly > > > > > > nonsense. > > > > > > > > > > > > Those pages are not "free" in the sense that get_free_page could > > > > > > return > > > &
Re: [RFC][PATCH v6 1/7] drm: Add a sharable drm page-pool implementation
On Wed, Feb 10, 2021 at 10:32 AM Christian König wrote: > > > > Am 10.02.21 um 17:39 schrieb Suren Baghdasaryan: > > On Wed, Feb 10, 2021 at 5:06 AM Daniel Vetter wrote: > >> On Tue, Feb 09, 2021 at 12:16:51PM -0800, Suren Baghdasaryan wrote: > >>> On Tue, Feb 9, 2021 at 12:03 PM Daniel Vetter wrote: > >>>> On Tue, Feb 9, 2021 at 6:46 PM Christian König > >>>> wrote: > >>>>> > >>>>> > >>>>> Am 09.02.21 um 18:33 schrieb Suren Baghdasaryan: > >>>>>> On Tue, Feb 9, 2021 at 4:57 AM Christian König > >>>>>> wrote: > >>>>>>> Am 09.02.21 um 13:11 schrieb Christian König: > >>>>>>>> [SNIP] > >>>>>>>>>>> +void drm_page_pool_add(struct drm_page_pool *pool, struct page > >>>>>>>>>>> *page) > >>>>>>>>>>> +{ > >>>>>>>>>>> + spin_lock(&pool->lock); > >>>>>>>>>>> + list_add_tail(&page->lru, &pool->items); > >>>>>>>>>>> + pool->count++; > >>>>>>>>>>> + atomic_long_add(1 << pool->order, &total_pages); > >>>>>>>>>>> + spin_unlock(&pool->lock); > >>>>>>>>>>> + > >>>>>>>>>>> + mod_node_page_state(page_pgdat(page), > >>>>>>>>>>> NR_KERNEL_MISC_RECLAIMABLE, > >>>>>>>>>>> + 1 << pool->order); > >>>>>>>>>> Hui what? What should that be good for? > >>>>>>>>> This is a carryover from the ION page pool implementation: > >>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Ftree%2Fdrivers%2Fstaging%2Fandroid%2Fion%2Fion_page_pool.c%3Fh%3Dv5.10%23n28&data=04%7C01%7Cchristian.koenig%40amd.com%7Cbb7155447ee149a49f3a08d8cde2685d%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637485719618339413%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=IYsJoAd7SUo12V7tS3CCRqNVm569iy%2FtoXQqm2MdC1g%3D&reserved=0 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> My sense is it helps with the vmstat/meminfo accounting so folks can > >>>>>>>>> see the cached pages are shrinkable/freeable. This maybe falls under > >>>>>>>>> other dmabuf accounting/stats discussions, so I'm happy to remove it > >>>>>>>>> for now, or let the drivers using the shared page pool logic handle > >>>>>>>>> the accounting themselves? > >>>>>>> Intentionally separated the discussion for that here. > >>>>>>> > >>>>>>> As far as I can see this is just bluntly incorrect. > >>>>>>> > >>>>>>> Either the page is reclaimable or it is part of our pool and freeable > >>>>>>> through the shrinker, but never ever both. > >>>>>> IIRC the original motivation for counting ION pooled pages as > >>>>>> reclaimable was to include them into /proc/meminfo's MemAvailable > >>>>>> calculations. NR_KERNEL_MISC_RECLAIMABLE defined as "reclaimable > >>>>>> non-slab kernel pages" seems like a good place to account for them but > >>>>>> I might be wrong. > >>>>> Yeah, that's what I see here as well. But exactly that is utterly > >>>>> nonsense. > >>>>> > >>>>> Those pages are not "free" in the sense that get_free_page could return > >>>>> them directly. > >>>> Well on Android that is kinda true, because Android has it's > >>>> oom-killer (way back it was just a shrinker callback, not sure how it > >>>> works now), which just shot down all the background apps. So at least > >>>> some of that (everything used by background apps) is indeed > >>>> reclaimable on Android. > >>>> > >>>> But that doesn't hold on Linux in general, so we can't really do this > >>>> for common code. > >>>> > >>>
Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page
On Wed, Feb 17, 2021 at 11:55 PM Michael Kerrisk (man-pages) wrote: > > Hello Suren, > > >> Thanks. I added a few words to clarify this.> > > Any link where I can see the final version? > > Sure: > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/process_madvise.2 > > Also rendered below. Looks great. Thanks for improving it, Michael! > > Thanks, > > Michael > > NAME >process_madvise - give advice about use of memory to a process > > SYNOPSIS >#include > >ssize_t process_madvise(int pidfd, const struct iovec *iovec, >size_t vlen, int advice, >unsigned int flags); > >Note: There is no glibc wrapper for this system call; see NOTES. > > DESCRIPTION >The process_madvise() system call is used to give advice or direc‐ >tions to the kernel about the address ranges of another process or >of the calling process. It provides the advice for the address >ranges described by iovec and vlen. The goal of such advice is to >improve system or application performance. > >The pidfd argument is a PID file descriptor (see pidfd_open(2)) >that specifies the process to which the advice is to be applied. > >The pointer iovec points to an array of iovec structures, defined >in as: > >struct iovec { >void *iov_base;/* Starting address */ >size_t iov_len; /* Length of region */ >}; > >The iovec structure describes address ranges beginning at iov_base >address and with the size of iov_len bytes. > >The vlen specifies the number of elements in the iovec structure. >This value must be less than or equal to IOV_MAX (defined in its.h> or accessible via the call sysconf(_SC_IOV_MAX)). > >The advice argument is one of the following values: > >MADV_COLD > See madvise(2). > >MADV_PAGEOUT > See madvise(2). > >The flags argument is reserved for future use; currently, this ar‐ >gument must be specified as 0. > >The vlen and iovec arguments are checked before applying any ad‐ >vice. If vlen is too big, or iovec is invalid, then an error will >be returned immediately and no advice will be applied. > >The advice might be applied to only a part of iovec if one of its >elements points to an invalid memory region in the remote process. >No further elements will be processed beyond that point. (See the >discussion regarding partial advice in RETURN VALUE.) > >Permission to apply advice to another process is governed by a >ptrace access mode PTRACE_MODE_READ_REALCREDS check (see >ptrace(2)); in addition, because of the performance implications >of applying the advice, the caller must have the CAP_SYS_ADMIN ca‐ >pability. > > RETURN VALUE >On success, process_madvise() returns the number of bytes advised. >This return value may be less than the total number of requested >bytes, if an error occurred after some iovec elements were already >processed. The caller should check the return value to determine >whether a partial advice occurred. > >On error, -1 is returned and errno is set to indicate the error. > > ERRORS >EBADF pidfd is not a valid PID file descriptor. > >EFAULT The memory described by iovec is outside the accessible ad‐ > dress space of the process referred to by pidfd. > >EINVAL flags is not 0. > >EINVAL The sum of the iov_len values of iovec overflows a ssize_t > value. > >EINVAL vlen is too large. > >ENOMEM Could not allocate memory for internal copies of the iovec > structures. > >EPERM The caller does not have permission to access the address > space of the process pidfd. > >ESRCH The target process does not exist (i.e., it has terminated > and been waited on). 
> > VERSIONS >This system call first appeared in Linux 5.10. Support for this >system call is optional, depending on the setting of the >CONFIG_ADVISE_SYSCALLS configuration option. > > CONFORMING TO >The process_madvise() system call is Linux-specific. > > NOTES >Glibc does not provide a wrapper for this system call; call it >using syscall(2). > > SEE ALSO >madvise(2), pidfd_open(2), process_vm_readv(2), >process_vm_writev(2) > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/
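Given the NOTES section above, a minimal, hypothetical caller looks roughly like the following. The raw syscall numbers are the asm-generic ones and the MADV_COLD fallback definition matches the kernel uapi value; both are worth double-checking against your headers.

/* Hypothetical caller for the page above: apply MADV_COLD to a range in
 * another process via the raw syscalls (no glibc wrappers assumed). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20	/* uapi/asm-generic/mman-common.h */
#endif

int main(int argc, char *argv[])
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <start-addr> <length>\n", argv[0]);
		return 1;
	}

	pid_t pid = (pid_t)atol(argv[1]);
	struct iovec iov = {
		.iov_base = (void *)strtoul(argv[2], NULL, 0),
		.iov_len  = strtoul(argv[3], NULL, 0),
	};

	int pidfd = syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* vlen == 1; flags must be 0 as documented above. */
	ssize_t advised = syscall(__NR_process_madvise, pidfd, &iov, 1UL,
				  MADV_COLD, 0U);
	if (advised < 0)
		perror("process_madvise");
	else
		printf("advised %zd of %zu bytes\n", advised, iov.iov_len);

	close(pidfd);
	return advised < 0;
}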
Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm
On Tue, Feb 2, 2021 at 12:51 AM Christoph Hellwig wrote: > > On Tue, Feb 02, 2021 at 12:44:44AM -0800, Suren Baghdasaryan wrote: > > On Mon, Feb 1, 2021 at 11:03 PM Christoph Hellwig > > wrote: > > > > > > IMHO the > > > > > > BUG_ON(vma->vm_flags & VM_PFNMAP); > > > > > > in vm_insert_page should just become a WARN_ON_ONCE with an error > > > return, and then we just need to gradually fix up the callers that > > > trigger it instead of coming up with workarounds like this. > > > > For the existing vm_insert_page users this should be fine since > > BUG_ON() guarantees that none of them sets VM_PFNMAP. > > Even for them WARN_ON_ONCE plus an actual error return is a way > better assert that is much developer friendly. Agree. > > > However, for the > > system_heap_mmap I have one concern. When vm_insert_page returns an > > error due to VM_PFNMAP flag, the whole mmap operation should fail > > (system_heap_mmap returning an error leading to dma_buf_mmap failure). > > Could there be cases when a heap user (DRM driver for example) would > > be expected to work with a heap which requires VM_PFNMAP and at the > > same time with another heap which requires !VM_PFNMAP? IOW, this > > introduces a dependency between the heap and its > > user. The user would have to know expectations of the heap it uses and > > can't work with another heap that has the opposite expectation. This > > usecase is purely theoretical and maybe I should not worry about it > > for now? > > If such a case ever arises we can look into it. Sounds good. I'll prepare a new patch and will post it later today. Thanks!
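For context, the failure path being agreed on above propagates naturally through dma_buf_mmap(); a rough sketch of a hypothetical importer's mmap handler, where only dma_buf_mmap() is a real API:

/* Sketch of the propagation path discussed above. The importer and its
 * private_data layout are hypothetical. */
#include <linux/dma-buf.h>
#include <linux/fs.h>
#include <linux/mm.h>

static int example_importer_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct dma_buf *dmabuf = filp->private_data;	/* assumption */

	/*
	 * dma_buf_mmap() forwards to the exporter's mmap op (here,
	 * system_heap_mmap). With the WARN_ON_ONCE change, a VM_PFNMAP vma
	 * makes that return -EINVAL instead of hitting a BUG_ON, and the
	 * whole mmap() cleanly fails with that error.
	 */
	return dma_buf_mmap(dmabuf, vma, vma->vm_pgoff);
}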
Re: [PATCH v3 1/1] process_madvise.2: Add process_madvise man page
Hi Michael, On Tue, Feb 2, 2021 at 2:45 AM Michael Kerrisk (man-pages) wrote: > > Hello Suren (and Minchan and Michal) > > Thank you for the revisions! > > I've applied this patch, and done a few light edits. Thanks! > > However, I have a questions about undocumented pieces in *madvise(2)*, > as well as one other question. See below. > > On 2/2/21 6:30 AM, Suren Baghdasaryan wrote: > > Initial version of process_madvise(2) manual page. Initial text was > > extracted from [1], amended after fix [2] and more details added using > > man pages of madvise(2) and process_vm_read(2) as examples. It also > > includes the changes to required permission proposed in [3]. > > > > [1] https://lore.kernel.org/patchwork/patch/1297933/ > > [2] https://lkml.org/lkml/2020/12/8/1282 > > [3] > > https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311 > > > > Signed-off-by: Suren Baghdasaryan > > Reviewed-by: Michal Hocko > > --- > > changes in v2: > > - Changed description of MADV_COLD per Michal Hocko's suggestion > > - Applied fixes suggested by Michael Kerrisk > > changes in v3: > > - Added Michal's Reviewed-by > > - Applied additional fixes suggested by Michael Kerrisk > > > > NAME > > process_madvise - give advice about use of memory to a process > > > > SYNOPSIS > > #include > > > > ssize_t process_madvise(int pidfd, > >const struct iovec *iovec, > >unsigned long vlen, > >int advice, > >unsigned int flags); > > > > DESCRIPTION > > The process_madvise() system call is used to give advice or directions > > to the kernel about the address ranges of another process or the calling > > process. It provides the advice to the address ranges described by iovec > > and vlen. The goal of such advice is to improve system or application > > performance. > > > > The pidfd argument is a PID file descriptor (see pidfd_open(2)) that > > specifies the process to which the advice is to be applied. > > > > The pointer iovec points to an array of iovec structures, defined in > > as: > > > > struct iovec { > > void *iov_base;/* Starting address */ > > size_t iov_len; /* Number of bytes to transfer */ > > }; > > > > The iovec structure describes address ranges beginning at iov_base > > address > > and with the size of iov_len bytes. > > > > The vlen represents the number of elements in the iovec structure. > > > > The advice argument is one of the values listed below. > > > > Linux-specific advice values > > The following Linux-specific advice values have no counterparts in the > > POSIX-specified posix_madvise(3), and may or may not have counterparts > > in the madvise(2) interface available on other implementations. > > > > MADV_COLD (since Linux 5.4.1) > > I just noticed these version numbers now, and thought: they can't be > right (because the system call appeared only in v5.11). So I removed > them. But, of course in another sense the version numbers are (nearly) > right, since these advice values were added for madvise(2) in Linux 5.4. > However, they are not documented in the madvise(2) manual page. Is it > correct to assume that MADV_COLD and MADV_PAGEOUT have exactly the same > meaning in madvise(2) (but just for the calling process, of course)? Correct. They should be added in the madvise(2) man page as well IMHO. > > > Deactive a given range of pages which will make them a more probable > > I changed: s/Deactive/Deactivate/ thanks! > > > reclaim target should there be a memory pressure. This is a > > nondestructive operation. 
The advice might be ignored for some pages > > in the range when it is not applicable. > > > > MADV_PAGEOUT (since Linux 5.4.1) > > Reclaim a given range of pages. This is done to free up memory > > occupied > > by these pages. If a page is anonymous it will be swapped out. If a > > page is file-backed and dirty it will be written back to the backing > > storage. The advice might be ignored for some pages in the range > > when > > it is not applicable. > > [...] > > > The hint might be applied to a part of iovec if one of its elements > > points > > to an invalid memory r
[PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error
Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with WARN_ON_ONCE and returning an error. This is to ensure users of vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage and get an indication of an error without panicking the kernel. This will help identify drivers that need to clear VM_PFNMAP before using the dmabuf system heap, which is moving to use vm_insert_page. Suggested-by: Christoph Hellwig Signed-off-by: Suren Baghdasaryan --- mm/memory.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index feff48e1465a..e503c9801cd9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1827,7 +1827,8 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr, return -EINVAL; if (!(vma->vm_flags & VM_MIXEDMAP)) { BUG_ON(mmap_read_trylock(vma->vm_mm)); - BUG_ON(vma->vm_flags & VM_PFNMAP); + if (WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP)) + return -EINVAL; vma->vm_flags |= VM_MIXEDMAP; } return insert_page(vma, addr, page, vma->vm_page_prot); -- 2.30.0.365.g02bc693789-goog
[PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm
Currently system heap maps its buffers with VM_PFNMAP flag using remap_pfn_range. This results in such buffers not being accounted for in PSS calculations because vm treats this memory as having no page structs. Without page structs there are no counters representing how many processes are mapping a page and therefore PSS calculation is impossible. Historically, ION driver used to map its buffers as VM_PFNMAP areas due to memory carveouts that did not have page structs [1]. That is not the case anymore and it seems there was desire to move away from remap_pfn_range [2]. Dmabuf system heap design inherits this ION behavior and maps its pages using remap_pfn_range even though allocated pages are backed by page structs. Replace remap_pfn_range with vm_insert_page, following Laura's suggestion in [1]. This would allow correct PSS calculation for dmabufs. [1] https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io [2] http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html (sorry, could not find lore links for these discussions) Suggested-by: Laura Abbott Signed-off-by: Suren Baghdasaryan --- v1 posted at: https://lore.kernel.org/patchwork/patch/1372409/ changes in v2: - removed VM_PFNMAP clearing part of the patch, per Minchan and Christoph - created prerequisite patch to replace BUG_ON with WARN_ON_ONCE, per Christoph drivers/dma-buf/heaps/system_heap.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index 17e0e9a68baf..4983f18cc2ce 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -203,8 +203,7 @@ static int system_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma) for_each_sgtable_page(table, &piter, vma->vm_pgoff) { struct page *page = sg_page_iter_page(&piter); - ret = remap_pfn_range(vma, addr, page_to_pfn(page), PAGE_SIZE, - vma->vm_page_prot); + ret = vm_insert_page(vma, addr, page); if (ret) return ret; addr += PAGE_SIZE; -- 2.30.0.365.g02bc693789-goog
Re: [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error
On Tue, Feb 2, 2021 at 5:31 PM Minchan Kim wrote: > > On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote: > > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with > > WARN_ON_ONCE and returning an error. This is to ensure users of the > > vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage > > and get an indication of an error without panicing the kernel. > > This will help identifying drivers that need to clear VM_PFNMAP before > > using dmabuf system heap which is moving to use vm_insert_page. > > > > Suggested-by: Christoph Hellwig > > Signed-off-by: Suren Baghdasaryan > > --- > > mm/memory.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/mm/memory.c b/mm/memory.c > > index feff48e1465a..e503c9801cd9 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -1827,7 +1827,8 @@ int vm_insert_page(struct vm_area_struct *vma, > > unsigned long addr, > > return -EINVAL; > > if (!(vma->vm_flags & VM_MIXEDMAP)) { > > BUG_ON(mmap_read_trylock(vma->vm_mm)); > > Better to replace above BUG_ON with WARN_ON_ONCE, too? If nobody objects I'll do that in the next respin. Thanks! > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
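If that follow-up is applied as well, the successful trylock has to be undone before returning, since the warning fires exactly when the caller did not already hold mmap_lock. A sketch, not a posted patch, of how mm/memory.c's vm_insert_page() might read with both checks converted; insert_page() is the existing static helper in that file:

int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
		   struct page *page)
{
	if (addr < vma->vm_start || addr >= vma->vm_end)
		return -EFAULT;
	if (!page_count(page))
		return -EINVAL;
	if (!(vma->vm_flags & VM_MIXEDMAP)) {
		/* Trylock succeeding means the caller did NOT hold mmap_lock;
		 * unlike BUG_ON, we must drop the lock the check just took. */
		if (WARN_ON_ONCE(mmap_read_trylock(vma->vm_mm))) {
			mmap_read_unlock(vma->vm_mm);
			return -EINVAL;
		}
		if (WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP))
			return -EINVAL;
		vma->vm_flags |= VM_MIXEDMAP;
	}
	return insert_page(vma, addr, page, vma->vm_page_prot);
}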
Re: [PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm
On Tue, Feb 2, 2021 at 5:39 PM Minchan Kim wrote: > > On Tue, Feb 02, 2021 at 04:31:34PM -0800, Suren Baghdasaryan wrote: > > Currently system heap maps its buffers with VM_PFNMAP flag using > > remap_pfn_range. This results in such buffers not being accounted > > for in PSS calculations because vm treats this memory as having no > > page structs. Without page structs there are no counters representing > > how many processes are mapping a page and therefore PSS calculation > > is impossible. > > Historically, ION driver used to map its buffers as VM_PFNMAP areas > > due to memory carveouts that did not have page structs [1]. That > > is not the case anymore and it seems there was desire to move away > > from remap_pfn_range [2]. > > Dmabuf system heap design inherits this ION behavior and maps its > > pages using remap_pfn_range even though allocated pages are backed > > by page structs. > > Replace remap_pfn_range with vm_insert_page, following Laura's suggestion > > in [1]. This would allow correct PSS calculation for dmabufs. > > > > [1] > > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io > > [2] > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html > > (sorry, could not find lore links for these discussions) > > > > Suggested-by: Laura Abbott > > Signed-off-by: Suren Baghdasaryan > Reviewed-by: Minchan Kim > > A note: This patch makes dmabuf system heap accounted as PSS so > if someone has relies on the size, they will see the bloat. > IIRC, there was some debate whether PSS accounting for their > buffer is correct or not. If it'd be a problem, we need to > discuss how to solve it(maybe, vma->vm_flags and reintroduce > remap_pfn_range for them to be respected). I did not see debates about not including *mapped* dmabufs into PSS calculation. I remember people were discussing how to account dmabufs referred only by the FD but that is a different discussion. If the buffer is mapped into the address space of a process then IMHO including it into PSS of that process is not controversial.
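For what it's worth, the effect being discussed is easy to observe from userspace once the heap maps struct-page-backed memory: allocate from the system heap, mmap and touch the buffer, then look at the Rss/Pss fields in smaps. A rough test sketch, assuming /dev/dma_heap/system exists and <linux/dma-heap.h> provides the uapi structures:

/* Rough userspace check of the accounting change above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/dma-heap.h>

int main(void)
{
	int heap = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);
	if (heap < 0) { perror("open heap"); return 1; }

	struct dma_heap_allocation_data alloc = {
		.len = 4 << 20,			/* 4 MiB */
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	if (ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc) < 0) {
		perror("DMA_HEAP_IOCTL_ALLOC");
		return 1;
	}

	void *p = mmap(NULL, alloc.len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       alloc.fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }
	memset(p, 0, alloc.len);		/* fault the pages in */

	/*
	 * With vm_insert_page the mapping is struct-page backed, so Rss/Pss
	 * show up here; with remap_pfn_range they stayed at 0.
	 */
	char cmd[128];
	snprintf(cmd, sizeof(cmd), "grep -A15 ^%lx /proc/self/smaps",
		 (unsigned long)p);
	return system(cmd) ? 1 : 0;
}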
Re: [PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm
On Tue, Feb 2, 2021 at 6:07 PM John Stultz wrote: > > On Tue, Feb 2, 2021 at 4:31 PM Suren Baghdasaryan wrote: > > Currently system heap maps its buffers with VM_PFNMAP flag using > > remap_pfn_range. This results in such buffers not being accounted > > for in PSS calculations because vm treats this memory as having no > > page structs. Without page structs there are no counters representing > > how many processes are mapping a page and therefore PSS calculation > > is impossible. > > Historically, ION driver used to map its buffers as VM_PFNMAP areas > > due to memory carveouts that did not have page structs [1]. That > > is not the case anymore and it seems there was desire to move away > > from remap_pfn_range [2]. > > Dmabuf system heap design inherits this ION behavior and maps its > > pages using remap_pfn_range even though allocated pages are backed > > by page structs. > > Replace remap_pfn_range with vm_insert_page, following Laura's suggestion > > in [1]. This would allow correct PSS calculation for dmabufs. > > > > [1] > > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io > > [2] > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html > > (sorry, could not find lore links for these discussions) > > > > Suggested-by: Laura Abbott > > Signed-off-by: Suren Baghdasaryan > > For consistency, do we need something similar for the cma heap as well? Good question. Let me look closer into it. > > thanks > -john
Re: [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error
On Tue, Feb 2, 2021 at 5:55 PM Matthew Wilcox wrote: > > On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote: > > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with > > WARN_ON_ONCE and returning an error. This is to ensure users of the > > vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage > > and get an indication of an error without panicing the kernel. > > This will help identifying drivers that need to clear VM_PFNMAP before > > using dmabuf system heap which is moving to use vm_insert_page. > > NACK. > > The system may not _panic_, but it is clearly now _broken_. The device > doesn't work, and so the system is useless. You haven't really improved > anything here. Just bloated the kernel with yet another _ONCE variable > that in a normal system will never ever ever be triggered. We had a discussion in https://lore.kernel.org/patchwork/patch/1372409 about how some DRM drivers set up their VMAs with VM_PFNMAP before mapping them. We want to use vm_insert_page instead of remap_pfn_range in the dmabuf heaps so that this memory is visible in PSS. However if a driver that sets VM_PFNMAP tries to use a dmabuf heap, it will step into this BUG_ON. We wanted to catch and gradually fix such drivers but without causing a panic in the process. I hope this clarifies the reasons why I'm making this change and I'm open to other ideas if they would address this issue in a better way.
Re: [PATCH v2 2/2] dma-buf: heaps: Map system heap pages as managed by linux vm
On Wed, Feb 3, 2021 at 12:06 AM Christian König wrote: > > Am 03.02.21 um 03:02 schrieb Suren Baghdasaryan: > > On Tue, Feb 2, 2021 at 5:39 PM Minchan Kim wrote: > >> On Tue, Feb 02, 2021 at 04:31:34PM -0800, Suren Baghdasaryan wrote: > >>> Currently system heap maps its buffers with VM_PFNMAP flag using > >>> remap_pfn_range. This results in such buffers not being accounted > >>> for in PSS calculations because vm treats this memory as having no > >>> page structs. Without page structs there are no counters representing > >>> how many processes are mapping a page and therefore PSS calculation > >>> is impossible. > >>> Historically, ION driver used to map its buffers as VM_PFNMAP areas > >>> due to memory carveouts that did not have page structs [1]. That > >>> is not the case anymore and it seems there was desire to move away > >>> from remap_pfn_range [2]. > >>> Dmabuf system heap design inherits this ION behavior and maps its > >>> pages using remap_pfn_range even though allocated pages are backed > >>> by page structs. > >>> Replace remap_pfn_range with vm_insert_page, following Laura's suggestion > >>> in [1]. This would allow correct PSS calculation for dmabufs. > >>> > >>> [1] > >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdriverdev-devel.linuxdriverproject.narkive.com%2Fv0fJGpaD%2Fusing-ion-memory-for-direct-io&data=04%7C01%7Cchristian.koenig%40amd.com%7Cb4c145b86dd0472c943c08d8c7e7ba4b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637479145389160353%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=W1N%2B%2BlcFDaRSvXdSPe5hPNMRByHfGkU7Uc3cmM3FCTU%3D&reserved=0 > >>> [2] > >>> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fdriverdev.linuxdriverproject.org%2Fpipermail%2Fdriverdev-devel%2F2018-October%2F127519.html&data=04%7C01%7Cchristian.koenig%40amd.com%7Cb4c145b86dd0472c943c08d8c7e7ba4b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637479145389160353%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jQxSzKEr52lUcAIx%2FuBHMJ7yOgof%2FVMlW9%2BB2f%2FoS%2FE%3D&reserved=0 > >>> (sorry, could not find lore links for these discussions) > >>> > >>> Suggested-by: Laura Abbott > >>> Signed-off-by: Suren Baghdasaryan > >> Reviewed-by: Minchan Kim > >> > >> A note: This patch makes dmabuf system heap accounted as PSS so > >> if someone has relies on the size, they will see the bloat. > >> IIRC, there was some debate whether PSS accounting for their > >> buffer is correct or not. If it'd be a problem, we need to > >> discuss how to solve it(maybe, vma->vm_flags and reintroduce > >> remap_pfn_range for them to be respected). > > I did not see debates about not including *mapped* dmabufs into PSS > > calculation. I remember people were discussing how to account dmabufs > > referred only by the FD but that is a different discussion. If the > > buffer is mapped into the address space of a process then IMHO > > including it into PSS of that process is not controversial. > > Well, I think it is. And to be honest this doesn't looks like a good > idea to me since it will eventually lead to double accounting of system > heap DMA-bufs. Thanks for the comment! Could you please expand on this double accounting issue? Do you mean userspace could double account dmabufs because it expects dmabufs not to be part of PSS or is there some in-kernel accounting mechanism that would be broken by this? 
> > As discussed multiple times it is illegal to use the struct page of a > DMA-buf. This case here is a bit special since it is the owner of the > pages which does that, but I'm not sure if this won't cause problems > elsewhere as well. I would be happy to keep things as they are but calculating dmabuf contribution to PSS without struct pages is extremely inefficient and becomes a real pain when we consider the possibilities of partial mappings, when not the entire dmabuf is being mapped. Calculating this would require parsing /proc/pid/maps for the process, finding dmabuf mappings and the size for each one, then parsing /proc/pid/maps for ALL processes in the system to see if the same dmabufs are used by other processes and only then calculating the PSS. I hope that explains the desire to use already existing struct pages to obtain PSS in a much more efficient way. > > A more appropriate solution would be to held processes accountable for > resources they have allocated through device drivers. Are you suggesting some new kernel mechanism to account resources allocated by a process via a driver? If so, any details? > > Regards, > Christian. > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
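To illustrate the cost argument above: the userspace-only alternative has to walk every process and every mapping, and even then it only yields mapped virtual sizes rather than PSS. The "dmabuf" name match below is an assumption about how these mappings appear in /proc/pid/maps.

/* Illustration of the expensive userspace-only approach described above:
 * walk every process, parse every mapping, pick out dma-buf entries.
 * Real accounting would also need per-page share counts, which maps
 * cannot provide. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	unsigned long long total = 0;

	if (!proc) { perror("/proc"); return 1; }

	while ((de = readdir(proc))) {
		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		char path[64], line[512];
		snprintf(path, sizeof(path), "/proc/%s/maps", de->d_name);
		FILE *maps = fopen(path, "r");
		if (!maps)
			continue;

		while (fgets(line, sizeof(line), maps)) {
			unsigned long start, end;
			if (!strstr(line, "dmabuf"))	/* assumption */
				continue;
			if (sscanf(line, "%lx-%lx", &start, &end) == 2)
				total += end - start;	/* mapped size, not PSS */
		}
		fclose(maps);
	}
	closedir(proc);

	printf("dmabuf VA mapped across all processes: %llu bytes\n", total);
	return 0;
}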
Re: [Linaro-mm-sig] [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error
On Wed, Feb 3, 2021 at 12:52 AM Daniel Vetter wrote: > > On Wed, Feb 3, 2021 at 2:57 AM Matthew Wilcox wrote: > > > > On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote: > > > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with > > > WARN_ON_ONCE and returning an error. This is to ensure users of the > > > vm_insert_page that set VM_PFNMAP are notified of the wrong flag usage > > > and get an indication of an error without panicing the kernel. > > > This will help identifying drivers that need to clear VM_PFNMAP before > > > using dmabuf system heap which is moving to use vm_insert_page. > > > > NACK. > > > > The system may not _panic_, but it is clearly now _broken_. The device > > doesn't work, and so the system is useless. You haven't really improved > > anything here. Just bloated the kernel with yet another _ONCE variable > > that in a normal system will never ever ever be triggered. > > Also, what the heck are you doing with your drivers? dma-buf mmap must > call dma_buf_mmap(), even for forwarded/redirected mmaps from driver > char nodes. If that doesn't work we have some issues with the calling > contract for that function, not in vm_insert_page. The particular issue I observed (details were posted in https://lore.kernel.org/patchwork/patch/1372409) is that DRM drivers set VM_PFNMAP flag (via a call to drm_gem_mmap_obj) before calling dma_buf_mmap. Some drivers clear that flag but some don't. I could not find the answer to why VM_PFNMAP is required for dmabuf mappings and maybe someone can explain that here? If there is a reason to set this flag other than historical use of carveout memory then we wanted to catch such cases and fix the drivers that moved to using dmabuf heaps. However maybe there are other reasons and if so I would be very grateful if someone could explain them. That would help me to come up with a better solution. > Finally why exactly do we need to make this switch for system heap? > I've recently looked at gup usage by random drivers, and found a lot > of worrying things there. gup on dma-buf is really bad idea in > general. The reason for the switch is to be able to account dmabufs allocated using dmabuf heaps to the processes that map them. The next patch in this series https://lore.kernel.org/patchwork/patch/1374851 implementing the switch contains more details and there is an active discussion there. Would you mind joining that discussion to keep it in one place? Thanks! > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
Re: [Linaro-mm-sig] [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error
On Wed, Feb 3, 2021 at 1:25 PM Daniel Vetter wrote: > > On Wed, Feb 3, 2021 at 9:29 PM Daniel Vetter wrote: > > > > On Wed, Feb 3, 2021 at 9:20 PM Suren Baghdasaryan wrote: > > > > > > On Wed, Feb 3, 2021 at 12:52 AM Daniel Vetter > > > wrote: > > > > > > > > On Wed, Feb 3, 2021 at 2:57 AM Matthew Wilcox > > > > wrote: > > > > > > > > > > On Tue, Feb 02, 2021 at 04:31:33PM -0800, Suren Baghdasaryan wrote: > > > > > > Replace BUG_ON(vma->vm_flags & VM_PFNMAP) in vm_insert_page with > > > > > > WARN_ON_ONCE and returning an error. This is to ensure users of the > > > > > > vm_insert_page that set VM_PFNMAP are notified of the wrong flag > > > > > > usage > > > > > > and get an indication of an error without panicing the kernel. > > > > > > This will help identifying drivers that need to clear VM_PFNMAP > > > > > > before > > > > > > using dmabuf system heap which is moving to use vm_insert_page. > > > > > > > > > > NACK. > > > > > > > > > > The system may not _panic_, but it is clearly now _broken_. The > > > > > device > > > > > doesn't work, and so the system is useless. You haven't really > > > > > improved > > > > > anything here. Just bloated the kernel with yet another _ONCE > > > > > variable > > > > > that in a normal system will never ever ever be triggered. > > > > > > > > Also, what the heck are you doing with your drivers? dma-buf mmap must > > > > call dma_buf_mmap(), even for forwarded/redirected mmaps from driver > > > > char nodes. If that doesn't work we have some issues with the calling > > > > contract for that function, not in vm_insert_page. > > > > > > The particular issue I observed (details were posted in > > > https://lore.kernel.org/patchwork/patch/1372409) is that DRM drivers > > > set VM_PFNMAP flag (via a call to drm_gem_mmap_obj) before calling > > > dma_buf_mmap. Some drivers clear that flag but some don't. I could not > > > find the answer to why VM_PFNMAP is required for dmabuf mappings and > > > maybe someone can explain that here? > > > If there is a reason to set this flag other than historical use of > > > carveout memory then we wanted to catch such cases and fix the drivers > > > that moved to using dmabuf heaps. However maybe there are other > > > reasons and if so I would be very grateful if someone could explain > > > them. That would help me to come up with a better solution. > > > > > > > Finally why exactly do we need to make this switch for system heap? > > > > I've recently looked at gup usage by random drivers, and found a lot > > > > of worrying things there. gup on dma-buf is really bad idea in > > > > general. > > > > > > The reason for the switch is to be able to account dmabufs allocated > > > using dmabuf heaps to the processes that map them. The next patch in > > > this series https://lore.kernel.org/patchwork/patch/1374851 > > > implementing the switch contains more details and there is an active > > > discussion there. Would you mind joining that discussion to keep it in > > > one place? > > > > How many semi-unrelated buffer accounting schemes does google come up with? > > > > We're at three with this one. > > > > And also we _cannot_ required that all dma-bufs are backed by struct > > page, so requiring struct page to make this work is a no-go. > > > > Second, we do not want to all get_user_pages and friends to work on > > dma-buf, it causes all kinds of pain. Yes on SoC where dma-buf are > > exclusively in system memory you can maybe get away with this, but > > dma-buf is supposed to work in more places than just Android SoCs. 
> > I just realized that vm_inser_page doesn't even work for CMA, it would > upset get_user_pages pretty badly - you're trying to pin a page in > ZONE_MOVEABLE but you can't move it because it's rather special. > VM_SPECIAL is exactly meant to catch this stuff. Thanks for the input, Daniel! Let me think about the cases you pointed out. IMHO, the issue with PSS is the difficulty of calculating this metric without struct page usage. I don't think that problem becomes easier if we use cgroups or any other API. I wanted to enable existing PSS calculation mechanisms for the dmabufs known to be backed by struct pages (since we know how the heap allocated that memory), but sounds like this would lead to problems that I did not consider. Thanks, Suren. > -Daniel > > > If you want to account dma-bufs, and gpu memory in general, I'd say > > the solid solution is cgroups. There's patches floating around. And > > given that Google Android can't even agree internally on what exactly > > you want I'd say we just need to cut over to that and make it happen. > > > > Cheers, Daniel > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm
On Thu, Jan 28, 2021 at 1:13 AM Christoph Hellwig wrote: > > On Thu, Jan 28, 2021 at 12:38:17AM -0800, Suren Baghdasaryan wrote: > > Currently system heap maps its buffers with VM_PFNMAP flag using > > remap_pfn_range. This results in such buffers not being accounted > > for in PSS calculations because vm treats this memory as having no > > page structs. Without page structs there are no counters representing > > how many processes are mapping a page and therefore PSS calculation > > is impossible. > > Historically, ION driver used to map its buffers as VM_PFNMAP areas > > due to memory carveouts that did not have page structs [1]. That > > is not the case anymore and it seems there was desire to move away > > from remap_pfn_range [2]. > > Dmabuf system heap design inherits this ION behavior and maps its > > pages using remap_pfn_range even though allocated pages are backed > > by page structs. > > Clear VM_IO and VM_PFNMAP flags when mapping memory allocated by the > > system heap and replace remap_pfn_range with vm_insert_page, following > > Laura's suggestion in [1]. This would allow correct PSS calculation > > for dmabufs. > > > > [1] > > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io > > [2] > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html > > (sorry, could not find lore links for these discussions) > > > > Suggested-by: Laura Abbott > > Signed-off-by: Suren Baghdasaryan > > --- > > drivers/dma-buf/heaps/system_heap.c | 6 -- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/dma-buf/heaps/system_heap.c > > b/drivers/dma-buf/heaps/system_heap.c > > index 17e0e9a68baf..0e92e42b2251 100644 > > --- a/drivers/dma-buf/heaps/system_heap.c > > +++ b/drivers/dma-buf/heaps/system_heap.c > > @@ -200,11 +200,13 @@ static int system_heap_mmap(struct dma_buf *dmabuf, > > struct vm_area_struct *vma) > > struct sg_page_iter piter; > > int ret; > > > > + /* All pages are backed by a "struct page" */ > > + vma->vm_flags &= ~VM_PFNMAP; > > Why do we clear this flag? It shouldn't even be set here as far as I > can tell. Thanks for the question, Christoph. I tracked down that flag being set by drm_gem_mmap_obj() which DRM drivers use to "Set up the VMA to prepare mapping of the GEM object" (according to drm_gem_mmap_obj comments). I also see a pattern in several DMR drivers to call drm_gem_mmap_obj()/drm_gem_mmap(), then clear VM_PFNMAP and then map the VMA (for example here: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/rockchip/rockchip_drm_gem.c#L246). I thought that dmabuf allocator (in this case the system heap) would be the right place to set these flags because it controls how memory is allocated before mapping. However it's quite possible that I'm missing the real reason for VM_PFNMAP being set in drm_gem_mmap_obj() before dma_buf_mmap() is called. I could not find the answer to that, so I hope someone here can clarify that.
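The driver pattern referred to above (as in the rockchip example) boils down to something like the following sketch; only the drm_gem_* helper is a real API, the rest is hypothetical.

/* Sketch: let the GEM helper set up the vma, then drop VM_PFNMAP before
 * doing a struct-page based mapping. example_insert_pages() is
 * hypothetical. */
#include <drm/drm_gem.h>
#include <linux/mm.h>

static int example_insert_pages(struct drm_gem_object *obj,
				struct vm_area_struct *vma); /* hypothetical */

static int example_gem_object_mmap(struct drm_gem_object *obj,
				   struct vm_area_struct *vma)
{
	int ret;

	/* Sets VM_IO | VM_PFNMAP (among other flags) on the vma. */
	ret = drm_gem_mmap_obj(obj, obj->size, vma);
	if (ret)
		return ret;

	/*
	 * This buffer is backed by struct pages, so clear the PFN-map flag
	 * before a vm_insert_page() based mapping; otherwise the
	 * WARN_ON_ONCE path discussed in this thread rejects it.
	 */
	vma->vm_flags &= ~VM_PFNMAP;
	vma->vm_pgoff = 0;

	return example_insert_pages(obj, vma);
}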
Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise
On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team wrote: > > On Wed 20-01-21 14:17:39, Jann Horn wrote: > > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko wrote: > > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote: > > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov wrote: > > > > > > > > > > On 01/12, Michal Hocko wrote: > > > > > > > > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote: > > > > > > > > > > > > > What we want is the ability for one process to influence another > > > > > > > process > > > > > > > in order to optimize performance across the entire system while > > > > > > > leaving > > > > > > > the security boundary intact. > > > > > > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ > > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR > > > > > > > metadata > > > > > > > and CAP_SYS_NICE for influencing process performance. > > > > > > > > > > > > I have to say that ptrace modes are rather obscure to me. So I > > > > > > cannot > > > > > > really judge whether MODE_READ is sufficient. My understanding has > > > > > > always been that this is requred to RO access to the address space. > > > > > > But > > > > > > this operation clearly has a visible side effect. Do we have any > > > > > > actual > > > > > > documentation for the existing modes? > > > > > > > > > > > > I would be really curious to hear from Jann and Oleg (now Cced). > > > > > > > > > > Can't comment, sorry. I never understood these security checks and > > > > > never tried. > > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I have no > > > > > idea what > > > > > is the difference. > > > > Yama in particular only does its checks on ATTACH and ignores READ, > > that's the difference you're probably most likely to encounter on a > > normal desktop system, since some distros turn Yama on by default. > > Basically the idea there is that running "gdb -p $pid" or "strace -p > > $pid" as a normal user will usually fail, but reading /proc/$pid/maps > > still works; so you can see things like detailed memory usage > > information and such, but you're not supposed to be able to directly > > peek into a running SSH client and inject data into the existing SSH > > connection, or steal the cryptographic keys for the current > > connection, or something like that. > > > > > > I haven't seen a written explanation on ptrace modes but when I > > > > consulted Jann his explanation was: > > > > > > > > PTRACE_MODE_READ means you can inspect metadata about processes with > > > > the specified domain, across UID boundaries. > > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with the > > > > specified domain, across UID boundaries. > > > > > > Maybe this would be a good start to document expectations. Some more > > > practical examples where the difference is visible would be great as > > > well. > > > > Before documenting the behavior, it would be a good idea to figure out > > what to do with perf_event_open(). That one's weird in that it only > > requires PTRACE_MODE_READ, but actually allows you to sample stuff > > like userspace stack and register contents (if perf_event_paranoid is > > 1 or 2). Maybe for SELinux things (and maybe also for Yama), there > > should be a level in between that allows fully inspecting the process > > (for purposes like profiling) but without the ability to corrupt its > > memory or registers or things like that. Or maybe perf_event_open() > > should just use the ATTACH mode. 
> > Thanks for the clarification. I still cannot say I would have a good > mental picture. Having something in Documentation/core-api/ sounds > really needed. Wrt to perf_event_open it sounds really odd it can do > more than other places restrict indeed. Something for the respective > maintainer but I strongly suspect people simply copy the pattern from > other places because the expected semantic is not really clear. > Sorry, back to the matters of this patch. Are there any actionable items for me to take care of before it can be accepted? The only request from Andrew to write a man page is being worked on at https://lore.kernel.org/linux-mm/20210120202337.1481402-1-sur...@google.com/ and I'll follow up with the next version. I also CC'ed stable@ for this to be included into 5.10 per Andrew's request. That CC was lost at some point, so CC'ing again. I do not see anything else on this patch to fix. Please chime in if there are any more concerns, otherwise I would ask Andrew to take it into mm-tree and stable@ to apply it to 5.10. Thanks! > -- > Michal Hocko > SUSE Labs > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
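For readers following the thread, the permission model under discussion for process_madvise() amounts to roughly this sketch; it is not the exact patch, and the choice between the REALCREDS and FSCREDS ptrace variants is an assumption here.

/* Sketch of the proposed check: read-level ptrace access guards
 * address-space metadata (ASLR), while CAP_SYS_NICE covers the
 * performance side effects of the advice. */
#include <linux/capability.h>
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>

static struct mm_struct *madvise_get_target_mm(struct task_struct *task)
{
	struct mm_struct *mm;

	/* PTRACE_MODE_READ* instead of PTRACE_MODE_ATTACH*. */
	mm = mm_access(task, PTRACE_MODE_READ_REALCREDS);
	if (IS_ERR_OR_NULL(mm))
		return mm ? mm : ERR_PTR(-ESRCH);

	/* Still require CAP_SYS_NICE for influencing the target's performance. */
	if (!capable(CAP_SYS_NICE)) {
		mmput(mm);
		return ERR_PTR(-EPERM);
	}
	return mm;
}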
Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page
On Thu, Jan 28, 2021 at 12:31 PM Michael Kerrisk (man-pages) wrote: > > Hello Suren, > > On 1/28/21 7:40 PM, Suren Baghdasaryan wrote: > > On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages) > > wrote: > >> > >> Hello Suren, > >> > >> Thank you for writing this page! Some comments below. > > > > Thanks for the review! > > Couple questions below and I'll respin the new version once they are > > clarified. > > Okay. See below. > > >> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan wrote: > >>> > > [...] > > Thanks for all the acks. That let's me know that you saw what I said. > > >>> RETURN VALUE > >>> On success, process_madvise() returns the number of bytes advised. > >>> This > >>> return value may be less than the total number of requested bytes, if > >>> an > >>> error occurred. The caller should check return value to determine > >>> whether > >>> a partial advice occurred. > >> > >> So there are three return values possible, > > > > Ok, I think I see your point. How about this instead: > > Well, I'm glad you saw it, because I forgot to finish it. But yes, > you understood what I forgot to say. > > > RETURN VALUE > > On success, process_madvise() returns the number of bytes advised. This > > return value may be less than the total number of requested bytes, if > > an > > error occurred after some iovec elements were already processed. The > > caller > > should check the return value to determine whether a partial > > advice occurred. > > > > On error, -1 is returned and errno is set appropriately. > > We recently standardized some wording here: > s/appropriately/to indicate the error/. ack > > > >>> +.PP > >>> +The pointer > >>> +.I iovec > >>> +points to an array of iovec structures, defined in > >> > >> "iovec" should be formatted as > >> > >> .I iovec > > > > I think it is formatted that way above. What am I missing? > > But also in "an array of iovec structures"... ack > > > BTW, where should I be using .I vs .IR? I was looking for an answer > > but could not find it. > > .B / .I == bold/italic this line > .BR / .IR == alternate bold/italic with normal (Roman) font. > > So: > .I iovec > .I iovec , # so that comma is not italic > .BR process_madvise () > etc. Aha! Got it now. It's clear after your example. Thanks! > > [...] > > >>> +.I iovec > >>> +if one of its elements points to an invalid memory > >>> +region in the remote process. No further elements will be > >>> +processed beyond that point. > >>> +.PP > >>> +Permission to provide a hint to external process is governed by a > >>> +ptrace access mode > >>> +.B PTRACE_MODE_READ_REALCREDS > >>> +check; see > >>> +.BR ptrace (2) > >>> +and > >>> +.B CAP_SYS_ADMIN > >>> +capability that caller should have in order to affect performance > >>> +of an external process. > >> > >> The preceding sentence is garbled. Missing words? > > > > Maybe I worded it incorrectly. What I need to say here is that the > > caller should have both PTRACE_MODE_READ_REALCREDS credentials and > > CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted > > from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and > > tried adding the second one to it, obviously unsuccessfully. Any > > advice on how to fix that? > > I think you already got pretty close. How about: > > [[ > Permission to provide a hint to another process is governed by a > ptrace access mode > .B PTRACE_MODE_READ_REALCREDS > check (see > BR ptrace (2)); > in addition, the caller must have the > .B CAP_SYS_ADMIN > capability. > ]] Perfect! I'll use that. 
> > [...] > > >>> +.TP > >>> +.B ESRCH > >>> +No process with ID > >>> +.I pidfd > >>> +exists. > >> > >> Should this maybe be: > >> [[ > >> The target process does not exist (i.e., it has terminated and > >> been waited on). > >> ]] > >> > >> See pidfd_send_signal(2). > > > > I "borrowed" mine from > > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but > > either one sounds good to me. Maybe for pidfd_send_signal the wording > > about termination is more important. Anyway, it's up to you. Just let > > me know which one to use. > > I think the pidfd_send_signal(2) wording fits better. ack, will use pidfd_send_signal(2) version. > > [...] > > Thanks, > > Michael I'll include your and Michal's suggestions and will post the next version later today or tomorrow morning. Thanks for the guidance! > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/
[PATCH v2 1/1] process_madvise.2: Add process_madvise man page
Initial version of process_madvise(2) manual page. Initial text was extracted from [1], amended after fix [2] and more details added using man pages of madvise(2) and process_vm_read(2) as examples. It also includes the changes to required permission proposed in [3]. [1] https://lore.kernel.org/patchwork/patch/1297933/ [2] https://lkml.org/lkml/2020/12/8/1282 [3] https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311 Signed-off-by: Suren Baghdasaryan --- changes in v2: - Changed description of MADV_COLD per Michal Hocko's suggestion - Appled fixes suggested by Michael Kerrisk NAME process_madvise - give advice about use of memory to a process SYNOPSIS #include ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges of other process as well as of the calling process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd argument is a PID file descriptor (see pidofd_open(2)) that specifies the process to which the advice is to be applied. The pointer iovec points to an array of iovec structures, defined in as: struct iovec { void *iov_base;/* Starting address */ size_t iov_len; /* Number of bytes to transfer */ }; The iovec structure describes address ranges beginning at iov_base address and with the size of iov_len bytes. The vlen represents the number of elements in the iovec structure. The advice argument is one of the values listed below. Linux-specific advice values The following Linux-specific advice values have no counterparts in the POSIX-specified posix_madvise(3), and may or may not have counterparts in the madvise(2) interface available on other implementations. MADV_COLD (since Linux 5.4.1) Deactive a given range of pages which will make them a more probable reclaim target should there be a memory pressure. This is a non- destructive operation. The advice might be ignored for some pages in the range when it is not applicable. MADV_PAGEOUT (since Linux 5.4.1) Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous it will be swapped out. If a page is file-backed and dirty it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable. The flags argument is reserved for future use; currently, this argument must be specified as 0. The value specified in the vlen argument must be less than or equal to IOV_MAX (defined in or accessible via the call sysconf(_SC_IOV_MAX)). The vlen and iovec arguments are checked before applying any hints. If the vlen is too big, or iovec is invalid, an error will be returned immediately. The hint might be applied to a part of iovec if one of its elements points to an invalid memory region in the remote process. No further elements will be processed beyond that point. Permission to provide a hint to another process is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition, the caller must have the CAP_SYS_ADMIN capability due to performance implications of applying the hint. RETURN VALUE On success, process_madvise() returns the number of bytes advised. 
This return value may be less than the total number of requested bytes, if an error occurred after some iovec elements were already processed. The caller should check the return value to determine whether a partial advice occurred. On error, -1 is returned and errno is set to indicate the error. ERRORS EFAULT The memory described by iovec is outside the accessible address space of the process referred to by pidfd. EINVAL flags is not 0. EINVAL The sum of the iov_len values of iovec overflows a ssize_t value. EINVAL vlen is too large. ENOMEM Could not allocate memory for internal copies of the iovec structures. EPERM The caller does not have permission to access the address space of the process pidfd. ESRCH The target process does not exist (i.e., it has terminated and been waited on). EBADF pidfd is not a valid PID file descriptor. VERSIONS This system call first appeared in Linux 5.10. Support for this system call is optional, depending on the setting of the CONFIG_ADVISE_SYSCALLS configuration option. SEE ALSO madv
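For reference, here is a minimal sketch of how a caller might use the interface described above. It is hedged: it assumes a Linux 5.10+ setup whose kernel headers define __NR_pidfd_open and __NR_process_madvise and a glibc without a process_madvise() wrapper, so both calls go through syscall(2); the target PID and the remote address range are hypothetical placeholders that a real caller would obtain elsewhere (for example from /proc/<pid>/smaps).

/* Sketch only: ask the kernel to reclaim one range in another process.
 * The pid, address and length below are made-up values for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21  /* MADV_PAGEOUT value from the uapi headers */
#endif

int main(void)
{
    pid_t target = 1234;                          /* hypothetical PID */
    void *addr = (void *)0x7f0000000000;          /* hypothetical range */
    size_t len = 2 * 1024 * 1024;

    int pidfd = syscall(__NR_pidfd_open, target, 0);
    if (pidfd < 0) {
        perror("pidfd_open");
        return EXIT_FAILURE;
    }

    struct iovec vec = { .iov_base = addr, .iov_len = len };
    ssize_t advised = syscall(__NR_process_madvise, pidfd, &vec, 1,
                              MADV_PAGEOUT, 0);
    if (advised < 0)
        perror("process_madvise");
    else if ((size_t)advised < len)
        fprintf(stderr, "partial advice: %zd of %zu bytes\n", advised, len);

    close(pidfd);
    return 0;
}

Per the permission model in the page, this only succeeds if the caller passes the PTRACE_MODE_READ check against the target and has CAP_SYS_ADMIN.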
Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise
On Thu, Jan 28, 2021 at 11:51 AM Suren Baghdasaryan wrote: > > On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team > wrote: > > > > On Wed 20-01-21 14:17:39, Jann Horn wrote: > > > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko wrote: > > > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote: > > > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov wrote: > > > > > > > > > > > > On 01/12, Michal Hocko wrote: > > > > > > > > > > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote: > > > > > > > > > > > > > > > What we want is the ability for one process to influence > > > > > > > > another process > > > > > > > > in order to optimize performance across the entire system while > > > > > > > > leaving > > > > > > > > the security boundary intact. > > > > > > > > Replace PTRACE_MODE_ATTACH with a combination of > > > > > > > > PTRACE_MODE_READ > > > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR > > > > > > > > metadata > > > > > > > > and CAP_SYS_NICE for influencing process performance. > > > > > > > > > > > > > > I have to say that ptrace modes are rather obscure to me. So I > > > > > > > cannot > > > > > > > really judge whether MODE_READ is sufficient. My understanding has > > > > > > > always been that this is requred to RO access to the address > > > > > > > space. But > > > > > > > this operation clearly has a visible side effect. Do we have any > > > > > > > actual > > > > > > > documentation for the existing modes? > > > > > > > > > > > > > > I would be really curious to hear from Jann and Oleg (now Cced). > > > > > > > > > > > > Can't comment, sorry. I never understood these security checks and > > > > > > never tried. > > > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I have > > > > > > no idea what > > > > > > is the difference. > > > > > > Yama in particular only does its checks on ATTACH and ignores READ, > > > that's the difference you're probably most likely to encounter on a > > > normal desktop system, since some distros turn Yama on by default. > > > Basically the idea there is that running "gdb -p $pid" or "strace -p > > > $pid" as a normal user will usually fail, but reading /proc/$pid/maps > > > still works; so you can see things like detailed memory usage > > > information and such, but you're not supposed to be able to directly > > > peek into a running SSH client and inject data into the existing SSH > > > connection, or steal the cryptographic keys for the current > > > connection, or something like that. > > > > > > > > I haven't seen a written explanation on ptrace modes but when I > > > > > consulted Jann his explanation was: > > > > > > > > > > PTRACE_MODE_READ means you can inspect metadata about processes with > > > > > the specified domain, across UID boundaries. > > > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with the > > > > > specified domain, across UID boundaries. > > > > > > > > Maybe this would be a good start to document expectations. Some more > > > > practical examples where the difference is visible would be great as > > > > well. > > > > > > Before documenting the behavior, it would be a good idea to figure out > > > what to do with perf_event_open(). That one's weird in that it only > > > requires PTRACE_MODE_READ, but actually allows you to sample stuff > > > like userspace stack and register contents (if perf_event_paranoid is > > > 1 or 2). 
Maybe for SELinux things (and maybe also for Yama), there > > > should be a level in between that allows fully inspecting the process > > > (for purposes like profiling) but without the ability to corrupt its > > > memory or registers or things like that. Or maybe perf_event_open() > > > should just use the ATTACH mode. > > > > Thanks for the clarification. I still cannot sa
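To make the READ/ATTACH distinction above concrete, here is a small userspace experiment (my illustration, not part of any patch): on a desktop with Yama's ptrace_scope set to 1, reading /proc/<pid>/maps of an unrelated same-UID process usually still works (roughly PTRACE_MODE_READ territory), while PTRACE_ATTACH to the same process is refused.

/* Sketch: contrast "read metadata" with "fully attach" for a target pid.
 * On a Yama-restricted system (/proc/sys/kernel/yama/ptrace_scope = 1)
 * the first step usually succeeds for a same-UID non-child process,
 * while the second fails with EPERM. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return EXIT_FAILURE;
    }
    pid_t pid = (pid_t)strtol(argv[1], NULL, 10);

    /* Roughly PTRACE_MODE_READ territory: procfs metadata. */
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/maps", (long)pid);
    FILE *f = fopen(path, "r");
    printf("open %s: %s\n", path, f ? "ok" : strerror(errno));
    if (f)
        fclose(f);

    /* PTRACE_MODE_ATTACH territory: full control of the task. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
        printf("PTRACE_ATTACH: %s\n", strerror(errno));
    } else {
        printf("PTRACE_ATTACH: ok\n");
        waitpid(pid, NULL, 0);
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
    }
    return 0;
}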
Re: [PATCH 1/1] process_madvise.2: Add process_madvise man page
On Thu, Jan 28, 2021 at 12:31 PM Michael Kerrisk (man-pages) wrote: > > Hello Suren, > > On 1/28/21 7:40 PM, Suren Baghdasaryan wrote: > > On Thu, Jan 28, 2021 at 4:24 AM Michael Kerrisk (man-pages) > > wrote: > >> > >> Hello Suren, > >> > >> Thank you for writing this page! Some comments below. > > > > Thanks for the review! > > Couple questions below and I'll respin the new version once they are > > clarified. > > Okay. See below. > > >> On Wed, 20 Jan 2021 at 21:36, Suren Baghdasaryan wrote: > >>> > > [...] > > Thanks for all the acks. That let's me know that you saw what I said. > > >>> RETURN VALUE > >>> On success, process_madvise() returns the number of bytes advised. > >>> This > >>> return value may be less than the total number of requested bytes, if > >>> an > >>> error occurred. The caller should check return value to determine > >>> whether > >>> a partial advice occurred. > >> > >> So there are three return values possible, > > > > Ok, I think I see your point. How about this instead: > > Well, I'm glad you saw it, because I forgot to finish it. But yes, > you understood what I forgot to say. > > > RETURN VALUE > > On success, process_madvise() returns the number of bytes advised. This > > return value may be less than the total number of requested bytes, if > > an > > error occurred after some iovec elements were already processed. The > > caller > > should check the return value to determine whether a partial > > advice occurred. > > > > On error, -1 is returned and errno is set appropriately. > > We recently standardized some wording here: > s/appropriately/to indicate the error/. > > > >>> +.PP > >>> +The pointer > >>> +.I iovec > >>> +points to an array of iovec structures, defined in > >> > >> "iovec" should be formatted as > >> > >> .I iovec > > > > I think it is formatted that way above. What am I missing? > > But also in "an array of iovec structures"... > > > BTW, where should I be using .I vs .IR? I was looking for an answer > > but could not find it. > > .B / .I == bold/italic this line > .BR / .IR == alternate bold/italic with normal (Roman) font. > > So: > .I iovec > .I iovec , # so that comma is not italic > .BR process_madvise () > etc. > > [...] > > >>> +.I iovec > >>> +if one of its elements points to an invalid memory > >>> +region in the remote process. No further elements will be > >>> +processed beyond that point. > >>> +.PP > >>> +Permission to provide a hint to external process is governed by a > >>> +ptrace access mode > >>> +.B PTRACE_MODE_READ_REALCREDS > >>> +check; see > >>> +.BR ptrace (2) > >>> +and > >>> +.B CAP_SYS_ADMIN > >>> +capability that caller should have in order to affect performance > >>> +of an external process. > >> > >> The preceding sentence is garbled. Missing words? > > > > Maybe I worded it incorrectly. What I need to say here is that the > > caller should have both PTRACE_MODE_READ_REALCREDS credentials and > > CAP_SYS_ADMIN capability. The first part I shamelessly copy/pasted > > from https://man7.org/linux/man-pages/man2/process_vm_readv.2.html and > > tried adding the second one to it, obviously unsuccessfully. Any > > advice on how to fix that? > > I think you already got pretty close. How about: > > [[ > Permission to provide a hint to another process is governed by a > ptrace access mode > .B PTRACE_MODE_READ_REALCREDS > check (see > BR ptrace (2)); > in addition, the caller must have the > .B CAP_SYS_ADMIN > capability. In V2 I explanded a bit this part to explain why CAP_SYS_ADMIN is needed. 
There were questions about that during my patch review which adds this requirement (https://lore.kernel.org/patchwork/patch/1363605), so I thought a short explanation would be useful. > ]] > > [...] > > >>> +.TP > >>> +.B ESRCH > >>> +No process with ID > >>> +.I pidfd > >>> +exists. > >> > >> Should this maybe be: > >> [[ > >> The target process does not exist (i.e., it has terminated and > >> been waited on). > >> ]] > >> > >> See pidfd_send_signal(2). > > > > I "borrowed" mine from > > https://man7.org/linux/man-pages/man2/process_vm_readv.2.html but > > either one sounds good to me. Maybe for pidfd_send_signal the wording > > about termination is more important. Anyway, it's up to you. Just let > > me know which one to use. > > I think the pidfd_send_signal(2) wording fits better. > > [...] > > Thanks, > > Michael > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH v2 1/1] process_madvise.2: Add process_madvise man page
On Fri, Jan 29, 2021 at 1:13 AM 'Michal Hocko' via kernel-team wrote: > > On Thu 28-01-21 23:03:40, Suren Baghdasaryan wrote: > > Initial version of process_madvise(2) manual page. Initial text was > > extracted from [1], amended after fix [2] and more details added using > > man pages of madvise(2) and process_vm_read(2) as examples. It also > > includes the changes to required permission proposed in [3]. > > > > [1] https://lore.kernel.org/patchwork/patch/1297933/ > > [2] https://lkml.org/lkml/2020/12/8/1282 > > [3] > > https://patchwork.kernel.org/project/selinux/patch/20210111170622.2613577-1-sur...@google.com/#23888311 > > > > Signed-off-by: Suren Baghdasaryan > > Reviewed-by: Michal Hocko Thanks! > Thanks! > > > --- > > changes in v2: > > - Changed description of MADV_COLD per Michal Hocko's suggestion > > - Appled fixes suggested by Michael Kerrisk > > > > NAME > > process_madvise - give advice about use of memory to a process > > > > SYNOPSIS > > #include > > > > ssize_t process_madvise(int pidfd, > >const struct iovec *iovec, > >unsigned long vlen, > >int advice, > >unsigned int flags); > > > > DESCRIPTION > > The process_madvise() system call is used to give advice or directions > > to the kernel about the address ranges of other process as well as of > > the calling process. It provides the advice to address ranges of process > > described by iovec and vlen. The goal of such advice is to improve > > system > > or application performance. > > > > The pidfd argument is a PID file descriptor (see pidofd_open(2)) that > > specifies the process to which the advice is to be applied. > > > > The pointer iovec points to an array of iovec structures, defined in > > as: > > > > struct iovec { > > void *iov_base;/* Starting address */ > > size_t iov_len; /* Number of bytes to transfer */ > > }; > > > > The iovec structure describes address ranges beginning at iov_base > > address > > and with the size of iov_len bytes. > > > > The vlen represents the number of elements in the iovec structure. > > > > The advice argument is one of the values listed below. > > > > Linux-specific advice values > > The following Linux-specific advice values have no counterparts in the > > POSIX-specified posix_madvise(3), and may or may not have counterparts > > in the madvise(2) interface available on other implementations. > > > > MADV_COLD (since Linux 5.4.1) > > Deactive a given range of pages which will make them a more probable > > reclaim target should there be a memory pressure. This is a non- > > destructive operation. The advice might be ignored for some pages in > > the range when it is not applicable. > > > > MADV_PAGEOUT (since Linux 5.4.1) > > Reclaim a given range of pages. This is done to free up memory > > occupied > > by these pages. If a page is anonymous it will be swapped out. If a > > page is file-backed and dirty it will be written back to the backing > > storage. The advice might be ignored for some pages in the range > > when > > it is not applicable. > > > > The flags argument is reserved for future use; currently, this argument > > must be specified as 0. > > > > The value specified in the vlen argument must be less than or equal to > > IOV_MAX (defined in or accessible via the call > > sysconf(_SC_IOV_MAX)). > > > > The vlen and iovec arguments are checked before applying any hints. If > > the vlen is too big, or iovec is invalid, an error will be returned > > immediately. 
> > > > The hint might be applied to a part of iovec if one of its elements > > points > > to an invalid memory region in the remote process. No further elements > > will > > be processed beyond that point. > > > > Permission to provide a hint to another process is governed by a ptrace > > access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in > > addition, > > the caller must have the CAP_SYS_ADMIN capability due to performance > > implications of applying the hint. > > > > RETURN VALUE > > On success, process_madvise(
Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm
On Thu, Jan 28, 2021 at 11:00 AM Suren Baghdasaryan wrote: > > On Thu, Jan 28, 2021 at 10:19 AM Minchan Kim wrote: > > > > On Thu, Jan 28, 2021 at 09:52:59AM -0800, Suren Baghdasaryan wrote: > > > On Thu, Jan 28, 2021 at 1:13 AM Christoph Hellwig > > > wrote: > > > > > > > > On Thu, Jan 28, 2021 at 12:38:17AM -0800, Suren Baghdasaryan wrote: > > > > > Currently system heap maps its buffers with VM_PFNMAP flag using > > > > > remap_pfn_range. This results in such buffers not being accounted > > > > > for in PSS calculations because vm treats this memory as having no > > > > > page structs. Without page structs there are no counters representing > > > > > how many processes are mapping a page and therefore PSS calculation > > > > > is impossible. > > > > > Historically, ION driver used to map its buffers as VM_PFNMAP areas > > > > > due to memory carveouts that did not have page structs [1]. That > > > > > is not the case anymore and it seems there was desire to move away > > > > > from remap_pfn_range [2]. > > > > > Dmabuf system heap design inherits this ION behavior and maps its > > > > > pages using remap_pfn_range even though allocated pages are backed > > > > > by page structs. > > > > > Clear VM_IO and VM_PFNMAP flags when mapping memory allocated by the > > > > > system heap and replace remap_pfn_range with vm_insert_page, following > > > > > Laura's suggestion in [1]. This would allow correct PSS calculation > > > > > for dmabufs. > > > > > > > > > > [1] > > > > > https://driverdev-devel.linuxdriverproject.narkive.com/v0fJGpaD/using-ion-memory-for-direct-io > > > > > [2] > > > > > http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2018-October/127519.html > > > > > (sorry, could not find lore links for these discussions) > > > > > > > > > > Suggested-by: Laura Abbott > > > > > Signed-off-by: Suren Baghdasaryan > > > > > --- > > > > > drivers/dma-buf/heaps/system_heap.c | 6 -- > > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/drivers/dma-buf/heaps/system_heap.c > > > > > b/drivers/dma-buf/heaps/system_heap.c > > > > > index 17e0e9a68baf..0e92e42b2251 100644 > > > > > --- a/drivers/dma-buf/heaps/system_heap.c > > > > > +++ b/drivers/dma-buf/heaps/system_heap.c > > > > > @@ -200,11 +200,13 @@ static int system_heap_mmap(struct dma_buf > > > > > *dmabuf, struct vm_area_struct *vma) > > > > > struct sg_page_iter piter; > > > > > int ret; > > > > > > > > > > + /* All pages are backed by a "struct page" */ > > > > > + vma->vm_flags &= ~VM_PFNMAP; > > > > > > > > Why do we clear this flag? It shouldn't even be set here as far as I > > > > can tell. > > > > > > Thanks for the question, Christoph. > > > I tracked down that flag being set by drm_gem_mmap_obj() which DRM > > > drivers use to "Set up the VMA to prepare mapping of the GEM object" > > > (according to drm_gem_mmap_obj comments). I also see a pattern in > > > several DMR drivers to call drm_gem_mmap_obj()/drm_gem_mmap(), then > > > clear VM_PFNMAP and then map the VMA (for example here: > > > https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/rockchip/rockchip_drm_gem.c#L246). > > > I thought that dmabuf allocator (in this case the system heap) would > > > be the right place to set these flags because it controls how memory > > > is allocated before mapping. However it's quite possible that I'm > > > > However, you're not setting but removing a flag under the caller. 
> > It's different with appending more flags(e.g., removing condition > > vs adding more conditions). If we should remove the flag, caller > > didn't need to set it from the beginning. Hiding it under this API > > continue to make wrong usecase in future. > > Which takes us back to the question of why VM_PFNMAP is being set by > the caller in the first place. > > > > > > missing the real reason for VM_PFNMAP being set in drm_gem_mmap_obj() > > > before dma_buf_mmap() is called. I could not find the answer to that, > > > so I hope so
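For readers without the full diff handy, the shape of the change under discussion is roughly the following sketch, reconstructed from the quoted hunk and the surrounding description rather than copied from the merged code: since every page in the system heap's sg_table is backed by a struct page, the mmap handler can drop VM_PFNMAP and install the pages with vm_insert_page(), which is what makes normal refcounting and PSS accounting possible.

/* Sketch of the discussed system_heap_mmap(): map struct-page-backed
 * pages with vm_insert_page() instead of remap_pfn_range(). Names follow
 * drivers/dma-buf/heaps/system_heap.c of that era; illustrative only. */
static int system_heap_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
{
    struct system_heap_buffer *buffer = dmabuf->priv;
    struct sg_table *table = &buffer->sg_table;
    unsigned long addr = vma->vm_start;
    struct sg_page_iter piter;
    int ret;

    /* All pages are backed by a "struct page" */
    vma->vm_flags &= ~VM_PFNMAP;

    for_each_sgtable_page(table, &piter, vma->vm_pgoff) {
        struct page *page = sg_page_iter_page(&piter);

        ret = vm_insert_page(vma, addr, page);
        if (ret)
            return ret;
        addr += PAGE_SIZE;
        if (addr >= vma->vm_end)
            return 0;
    }
    return 0;
}

Note this relies on the heap's buffers always being backed by struct pages, which holds for the system heap but not for dma-buf exporters in general.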
Re: [PATCH v2 1/1] process_madvise.2: Add process_madvise man page
On Sat, Jan 30, 2021 at 1:34 PM Michael Kerrisk (man-pages) wrote: > > Hello Suren, > > Thank you for the revisions! Just a few more comments: all pretty small > stuff (many points that I overlooked the first time rround), since the > page already looks pretty good by now. > > Again, thanks for the rendered version. As before, I've added my > comments to the page source. Hi Michael, Thanks for reviewing! > > On 1/29/21 8:03 AM, Suren Baghdasaryan wrote: > > Initial version of process_madvise(2) manual page. Initial text was > > extracted from [1], amended after fix [2] and more details added using > > man pages of madvise(2) and process_vm_read(2) as examples. It also > > includes the changes to required permission proposed in [3]. > > > > [1] https://lore.kernel.org/patchwork/patch/1297933/ > > [2] https://lkml.org/lkml/2020/12/8/1282 > > [3] > > https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311 > > > > Signed-off-by: Suren Baghdasaryan > > --- > > changes in v2: > > - Changed description of MADV_COLD per Michal Hocko's suggestion > > - Appled fixes suggested by Michael Kerrisk > > > > NAME > > process_madvise - give advice about use of memory to a process > > s/-/\-/ ack > > > > > SYNOPSIS > > #include > > > > ssize_t process_madvise(int pidfd, > >const struct iovec *iovec, > >unsigned long vlen, > >int advice, > >unsigned int flags); > > > > DESCRIPTION > > The process_madvise() system call is used to give advice or directions > > to the kernel about the address ranges of other process as well as of > > the calling process. It provides the advice to address ranges of process > > described by iovec and vlen. The goal of such advice is to improve > > system > > or application performance. > > > > The pidfd argument is a PID file descriptor (see pidofd_open(2)) that > > specifies the process to which the advice is to be applied. > > > > The pointer iovec points to an array of iovec structures, defined in > > as: > > > > struct iovec { > > void *iov_base;/* Starting address */ > > size_t iov_len; /* Number of bytes to transfer */ > > }; > > > > The iovec structure describes address ranges beginning at iov_base > > address > > and with the size of iov_len bytes. > > > > The vlen represents the number of elements in the iovec structure. > > > > The advice argument is one of the values listed below. > > > > Linux-specific advice values > > The following Linux-specific advice values have no counterparts in the > > POSIX-specified posix_madvise(3), and may or may not have counterparts > > in the madvise(2) interface available on other implementations. > > > > MADV_COLD (since Linux 5.4.1) > > Deactive a given range of pages which will make them a more probable > > reclaim target should there be a memory pressure. This is a non- > > destructive operation. The advice might be ignored for some pages in > > the range when it is not applicable. > > > > MADV_PAGEOUT (since Linux 5.4.1) > > Reclaim a given range of pages. This is done to free up memory > > occupied > > by these pages. If a page is anonymous it will be swapped out. If a > > page is file-backed and dirty it will be written back to the backing > > storage. The advice might be ignored for some pages in the range > > when > > it is not applicable. > > > > The flags argument is reserved for future use; currently, this argument > > must be specified as 0. 
> > > > The value specified in the vlen argument must be less than or equal to > > IOV_MAX (defined in or accessible via the call > > sysconf(_SC_IOV_MAX)). > > > > The vlen and iovec arguments are checked before applying any hints. If > > the vlen is too big, or iovec is invalid, an error will be returned > > immediately. > > > > The hint might be applied to a part of iovec if one of its elements > > points > > to an invalid memory region in the remote process. No further elements > > will > > be processed beyond that point. > > > > Permission to provide a hint to another process is governed by
[PATCH v3 1/1] process_madvise.2: Add process_madvise man page
Initial version of process_madvise(2) manual page. Initial text was extracted from [1], amended after fix [2] and more details added using man pages of madvise(2) and process_vm_readv(2) as examples. It also includes the changes to required permission proposed in [3]. [1] https://lore.kernel.org/patchwork/patch/1297933/ [2] https://lkml.org/lkml/2020/12/8/1282 [3] https://patchwork.kernel.org/project/selinux/patch/2021070622.2613577-1-sur...@google.com/#23888311 Signed-off-by: Suren Baghdasaryan Reviewed-by: Michal Hocko --- changes in v2: - Changed description of MADV_COLD per Michal Hocko's suggestion - Applied fixes suggested by Michael Kerrisk changes in v3: - Added Michal's Reviewed-by - Applied additional fixes suggested by Michael Kerrisk NAME process_madvise - give advice about use of memory to a process SYNOPSIS #include ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges of another process or the calling process. It provides the advice to the address ranges described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd argument is a PID file descriptor (see pidfd_open(2)) that specifies the process to which the advice is to be applied. The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* Starting address */ size_t iov_len; /* Number of bytes to transfer */ }; The iovec structure describes address ranges beginning at iov_base address and with the size of iov_len bytes. The vlen represents the number of elements in the iovec structure. The advice argument is one of the values listed below. Linux-specific advice values The following Linux-specific advice values have no counterparts in the POSIX-specified posix_madvise(3), and may or may not have counterparts in the madvise(2) interface available on other implementations. MADV_COLD (since Linux 5.4.1) Deactivate a given range of pages, which will make them a more probable reclaim target should there be memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable. MADV_PAGEOUT (since Linux 5.4.1) Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous it will be swapped out. If a page is file-backed and dirty it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable. The flags argument is reserved for future use; currently, this argument must be specified as 0. The value specified in the vlen argument must be less than or equal to IOV_MAX (defined in <limits.h> or accessible via the call sysconf(_SC_IOV_MAX)). The vlen and iovec arguments are checked before applying any hints. If the vlen is too big, or iovec is invalid, an error will be returned immediately and no advice will be applied. The hint might be applied to a part of iovec if one of its elements points to an invalid memory region in the remote process. No further elements will be processed beyond that point. Permission to provide a hint to another process is governed by a ptrace access mode PTRACE_MODE_READ_REALCREDS check (see ptrace(2)); in addition, the caller must have the CAP_SYS_ADMIN capability due to performance implications of applying the hint. 
RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred after some iovec elements were already processed. The caller should check the return value to determine whether a partial advice occurred. On error, -1 is returned and errno is set to indicate the error. ERRORS EBADF pidfd is not a valid PID file descriptor. EFAULT The memory described by iovec is outside the accessible address space of the process referred to by pidfd. EINVAL flags is not 0. EINVAL The sum of the iov_len values of iovec overflows a ssize_t value. EINVAL vlen is too large. ENOMEM Could not allocate memory for internal copies of the iovec structures. EPERM The caller does not have permission to access the address space of the process pidfd. ESRCH The target process does not exist (i.e., it has terminated and been waited on). VERSIONS This system call first appeared in Linux 5.10. Support for t
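Since the page allows partial advice, a caller passing several ranges may want to translate the returned byte count back into which element processing stopped at. Here is a small sketch of that bookkeeping; it is pure arithmetic on the documented return value and assumes nothing beyond the semantics described above.

/* Sketch: map process_madvise()'s return value back to the first iovec
 * element that was not fully advised; returns -1 if all were advised. */
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t first_unadvised(const struct iovec *vec, size_t vlen,
                               ssize_t advised)
{
    for (size_t i = 0; i < vlen; i++) {
        if ((size_t)advised < vec[i].iov_len)
            return (ssize_t)i;      /* element i was skipped or partial */
        advised -= (ssize_t)vec[i].iov_len;
    }
    return -1;                      /* every element was fully advised */
}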
Re: [PATCH v2 1/1] mm/madvise: replace ptrace attach requirement for process_madvise
On Thu, Jan 28, 2021 at 11:08 PM Suren Baghdasaryan wrote: > > On Thu, Jan 28, 2021 at 11:51 AM Suren Baghdasaryan wrote: > > > > On Tue, Jan 26, 2021 at 5:52 AM 'Michal Hocko' via kernel-team > > wrote: > > > > > > On Wed 20-01-21 14:17:39, Jann Horn wrote: > > > > On Wed, Jan 13, 2021 at 3:22 PM Michal Hocko wrote: > > > > > On Tue 12-01-21 09:51:24, Suren Baghdasaryan wrote: > > > > > > On Tue, Jan 12, 2021 at 9:45 AM Oleg Nesterov > > > > > > wrote: > > > > > > > > > > > > > > On 01/12, Michal Hocko wrote: > > > > > > > > > > > > > > > > On Mon 11-01-21 09:06:22, Suren Baghdasaryan wrote: > > > > > > > > > > > > > > > > > What we want is the ability for one process to influence > > > > > > > > > another process > > > > > > > > > in order to optimize performance across the entire system > > > > > > > > > while leaving > > > > > > > > > the security boundary intact. > > > > > > > > > Replace PTRACE_MODE_ATTACH with a combination of > > > > > > > > > PTRACE_MODE_READ > > > > > > > > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR > > > > > > > > > metadata > > > > > > > > > and CAP_SYS_NICE for influencing process performance. > > > > > > > > > > > > > > > > I have to say that ptrace modes are rather obscure to me. So I > > > > > > > > cannot > > > > > > > > really judge whether MODE_READ is sufficient. My understanding > > > > > > > > has > > > > > > > > always been that this is requred to RO access to the address > > > > > > > > space. But > > > > > > > > this operation clearly has a visible side effect. Do we have > > > > > > > > any actual > > > > > > > > documentation for the existing modes? > > > > > > > > > > > > > > > > I would be really curious to hear from Jann and Oleg (now Cced). > > > > > > > > > > > > > > Can't comment, sorry. I never understood these security checks > > > > > > > and never tried. > > > > > > > IIUC only selinux/etc can treat ATTACH/READ differently and I > > > > > > > have no idea what > > > > > > > is the difference. > > > > > > > > Yama in particular only does its checks on ATTACH and ignores READ, > > > > that's the difference you're probably most likely to encounter on a > > > > normal desktop system, since some distros turn Yama on by default. > > > > Basically the idea there is that running "gdb -p $pid" or "strace -p > > > > $pid" as a normal user will usually fail, but reading /proc/$pid/maps > > > > still works; so you can see things like detailed memory usage > > > > information and such, but you're not supposed to be able to directly > > > > peek into a running SSH client and inject data into the existing SSH > > > > connection, or steal the cryptographic keys for the current > > > > connection, or something like that. > > > > > > > > > > I haven't seen a written explanation on ptrace modes but when I > > > > > > consulted Jann his explanation was: > > > > > > > > > > > > PTRACE_MODE_READ means you can inspect metadata about processes with > > > > > > the specified domain, across UID boundaries. > > > > > > PTRACE_MODE_ATTACH means you can fully impersonate processes with > > > > > > the > > > > > > specified domain, across UID boundaries. > > > > > > > > > > Maybe this would be a good start to document expectations. Some more > > > > > practical examples where the difference is visible would be great as > > > > > well. > > > > > > > > Before documenting the behavior, it would be a good idea to figure out > > > > what to do with perf_event_open(). 
That one's weird in that it only > > > > requires PTRACE_MODE_READ, but actually allows you to sample stuff > > > > like userspace stack and
Re: [PATCH 1/1] dma-buf: heaps: Map system heap pages as managed by linux vm
On Mon, Feb 1, 2021 at 11:03 PM Christoph Hellwig wrote: > > IMHO the > > BUG_ON(vma->vm_flags & VM_PFNMAP); > > in vm_insert_page should just become a WARN_ON_ONCE with an error > return, and then we just need to gradually fix up the callers that > trigger it instead of coming up with workarounds like this. For the existing vm_insert_page users this should be fine since BUG_ON() guarantees that none of them sets VM_PFNMAP. However, for the system_heap_mmap I have one concern. When vm_insert_page returns an error due to VM_PFNMAP flag, the whole mmap operation should fail (system_heap_mmap returning an error leading to dma_buf_mmap failure). Could there be cases when a heap user (DRM driver for example) would be expected to work with a heap which requires VM_PFNMAP and at the same time with another heap which requires !VM_PFNMAP? IOW, this introduces a dependency between the heap and its user. The user would have to know expectations of the heap it uses and can't work with another heap that has the opposite expectation. This usecase is purely theoretical and maybe I should not worry about it for now?
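As I read Christoph's suggestion, the idea would look roughly like the sketch below (illustrative only, not the actual mm/memory.c change): the VM_PFNMAP case becomes a recoverable error with a one-time warning, so a caller such as system_heap_mmap() simply propagates the failure back through dma_buf_mmap() instead of hitting a kernel BUG.

/* Sketch of the idea only: where vm_insert_page() currently BUG()s on a
 * VM_PFNMAP vma, fail gracefully so the mmap() caller sees an error.
 * Shown as a wrapper for clarity; the real change would live inside
 * vm_insert_page() itself. */
static int vm_insert_page_checked(struct vm_area_struct *vma,
                                  unsigned long addr, struct page *page)
{
    if (WARN_ON_ONCE(vma->vm_flags & VM_PFNMAP))
        return -EINVAL;
    return vm_insert_page(vma, addr, page);
}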
Re: [PATCH] mm: cma: support sysfs
On Thu, Feb 4, 2021 at 3:14 PM John Hubbard wrote: > > On 2/4/21 12:07 PM, Minchan Kim wrote: > > On Thu, Feb 04, 2021 at 12:50:58AM -0800, John Hubbard wrote: > >> On 2/3/21 7:50 AM, Minchan Kim wrote: > >>> Since CMA is getting used more widely, it's more important to > >>> keep monitoring CMA statistics for system health since it's > >>> directly related to user experience. > >>> > >>> This patch introduces sysfs for the CMA and exposes stats below > >>> to keep monitor for telemetric in the system. > >>> > >>>* the number of CMA allocation attempts > >>>* the number of CMA allocation failures > >>>* the number of CMA page allocation attempts > >>>* the number of CMA page allocation failures > >> > >> The desire to report CMA data is understandable, but there are a few > >> odd things here: > >> > >> 1) First of all, this has significant overlap with /sys/kernel/debug/cma > >> items. I suspect that all of these items could instead go into > > > > At this moment, I don't see any overlap with item from cma_debugfs. > > Could you specify what item you are mentioning? > > Just the fact that there would be two systems under /sys, both of which are > doing very very similar things: providing information that is intended to > help diagnose CMA. > > > > >> /sys/kernel/debug/cma, right? > > > > Anyway, thing is I need an stable interface for that and need to use > > it in Android production build, too(Unfortunately, Android deprecated > > the debugfs > > https://source.android.com/setup/start/android-11-release#debugfs > > ) > > That's the closest hint to a "why this is needed" that we've seen yet. > But it's only a hint. > > > > > What should be in debugfs and in sysfs? What's the criteria? > > Well, it's a gray area. "Debugging support" goes into debugfs, and > "production-level monitoring and control" goes into sysfs, roughly > speaking. And here you have items that could be classified as either. > > > > > Some statistic could be considered about debugging aid or telemetric > > depening on view point and usecase. And here, I want to use it for > > telemetric, get an stable interface and use it in production build > > of Android. In this chance, I'd like to get concrete guideline > > what should be in sysfs and debugfs so that pointing out this thread > > whenever folks dump their stat into sysfs to avoid waste of time > > for others in future. :) > > > >> > >> 2) The overall CMA allocation attempts/failures (first two items above) > >> seem > >> an odd pair of things to track. Maybe that is what was easy to track, but > >> I'd > >> vote for just omitting them. > > > > Then, how to know how often CMA API failed? > > Why would you even need to know that, *in addition* to knowing specific > page allocation numbers that failed? Again, there is no real-world motivation > cited yet, just "this is good data". Need more stories and support here. IMHO it would be very useful to see whether there are multiple small-order allocation failures or a few large-order ones, especially for CMA where large allocations are not unusual. For that I believe both alloc_pages_attempt and alloc_pages_fail would be required. > > > thanks, > -- > John Hubbard > NVIDIA > > > There are various size allocation request for a CMA so only page > > allocation stat are not enough to know it. 
> > > >>> > >>> Signed-off-by: Minchan Kim > >>> --- > >>>Documentation/ABI/testing/sysfs-kernel-mm-cma | 39 + > >>>include/linux/cma.h | 1 + > >>>mm/Makefile | 1 + > >>>mm/cma.c | 6 +- > >>>mm/cma.h | 20 +++ > >>>mm/cma_sysfs.c| 143 ++ > >>>6 files changed, 209 insertions(+), 1 deletion(-) > >>>create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-cma > >>>create mode 100644 mm/cma_sysfs.c > >>> > >>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma > >>> b/Documentation/ABI/testing/sysfs-kernel-mm-cma > >>> new file mode 100644 > >>> index ..2a43c0aacc39 > >>> --- /dev/null > >>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma > >>> @@ -0,0 +1,39 @@ > >>> +What: /sys/kernel/mm/cma/ > >>> +Date: Feb 2021 > >>> +Contact: Minchan Kim > >>> +Description: > >>> + /sys/kernel/mm/cma/ contains a number of subdirectories by > >>> + cma-heap name. The subdirectory contains a number of files > >>> + to represent cma allocation statistics. > >> > >> Somewhere, maybe here, there should be a mention of the closely related > >> /sys/kernel/debug/cma files. > >> > >>> + > >>> + There are number of files under > >>> + /sys/kernel/mm/cma/ directory > >>> + > >>> + - cma_alloc_attempt > >>> + - cma_alloc_fail > >> > >> Are these really useful? They a summary of th
Re: [PATCH] mm: cma: support sysfs
On Thu, Feb 4, 2021 at 3:43 PM Suren Baghdasaryan wrote: > > On Thu, Feb 4, 2021 at 3:14 PM John Hubbard wrote: > > > > On 2/4/21 12:07 PM, Minchan Kim wrote: > > > On Thu, Feb 04, 2021 at 12:50:58AM -0800, John Hubbard wrote: > > >> On 2/3/21 7:50 AM, Minchan Kim wrote: > > >>> Since CMA is getting used more widely, it's more important to > > >>> keep monitoring CMA statistics for system health since it's > > >>> directly related to user experience. > > >>> > > >>> This patch introduces sysfs for the CMA and exposes stats below > > >>> to keep monitor for telemetric in the system. > > >>> > > >>>* the number of CMA allocation attempts > > >>>* the number of CMA allocation failures > > >>>* the number of CMA page allocation attempts > > >>>* the number of CMA page allocation failures > > >> > > >> The desire to report CMA data is understandable, but there are a few > > >> odd things here: > > >> > > >> 1) First of all, this has significant overlap with /sys/kernel/debug/cma > > >> items. I suspect that all of these items could instead go into > > > > > > At this moment, I don't see any overlap with item from cma_debugfs. > > > Could you specify what item you are mentioning? > > > > Just the fact that there would be two systems under /sys, both of which are > > doing very very similar things: providing information that is intended to > > help diagnose CMA. > > > > > > > >> /sys/kernel/debug/cma, right? > > > > > > Anyway, thing is I need an stable interface for that and need to use > > > it in Android production build, too(Unfortunately, Android deprecated > > > the debugfs > > > https://source.android.com/setup/start/android-11-release#debugfs > > > ) > > > > That's the closest hint to a "why this is needed" that we've seen yet. > > But it's only a hint. > > > > > > > > What should be in debugfs and in sysfs? What's the criteria? > > > > Well, it's a gray area. "Debugging support" goes into debugfs, and > > "production-level monitoring and control" goes into sysfs, roughly > > speaking. And here you have items that could be classified as either. > > > > > > > > Some statistic could be considered about debugging aid or telemetric > > > depening on view point and usecase. And here, I want to use it for > > > telemetric, get an stable interface and use it in production build > > > of Android. In this chance, I'd like to get concrete guideline > > > what should be in sysfs and debugfs so that pointing out this thread > > > whenever folks dump their stat into sysfs to avoid waste of time > > > for others in future. :) > > > > > >> > > >> 2) The overall CMA allocation attempts/failures (first two items above) > > >> seem > > >> an odd pair of things to track. Maybe that is what was easy to track, > > >> but I'd > > >> vote for just omitting them. > > > > > > Then, how to know how often CMA API failed? > > > > Why would you even need to know that, *in addition* to knowing specific > > page allocation numbers that failed? Again, there is no real-world > > motivation > > cited yet, just "this is good data". Need more stories and support here. > > IMHO it would be very useful to see whether there are multiple > small-order allocation failures or a few large-order ones, especially > for CMA where large allocations are not unusual. For that I believe > both alloc_pages_attempt and alloc_pages_fail would be required. Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would be required". 
> > > > > > > thanks, > > -- > > John Hubbard > > NVIDIA > > > > > There are various size allocation request for a CMA so only page > > > allocation stat are not enough to know it. > > > > > >>> > > >>> Signed-off-by: Minchan Kim > > >>> --- > > >>>Documentation/ABI/testing/sysfs-kernel-mm-cma | 39 + > > >>>include/linux/cma.h | 1 + > > >>>mm/Makefile | 1 + > > >>>mm/cma.c | 6 +- > > >>>mm/cma
Re: [PATCH] mm: cma: support sysfs
On Thu, Feb 4, 2021 at 4:34 PM John Hubbard wrote: > > On 2/4/21 4:25 PM, John Hubbard wrote: > > On 2/4/21 3:45 PM, Suren Baghdasaryan wrote: > > ... > >>>>>> 2) The overall CMA allocation attempts/failures (first two items > >>>>>> above) seem > >>>>>> an odd pair of things to track. Maybe that is what was easy to track, > >>>>>> but I'd > >>>>>> vote for just omitting them. > >>>>> > >>>>> Then, how to know how often CMA API failed? > >>>> > >>>> Why would you even need to know that, *in addition* to knowing specific > >>>> page allocation numbers that failed? Again, there is no real-world > >>>> motivation > >>>> cited yet, just "this is good data". Need more stories and support here. > >>> > >>> IMHO it would be very useful to see whether there are multiple > >>> small-order allocation failures or a few large-order ones, especially > >>> for CMA where large allocations are not unusual. For that I believe > >>> both alloc_pages_attempt and alloc_pages_fail would be required. > >> > >> Sorry, I meant to say "both cma_alloc_fail and alloc_pages_fail would > >> be required". > > > > So if you want to know that, the existing items are still a little too > > indirect > > to really get it right. You can only know the average allocation size, by > > dividing. Instead, we should provide the allocation size, for each count. > > > > The limited interface makes this a little awkward, but using zones/ranges > > could > > work: "for this range of allocation sizes, there were the following stats". > > Or, > > some other technique that I haven't thought of (maybe two items per file?) > > would > > be better. > > > > On the other hand, there's an argument for keeping this minimal and simple. > > That > > would probably lead us to putting in a couple of items into /proc/vmstat, > > as I > > just mentioned in my other response, and calling it good. True. I was thinking along these lines but per-order counters felt like maybe an overkill? I'm all for keeping it simple. > > ...and remember: if we keep it nice and minimal and clean, we can put it into > /proc/vmstat and monitor it. No objections from me. > > And then if a problem shows up, the more complex and advanced debugging data > can > go into debugfs's CMA area. And you're all set. > > If Android made up some policy not to use debugfs, then: > > a) that probably won't prevent engineers from using it anyway, for advanced > debugging, > and > > b) If (a) somehow falls short, then we need to talk about what Android's > plans are to > fill the need. And "fill up sysfs with debugfs items, possibly duplicating > some of them, > and generally making an unecessary mess, to compensate for not using debugfs" > is not > my first choice. :) > > > thanks, > -- > John Hubbard > NVIDIA
Re: [PATCH] mm: cma: support sysfs
On Thu, Feb 4, 2021 at 5:44 PM Minchan Kim wrote: > > On Thu, Feb 04, 2021 at 04:24:20PM -0800, John Hubbard wrote: > > On 2/4/21 4:12 PM, Minchan Kim wrote: > > ... > > > > > Then, how to know how often CMA API failed? > > > > > > > > Why would you even need to know that, *in addition* to knowing specific > > > > page allocation numbers that failed? Again, there is no real-world > > > > motivation > > > > cited yet, just "this is good data". Need more stories and support here. > > > > > > Let me give an example. > > > > > > Let' assume we use memory buffer allocation via CMA for bluetooth > > > enable of device. > > > If user clicks the bluetooth button in the phone but fail to allocate > > > the memory from CMA, user will still see bluetooth button gray. > > > User would think his touch was not enough powerful so he try clicking > > > again and fortunately CMA allocation was successful this time and > > > they will see bluetooh button enabled and could listen the music. > > > > > > Here, product team needs to monitor how often CMA alloc failed so > > > if the failure ratio is steadily increased than the bar, > > > it means engineers need to go investigation. > > > > > > Make sense? > > > > > > > Yes, except that it raises more questions: > > > > 1) Isn't this just standard allocation failure? Don't you already have a way > > to track that? > > > > Presumably, having the source code, you can easily deduce that a bluetooth > > allocation failure goes directly to a CMA allocation failure, right? > > > > Anyway, even though the above is still a little murky, I expect you're right > > that it's good to have *some* indication, somewhere about CMA behavior... > > > > Thinking about this some more, I wonder if this is really /proc/vmstat sort > > of data that we're talking about. It seems to fit right in there, yes? > > Thing is CMA instance are multiple, cma-A, cma-B, cma-C and each of CMA > heap has own specific scenario. /proc/vmstat could be bloated a lot > while CMA instance will be increased. Oh, I missed the fact that you need these stats per-CMA.
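To illustrate the telemetry use case Minchan describes, here is a hedged sketch of what a monitoring daemon might do. The /sys/kernel/mm/cma/<name>/ layout and the cma_alloc_attempt / cma_alloc_fail file names are taken from the patch under review and could still change, so treat them as assumptions.

/* Sketch: compute a per-CMA-heap allocation failure ratio from the
 * proposed sysfs files. Paths and file names are assumptions based on
 * the patch under discussion. */
#include <stdio.h>
#include <stdlib.h>

static long read_stat(const char *cma_name, const char *stat)
{
    char path[256];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/kernel/mm/cma/%s/%s", cma_name, stat);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

int main(int argc, char **argv)
{
    const char *name = argc > 1 ? argv[1] : "reserved";  /* hypothetical heap name */
    long attempts = read_stat(name, "cma_alloc_attempt");
    long fails = read_stat(name, "cma_alloc_fail");

    if (attempts <= 0 || fails < 0) {
        fprintf(stderr, "no stats for CMA area %s\n", name);
        return EXIT_FAILURE;
    }
    /* e.g. alert when the ratio crosses a product-defined bar */
    printf("%s: %ld of %ld allocations failed (%.2f%%)\n",
           name, fails, attempts, 100.0 * fails / attempts);
    return 0;
}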
Re: [Linaro-mm-sig] [PATCH 1/2] mm: replace BUG_ON in vm_insert_page with a return of an error
On Thu, Feb 4, 2021 at 7:55 AM Alex Deucher wrote: > > On Thu, Feb 4, 2021 at 3:16 AM Christian König > wrote: > > > > Am 03.02.21 um 22:41 schrieb Suren Baghdasaryan: > > > [SNIP] > > >>> How many semi-unrelated buffer accounting schemes does google come up > > >>> with? > > >>> > > >>> We're at three with this one. > > >>> > > >>> And also we _cannot_ required that all dma-bufs are backed by struct > > >>> page, so requiring struct page to make this work is a no-go. > > >>> > > >>> Second, we do not want to all get_user_pages and friends to work on > > >>> dma-buf, it causes all kinds of pain. Yes on SoC where dma-buf are > > >>> exclusively in system memory you can maybe get away with this, but > > >>> dma-buf is supposed to work in more places than just Android SoCs. > > >> I just realized that vm_inser_page doesn't even work for CMA, it would > > >> upset get_user_pages pretty badly - you're trying to pin a page in > > >> ZONE_MOVEABLE but you can't move it because it's rather special. > > >> VM_SPECIAL is exactly meant to catch this stuff. > > > Thanks for the input, Daniel! Let me think about the cases you pointed > > > out. > > > > > > IMHO, the issue with PSS is the difficulty of calculating this metric > > > without struct page usage. I don't think that problem becomes easier > > > if we use cgroups or any other API. I wanted to enable existing PSS > > > calculation mechanisms for the dmabufs known to be backed by struct > > > pages (since we know how the heap allocated that memory), but sounds > > > like this would lead to problems that I did not consider. > > > > Yeah, using struct page indeed won't work. We discussed that multiple > > times now and Daniel even has a patch to mangle the struct page pointers > > inside the sg_table object to prevent abuse in that direction. > > > > On the other hand I totally agree that we need to do something on this > > side which goes beyong what cgroups provide. > > > > A few years ago I came up with patches to improve the OOM killer to > > include resources bound to the processes through file descriptors. I > > unfortunately can't find them of hand any more and I'm currently to busy > > to dig them up. > > https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html > I think there was a more recent discussion, but I can't seem to find it. Thanks for the pointer! Appreciate the time everyone took to explain the issues. Thanks, Suren. > > Alex > > > > > In general I think we need to make it possible that both the in kernel > > OOM killer as well as userspace processes and handlers have access to > > that kind of data. > > > > The fdinfo approach as suggested in the other thread sounds like the > > easiest solution to me. > > > > Regards, > > Christian. > > > > > Thanks, > > > Suren. > > > > > > > > > > ___ > > dri-devel mailing list > > dri-de...@lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
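For completeness, a hedged sketch of the fdinfo approach mentioned at the end: recent kernels print size: and exp_name: lines in /proc/<pid>/fdinfo/<fd> for dma-buf file descriptors (field names assumed from the dma-buf fdinfo handler), so userspace can at least sum per-fd dma-buf sizes for a process. The obvious caveat is that this is per-fd accounting, not PSS, so a buffer shared by several processes is counted in full by each holder.

/* Sketch: sum the sizes of dma-buf fds held by one process by scanning
 * /proc/<pid>/fdinfo. The "size:" and "exp_name:" field names are
 * assumptions about the dma-buf fdinfo format. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return EXIT_FAILURE;
    }

    char dirpath[64];
    snprintf(dirpath, sizeof(dirpath), "/proc/%s/fdinfo", argv[1]);
    DIR *dir = opendir(dirpath);
    if (!dir) {
        perror("opendir");
        return EXIT_FAILURE;
    }

    unsigned long long total = 0;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        char path[512], line[256];
        unsigned long long size = 0;
        int is_dmabuf = 0;

        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "size: %llu", &size) == 1)
                continue;
            if (strncmp(line, "exp_name:", 9) == 0)
                is_dmabuf = 1;
        }
        fclose(f);
        if (is_dmabuf)
            total += size;
    }
    closedir(dir);
    printf("dma-buf bytes held via fds: %llu\n", total);
    return 0;
}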