Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Konstantin Khlebnikov

On 23.07.2019 0:32, Joel Fernandes (Google) wrote:

The page_idle tracking feature currently requires looking up the pagemap
for a process followed by interacting with /sys/kernel/mm/page_idle.
This is quite cumbersome and can be error-prone too: if something changes
with the page between accessing the per-PID pagemap and the global
page_idle bitmap, the information is not accurate. Moreover, looking up
the PFN from the pagemap on Android devices is not supported for
unprivileged processes; it requires SYS_ADMIN and otherwise gives 0 for
the PFN.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file. This
eliminates the need for userspace to calculate the mapping of the page.
It follows the exact same semantics as the global
/sys/kernel/mm/page_idle; however, it is easier to use for some use cases
where looking up the PFN is not needed, and it does not require SYS_ADMIN.
It ended up simplifying the userspace code, solving the security issue
mentioned, and working quite well. SELinux does not need to be turned off
since no pagemap lookup is needed.

In Android, we are using this for the heap profiler (heapprofd), which
profiles and pinpoints code paths that allocate memory and leave it
idle for long periods of time.

Documentation material:
The idle page tracking API for virtual address indexing using virtual page
frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
except that it uses virtual instead of physical frame numbers.

This idle page tracking API can be simpler to use than physical address
indexing, since the pagemap for a process does not need to be looked up
to mark or read a page's idle bit. It is also more accurate than
physical address indexing since in physical address indexing, address
space changes can occur between reading the pagemap and reading the
bitmap. In virtual address indexing, the process's mmap_sem is held for
the duration of the access.


Maybe integrate this into existing interface: /proc/pid/clear_refs and
/proc/pid/pagemap ?

I.e.  echo X > /proc/pid/clear_refs clears reference bits in ptes and
marks pages idle only for pages mapped in this process.
And idle bit in /proc/pid/pagemap tells that page is still idle in this process.
This is faster - we don't need to walk the whole rmap for that.



Cc: vdavydov@gmail.com
Cc: Brendan Gregg 
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) 

---
Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

  fs/proc/base.c|   3 +
  fs/proc/internal.h|   1 +
  fs/proc/task_mmu.c|  57 +++
  include/linux/page_idle.h |   4 +
  mm/page_idle.c| 305 +-
  5 files changed, 330 insertions(+), 40 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 77eb628ecc7f..a58dd74606e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("smaps",  S_IRUGO, proc_pid_smaps_operations),
REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
REG("pagemap",S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+   REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
  #endif
  #ifdef CONFIG_SECURITY
	DIR("attr",   S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
  extern const struct file_operations proc_pid_smaps_rollup_operations;
  extern const struct file_operations proc_clear_refs_operations;
  extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
  
  extern unsigned long task_vsize(struct mm_struct *);

  extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4d2b860dbc3f..11ccc53da38e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
.open   = pagemap_open,
.release= pagemap_release,
  };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+  size_t count, loff_t *ppos)
+{
+   int ret;
+   struct task_struct *tsk = get_proc_task(file_inode(file));
+
+   if (!tsk)
+   return -EINVAL;
+   ret = page_idle_proc_read(file, buf, count, ppos, tsk);
+   put_task_struct(tsk);
+   return ret;
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+size_t coun

[PATCH v3 1/2] Documentation: perf: Update documentation for ThunderX2 PMU uncore driver

2019-07-23 Thread Ganapatrao Kulkarni
From: Ganapatrao Kulkarni 

Add documentation for Cavium Coherent Processor Interconnect (CCPI2) PMU.

Signed-off-by: Ganapatrao Kulkarni 
---
 .../admin-guide/perf/thunderx2-pmu.rst| 20 ++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst
index 08e33675853a..01f158238ae1 100644
--- a/Documentation/admin-guide/perf/thunderx2-pmu.rst
+++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst
@@ -3,24 +3,26 @@ Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
 =
 
 The ThunderX2 SoC PMU consists of independent, system-wide, per-socket
-PMUs such as the Level 3 Cache (L3C) and DDR4 Memory Controller (DMC).
+PMUs such as the Level 3 Cache (L3C), DDR4 Memory Controller (DMC) and
+Cavium Coherent Processor Interconnect (CCPI2).
 
 The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles.
 Events are counted for the default channel (i.e. channel 0) and prorated
 to the total number of channels/tiles.
 
-The DMC and L3C support up to 4 counters. Counters are independently
-programmable and can be started and stopped individually. Each counter
-can be set to a different event. Counters are 32-bit and do not support
-an overflow interrupt; they are read every 2 seconds.
+The DMC and L3C support up to 4 counters, while the CCPI2 supports up to 8
+counters. Counters are independently programmable to different events and
+can be started and stopped individually. None of the counters support an
+overflow interrupt. DMC and L3C counters are 32-bit and read every 2 seconds.
+The CCPI2 counters are 64-bit and assumed not to overflow in normal operation.
 
 PMU UNCORE (perf) driver:
 
 The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and
-L3C devices.  Each PMU can be used to count up to 4 events
-simultaneously. The PMUs provide a description of their available events
-and configuration options under sysfs, see
-/sys/devices/uncore_; S is the socket id.
+L3C devices.  Each PMU can be used to count up to 4 (DMC/L3C) or up to 8
+(CCPI2) events simultaneously. The PMUs provide a description of their
+available events and configuration options under sysfs, see
+/sys/devices/uncore_; S is the socket id.
 
 The driver does not support sampling, therefore "perf record" will not
 work. Per-task perf sessions are also not supported.
-- 
2.17.1



[PATCH v3 2/2] drivers/perf: Add CCPI2 PMU support in ThunderX2 UNCORE driver.

2019-07-23 Thread Ganapatrao Kulkarni
CCPI2 is a low-latency high-bandwidth serial interface for connecting
ThunderX2 processors. This patch adds support to capture CCPI2 perf events.

Signed-off-by: Ganapatrao Kulkarni 
---
 drivers/perf/thunderx2_pmu.c | 248 ++-
 1 file changed, 214 insertions(+), 34 deletions(-)

diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c
index 43d76c85da56..a4e1273eafa3 100644
--- a/drivers/perf/thunderx2_pmu.c
+++ b/drivers/perf/thunderx2_pmu.c
@@ -17,22 +17,31 @@
  */
 
 #define TX2_PMU_MAX_COUNTERS   4
+
 #define TX2_PMU_DMC_CHANNELS   8
 #define TX2_PMU_L3_TILES   16
 
 #define TX2_PMU_HRTIMER_INTERVAL   (2 * NSEC_PER_SEC)
-#define GET_EVENTID(ev)((ev->hw.config) & 0x1f)
-#define GET_COUNTERID(ev)  ((ev->hw.idx) & 0x3)
+#define GET_EVENTID(ev, mask)  ((ev->hw.config) & mask)
+#define GET_COUNTERID(ev, mask)((ev->hw.idx) & mask)
  /* 1 byte per counter(4 counters).
   * Event id is encoded in bits [5:1] of a byte,
   */
 #define DMC_EVENT_CFG(idx, val)((val) << (((idx) * 8) + 1))
 
+/* bits[3:0] to select counters, are indexed from 8 to 15. */
+#define CCPI2_COUNTER_OFFSET   8
+
 #define L3C_COUNTER_CTL0xA8
 #define L3C_COUNTER_DATA   0xAC
 #define DMC_COUNTER_CTL0x234
 #define DMC_COUNTER_DATA   0x240
 
+#define CCPI2_PERF_CTL 0x108
+#define CCPI2_COUNTER_CTL  0x10C
+#define CCPI2_COUNTER_SEL  0x12c
+#define CCPI2_COUNTER_DATA_L   0x130
+
 /* L3C event IDs */
 #define L3_EVENT_READ_REQ  0xD
 #define L3_EVENT_WRITEBACK_REQ 0xE
@@ -51,15 +60,23 @@
 #define DMC_EVENT_READ_TXNS0xF
 #define DMC_EVENT_MAX  0x10
 
+#define CCPI2_EVENT_REQ_PKT_SENT   0x3D
+#define CCPI2_EVENT_SNOOP_PKT_SENT 0x65
+#define CCPI2_EVENT_DATA_PKT_SENT  0x105
+#define CCPI2_EVENT_GIC_PKT_SENT   0x12D
+#define CCPI2_EVENT_MAX0x200
+
 enum tx2_uncore_type {
PMU_TYPE_L3C,
PMU_TYPE_DMC,
+   PMU_TYPE_CCPI2,
PMU_TYPE_INVALID,
 };
 
+
 /*
- * pmu on each socket has 2 uncore devices(dmc and l3c),
- * each device has 4 counters.
+ * pmu on each socket has 3 uncore devices (dmc, l3c and ccpi2);
+ * dmc and l3c have 4 counters each and ccpi2 has 8.
  */
 struct tx2_uncore_pmu {
struct hlist_node hpnode;
@@ -69,12 +86,14 @@ struct tx2_uncore_pmu {
int node;
int cpu;
u32 max_counters;
+   u32 counters_mask;
u32 prorate_factor;
u32 max_events;
+   u32 events_mask;
u64 hrtimer_interval;
void __iomem *base;
DECLARE_BITMAP(active_counters, TX2_PMU_MAX_COUNTERS);
-   struct perf_event *events[TX2_PMU_MAX_COUNTERS];
+   struct perf_event **events;
struct device *dev;
struct hrtimer hrtimer;
const struct attribute_group **attr_groups;
@@ -92,7 +111,21 @@ static inline struct tx2_uncore_pmu *pmu_to_tx2_pmu(struct pmu *pmu)
return container_of(pmu, struct tx2_uncore_pmu, pmu);
 }
 
-PMU_FORMAT_ATTR(event, "config:0-4");
+#define TX2_PMU_FORMAT_ATTR(_var, _name, _format)  \
+static ssize_t \
+__tx2_pmu_##_var##_show(struct device *dev,\
+  struct device_attribute *attr,   \
+  char *page)  \
+{  \
+   BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
+   return sprintf(page, _format "\n"); \
+}  \
+   \
+static struct device_attribute format_attr_##_var =\
+   __ATTR(_name, 0444, __tx2_pmu_##_var##_show, NULL)
+
+TX2_PMU_FORMAT_ATTR(event, event, "config:0-4");
+TX2_PMU_FORMAT_ATTR(event_ccpi2, event, "config:0-9");
 
 static struct attribute *l3c_pmu_format_attrs[] = {
&format_attr_event.attr,
@@ -104,6 +137,11 @@ static struct attribute *dmc_pmu_format_attrs[] = {
NULL,
 };
 
+static struct attribute *ccpi2_pmu_format_attrs[] = {
+   &format_attr_event_ccpi2.attr,
+   NULL,
+};
+
 static const struct attribute_group l3c_pmu_format_attr_group = {
.name = "format",
.attrs = l3c_pmu_format_attrs,
@@ -114,6 +152,11 @@ static const struct attribute_group dmc_pmu_format_attr_group = {
.attrs = dmc_pmu_format_attrs,
 };
 
+static const struct attribute_group ccpi2_pmu_format_attr_group = {
+   .name = "format",
+   .attrs = ccpi2_pmu_format_attrs,
+};
+
 /*
  * sysfs event attributes
  */
@@ -164,6 +207,19 @@ static struct attribute *dmc_p

[PATCH v3 0/2] Add CCPI2 PMU support

2019-07-23 Thread Ganapatrao Kulkarni
Add Cavium Coherent Processor Interconnect (CCPI2) PMU
support in ThunderX2 Uncore driver.

v3: Rebased to 5.3-rc1

v2: Updated with review comments [1]

[1] https://lkml.org/lkml/2019/6/14/965

v1: initial patch

Ganapatrao Kulkarni (2):
  Documentation: perf: Update documentation for ThunderX2 PMU uncore
driver
  drivers/perf: Add CCPI2 PMU support in ThunderX2 UNCORE driver.

 .../admin-guide/perf/thunderx2-pmu.rst|  20 +-
 drivers/perf/thunderx2_pmu.c  | 248 +++---
 2 files changed, 225 insertions(+), 43 deletions(-)

-- 
2.17.1



RE: [PATCH v5] Documentation/checkpatch: Prefer strscpy/strscpy_pad over strcpy/strlcpy/strncpy

2019-07-23 Thread Gote, Nitin R


> -Original Message-
> From: Joe Perches [mailto:j...@perches.com]
> Sent: Monday, July 22, 2019 11:11 PM
> To: Kees Cook ; Gote, Nitin R
> 
> Cc: cor...@lwn.net; a...@linux-foundation.org; a...@canonical.com;
> linux-doc@vger.kernel.org; kernel-harden...@lists.openwall.com
> Subject: Re: [PATCH v5] Documentation/checkpatch: Prefer
> strscpy/strscpy_pad over strcpy/strlcpy/strncpy
> 
> On Mon, 2019-07-22 at 10:30 -0700, Kees Cook wrote:
> > On Wed, Jul 17, 2019 at 10:00:05AM +0530, NitinGote wrote:
> > > From: Nitin Gote 
> > >
> > > Added check in checkpatch.pl to
> > > 1. Deprecate strcpy() in favor of strscpy().
> > > 2. Deprecate strlcpy() in favor of strscpy().
> > > 3. Deprecate strncpy() in favor of strscpy() or strscpy_pad().
> > >
> > > Updated strncpy() section in Documentation/process/deprecated.rst
> > > to cover strscpy_pad() case.
> > >
> > > Signed-off-by: Nitin Gote 
> >
> > Reviewed-by: Kees Cook 
> >
> > Joe, does this address your checkpatch concerns?
> 
> Well, kinda.
> 
> strscpy_pad isn't used anywhere in the kernel.
> 
> And
> 
> +"strncpy"=> "strscpy, strscpy_pad or for non-NUL-terminated strings, strncpy() can still be used, but destinations should be marked with __nonstring",
> 
> is a bit verbose.  This could be simply:
> 
> +"strncpy" => "strscpy - for non-NUL-terminated uses, strncpy() dst should be __nonstring",
> 

But if the destination buffer needs extra NUL-padding for the remaining
size of the destination, then the safe replacement is strscpy_pad(), right?
If yes, what is your opinion on the change below:

"strncpy" => "strscpy, strscpy_pad - for non-NUL-terminated uses, strncpy() dst should be __nonstring",


> And I still prefer adding stracpy as it
> reduces code verbosity and eliminates defects.
> 

-Nitin


Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Konstantin Khlebnikov

On 23.07.2019 11:43, Konstantin Khlebnikov wrote:

On 23.07.2019 0:32, Joel Fernandes (Google) wrote:

The page_idle tracking feature currently requires looking up the pagemap
for a process followed by interacting with /sys/kernel/mm/page_idle.
This is quite cumbersome and can be error-prone too: if something changes
with the page between accessing the per-PID pagemap and the global
page_idle bitmap, the information is not accurate. Moreover, looking up
the PFN from the pagemap on Android devices is not supported for
unprivileged processes; it requires SYS_ADMIN and otherwise gives 0 for
the PFN.

This patch adds support to directly interact with page_idle tracking at
the PID level by introducing a /proc/<pid>/page_idle file. This
eliminates the need for userspace to calculate the mapping of the page.
It follows the exact same semantics as the global
/sys/kernel/mm/page_idle; however, it is easier to use for some use cases
where looking up the PFN is not needed, and it does not require SYS_ADMIN.
It ended up simplifying the userspace code, solving the security issue
mentioned, and working quite well. SELinux does not need to be turned off
since no pagemap lookup is needed.

In Android, we are using this for the heap profiler (heapprofd), which
profiles and pinpoints code paths that allocate memory and leave it
idle for long periods of time.

Documentation material:
The idle page tracking API for virtual address indexing using virtual page
frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
except that it uses virtual instead of physical frame numbers.

This idle page tracking API can be simpler to use than physical address
indexing, since the pagemap for a process does not need to be looked up
to mark or read a page's idle bit. It is also more accurate than
physical address indexing since in physical address indexing, address
space changes can occur between reading the pagemap and reading the
bitmap. In virtual address indexing, the process's mmap_sem is held for
the duration of the access.


Maybe integrate this into existing interface: /proc/pid/clear_refs and
/proc/pid/pagemap ?

I.e.  echo X > /proc/pid/clear_refs clears reference bits in ptes and
marks pages idle only for pages mapped in this process.
And idle bit in /proc/pid/pagemap tells that page is still idle in this process.
This is faster - we don't need to walk the whole rmap for that.


Moreover, this is so cheap that it could be counted and shown in smaps.
Unlike clearing real access bits, this does not disrupt the memory reclaimer.
A killer feature.





Cc: vdavydov@gmail.com
Cc: Brendan Gregg 
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) 

---
Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

  fs/proc/base.c    |   3 +
  fs/proc/internal.h    |   1 +
  fs/proc/task_mmu.c    |  57 +++
  include/linux/page_idle.h |   4 +
  mm/page_idle.c    | 305 +-
  5 files changed, 330 insertions(+), 40 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 77eb628ecc7f..a58dd74606e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
  REG("smaps",  S_IRUGO, proc_pid_smaps_operations),
  REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
  REG("pagemap",    S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+    REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
  #endif
  #ifdef CONFIG_SECURITY
  DIR("attr",   S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, 
proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations 
proc_pid_smaps_operations;
  extern const struct file_operations proc_pid_smaps_rollup_operations;
  extern const struct file_operations proc_clear_refs_operations;
  extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
  extern unsigned long task_vsize(struct mm_struct *);
  extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4d2b860dbc3f..11ccc53da38e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
  .open    = pagemap_open,
  .release    = pagemap_release,
  };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+   size_t count, loff_t *ppos)
+{
+    int ret;
+    struct task_struct *tsk = get_proc_task(file_inode(file));
+
+    if (!tsk)
+    return -EINVAL;
+    ret = page_idle_proc_read(file, buf, count, ppos, tsk);
+    put_task_struct(tsk)

[PATCH] Documentation: filesystem: fix "Removed Sysctls" table

2019-07-23 Thread Sheriff Esseson
The "Removed Sysctls" section is a table - bring it alive with ReST.

Signed-off-by: Sheriff Esseson 
---
 Documentation/admin-guide/xfs.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index e76665a8f2f2..fb5b39f73059 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -337,11 +337,12 @@ None at present.
 Removed Sysctls
 ===
 
+=  ===
   Name Removed
-   ---
+=  ===
   fs.xfs.xfsbufd_centisec  v4.0
   fs.xfs.age_buffer_centisecs  v4.0
-
+=  ===
 
 Error handling
 ==
-- 
2.22.0



Re: [PATCH 1/2] arm64: mm: Restore mm_cpumask (revert commit 38d96287504a ("arm64: mm: kill mm_cpumask usage"))

2019-07-23 Thread Catalin Marinas
Hi,

I know Will is on the case but just expressing some thoughts of my own.

On Mon, Jun 17, 2019 at 11:32:54PM +0900, Takao Indoh wrote:
> From: Takao Indoh 
> 
> mm_cpumask was deleted by the commit 38d96287504a ("arm64: mm: kill
> mm_cpumask usage") because it was not used at that time. Now it is needed
> to find the appropriate CPUs for TLB flush, so this patch reverts that commit.
> 
> Signed-off-by: QI Fuli 
> Signed-off-by: Takao Indoh 
> ---
>  arch/arm64/include/asm/mmu_context.h | 7 ++-
>  arch/arm64/kernel/smp.c  | 6 ++
>  arch/arm64/mm/context.c  | 2 ++
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 2da3e478fd8f..21ef11590bcb 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -241,8 +241,13 @@ static inline void
>  switch_mm(struct mm_struct *prev, struct mm_struct *next,
> struct task_struct *tsk)
>  {
> - if (prev != next)
> + unsigned int cpu = smp_processor_id();
> +
> + if (prev != next) {
>   __switch_mm(next);
> + cpumask_clear_cpu(cpu, mm_cpumask(prev));
> + local_flush_tlb_mm(prev);
> + }

That's not actually a revert, as we've never flushed the TLBs on the
switch_mm() path. Also, this flush is not sufficient on a CnP-capable
CPU, since another thread of the same CPU could have the prev TTBR0_EL1
value set and be loading the TLB back.

-- 
Catalin


Re: [PATCH 2/2] arm64: tlb: Add boot parameter to disable TLB flush within the same inner shareable domain

2019-07-23 Thread Catalin Marinas
On Mon, Jun 17, 2019 at 11:32:55PM +0900, Takao Indoh wrote:
> From: Takao Indoh 
> 
> This patch adds new boot parameter 'disable_tlbflush_is' to disable TLB
> flush within the same inner shareable domain for performance tuning.
> 
> In the case of flush_tlb_mm() *without* this parameter, a TLB entry is
> invalidated by __tlbi(aside1is, asid). With this instruction, all CPUs within
> the same inner shareable domain check whether they have TLB entries with
> this ASID; this causes performance noise, especially in large-scale HPC
> environments, which have more than a thousand nodes with a low-latency
> interconnect.
> 
> When this new parameter is specified, a TLB entry is invalidated by
> __tlbi(aside1, asid) only on the CPUs specified by mm_cpumask(mm).
> Therefore the TLB flush is done on a minimal set of CPUs and the
> performance problem does not occur.
> 
> Signed-off-by: QI Fuli 
> Signed-off-by: Takao Indoh 
[...]
> +void flush_tlb_mm(struct mm_struct *mm)
> +{
> + if (disable_tlbflush_is)
> + on_each_cpu_mask(mm_cpumask(mm), ipi_flush_tlb_mm,
> +  (void *)mm, true);
> + else
> + __flush_tlb_mm(mm);
> +}

Could we try instead to call a _nosync variant here when cpumask_weight()
is 1, or the *IS variant if greater than 1, and avoid the IPI?

Will tried this in the past but because of the task placement after
fork()+execve(), I think we always ended up with a weight of 2 in the
child process. Your first patch "solves" this by flushing the TLBs on
context switch (bar the CnP case). Can you give it a try to see if it
improves things? At least it's a starting point for further
investigation.

I fully agree with Will that we don't want two different TLB handling
implementations in the arm64 kernel and even less desirable to have a
command line option.

Thanks.

-- 
Catalin


[PATCH] Documentation/features/locking: update lists

2019-07-23 Thread Mark Rutland
The locking feature lists don't match reality as of v5.3-rc1:

* arm64 moved to queued spinlocks in commit:

  c11090474d70590170cf5fa6afe85864ab494b37

  ("arm64: locking: Replace ticket lock implementation with qspinlock")

* xtensa moved to queued spinlocks and rwlocks in commit:

  579afe866f52adcd921272a224ab36733051059c

  ("xtensa: use generic spinlock/rwlock implementation")

* architecture-specific rwsem support was removed in commit:

  46ad0840b1584b92b5ff2cc3ed0b011dd6b8e0f1

  ("locking/rwsem: Remove arch specific rwsem files")

So update the feature lists accordingly, and remove the now redundant
rwsem-optimized list.

Signed-off-by: Mark Rutland 
Cc: Jonathan Corbet 
Cc: linux-doc@vger.kernel.org
---
 .../locking/queued-rwlocks/arch-support.txt|  2 +-
 .../locking/queued-spinlocks/arch-support.txt  |  4 +--
 .../locking/rwsem-optimized/arch-support.txt   | 34 --
 3 files changed, 3 insertions(+), 37 deletions(-)
 delete mode 100644 Documentation/features/locking/rwsem-optimized/arch-support.txt

diff --git a/Documentation/features/locking/queued-rwlocks/arch-support.txt b/Documentation/features/locking/queued-rwlocks/arch-support.txt
index c683da198f31..ee922746a64c 100644
--- a/Documentation/features/locking/queued-rwlocks/arch-support.txt
+++ b/Documentation/features/locking/queued-rwlocks/arch-support.txt
@@ -30,5 +30,5 @@
 |  um: | TODO |
 |   unicore32: | TODO |
 | x86: |  ok  |
-|  xtensa: | TODO |
+|  xtensa: |  ok  |
 ---
diff --git a/Documentation/features/locking/queued-spinlocks/arch-support.txt b/Documentation/features/locking/queued-spinlocks/arch-support.txt
index e3080b82aefd..c52116c1a049 100644
--- a/Documentation/features/locking/queued-spinlocks/arch-support.txt
+++ b/Documentation/features/locking/queued-spinlocks/arch-support.txt
@@ -9,7 +9,7 @@
 |   alpha: | TODO |
 | arc: | TODO |
 | arm: | TODO |
-|   arm64: | TODO |
+|   arm64: |  ok  |
 | c6x: | TODO |
 |csky: | TODO |
 |   h8300: | TODO |
@@ -30,5 +30,5 @@
 |  um: | TODO |
 |   unicore32: | TODO |
 | x86: |  ok  |
-|  xtensa: | TODO |
+|  xtensa: |  ok  |
 ---
diff --git a/Documentation/features/locking/rwsem-optimized/arch-support.txt b/Documentation/features/locking/rwsem-optimized/arch-support.txt
deleted file mode 100644
index 7521d7500fbe..
--- a/Documentation/features/locking/rwsem-optimized/arch-support.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-#
-# Feature name:  rwsem-optimized
-# Kconfig:   !RWSEM_GENERIC_SPINLOCK
-# description:   arch provides optimized rwsem APIs
-#
----
-| arch |status|
----
-|   alpha: |  ok  |
-| arc: | TODO |
-| arm: |  ok  |
-|   arm64: |  ok  |
-| c6x: | TODO |
-|csky: | TODO |
-|   h8300: | TODO |
-| hexagon: | TODO |
-|ia64: |  ok  |
-|m68k: | TODO |
-|  microblaze: | TODO |
-|mips: | TODO |
-|   nds32: | TODO |
-|   nios2: | TODO |
-|openrisc: | TODO |
-|  parisc: | TODO |
-| powerpc: | TODO |
-|   riscv: | TODO |
-|s390: |  ok  |
-|  sh: |  ok  |
-|   sparc: |  ok  |
-|  um: |  ok  |
-|   unicore32: | TODO |
-| x86: |  ok  |
-|  xtensa: |  ok  |
----
-- 
2.11.0



Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table

2019-07-23 Thread Jonathan Corbet
On Tue, 23 Jul 2019 12:48:13 +0100
Sheriff Esseson  wrote:

> The "Removed Sysctls" section is a table - bring it alive with ReST.
> 
> Signed-off-by: Sheriff Esseson 

So this appears to be identical to the patch you sent three days ago; is
there a reason why you are sending it again now?

Thanks,

jon


Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Joel Fernandes
On Tue, Jul 23, 2019 at 01:10:05PM +0300, Konstantin Khlebnikov wrote:
> On 23.07.2019 11:43, Konstantin Khlebnikov wrote:
> > On 23.07.2019 0:32, Joel Fernandes (Google) wrote:
> > > The page_idle tracking feature currently requires looking up the pagemap
> > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > This is quite cumbersome and can be error-prone too: if something changes
> > > with the page between accessing the per-PID pagemap and the global
> > > page_idle bitmap, the information is not accurate. Moreover, looking up
> > > the PFN from the pagemap on Android devices is not supported for
> > > unprivileged processes; it requires SYS_ADMIN and otherwise gives 0 for
> > > the PFN.
> > > 
> > > This patch adds support to directly interact with page_idle tracking at
> > > the PID level by introducing a /proc/<pid>/page_idle file. This
> > > eliminates the need for userspace to calculate the mapping of the page.
> > > It follows the exact same semantics as the global
> > > /sys/kernel/mm/page_idle; however, it is easier to use for some use cases
> > > where looking up the PFN is not needed, and it does not require SYS_ADMIN.
> > > It ended up simplifying the userspace code, solving the security issue
> > > mentioned, and working quite well. SELinux does not need to be turned off
> > > since no pagemap lookup is needed.
> > > 
> > > In Android, we are using this for the heap profiler (heapprofd), which
> > > profiles and pinpoints code paths that allocate memory and leave it
> > > idle for long periods of time.
> > > 
> > > Documentation material:
> > > The idle page tracking API for virtual address indexing using virtual page
> > > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > > except that it uses virtual instead of physical frame numbers.
> > > 
> > > This idle page tracking API can be simpler to use than physical address
> > > indexing, since the pagemap for a process does not need to be looked up
> > > to mark or read a page's idle bit. It is also more accurate than
> > > physical address indexing since in physical address indexing, address
> > > space changes can occur between reading the pagemap and reading the
> > > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > > the duration of the access.
> > 
> > Maybe integrate this into existing interface: /proc/pid/clear_refs and
> > /proc/pid/pagemap ?
> > 
> > I.e.  echo X > /proc/pid/clear_refs clears reference bits in ptes and
> > marks pages idle only for pages mapped in this process.
> > And idle bit in /proc/pid/pagemap tells that page is still idle in this 
> > process.
> > This is faster - we don't need to walk the whole rmap for that.
> 
> Moreover, this is so cheap that it could be counted and shown in smaps.
> Unlike clearing real access bits, this does not disrupt the memory reclaimer.
> A killer feature.

I replied to your patch:
https://lore.kernel.org/lkml/20190723134647.ga104...@google.com/T/#med8992e75c32d9c47f95b119d24a43ded36420bc



[PATCH V2 2/2] kernel-doc: core-api: Include string.h into core-api

2019-07-23 Thread Joe Perches
core-api should show all the various string functions including the
newly added stracpy and stracpy_pad.

Miscellanea:

o Update the Returns: value for strscpy
o fix a defect with %NUL)

Signed-off-by: Joe Perches 
---

V2: Correct return of -E2BIG descriptions

 Documentation/core-api/kernel-api.rst |  3 +++
 include/linux/string.h|  5 +++--
 lib/string.c  | 10 ++
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 08af5caf036d..f77de49b1d51 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -42,6 +42,9 @@ String Manipulation
 .. kernel-doc:: lib/string.c
:export:
 
+.. kernel-doc:: include/linux/string.h
+   :internal:
+
 .. kernel-doc:: mm/util.c
:functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user
vmemdup_user strndup_user memdup_user_nul
diff --git a/include/linux/string.h b/include/linux/string.h
index 7572cd78cf9f..3cf684db4bc6 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -519,8 +519,9 @@ static inline void memcpy_and_pad(void *dest, size_t dest_len,
  * But this can lead to bugs due to typos, or if prefix is a pointer
  * and not a constant. Instead use str_has_prefix().
  *
- * Returns: 0 if @str does not start with @prefix
- strlen(@prefix) if @str does start with @prefix
+ * Returns:
+ * * strlen(@prefix) if @str starts with @prefix
+ * * 0 if @str does not start with @prefix
  */
 static __always_inline size_t str_has_prefix(const char *str, const char *prefix)
 {
diff --git a/lib/string.c b/lib/string.c
index 461fb620f85f..f7bc10da4259 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -173,8 +173,9 @@ EXPORT_SYMBOL(strlcpy);
  * doesn't unnecessarily force the tail of the destination buffer to be
  * zeroed.  If zeroing is desired please use strscpy_pad().
  *
- * Return: The number of characters copied (not including the trailing
- * %NUL) or -E2BIG if the destination buffer wasn't big enough.
+ * Returns:
+ * * The number of characters copied (not including the trailing %NUL)
+ * * -E2BIG if count is 0 or @src was truncated.
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
@@ -253,8 +254,9 @@ EXPORT_SYMBOL(strscpy);
  * For full explanation of why you may want to consider using the
  * 'strscpy' functions please see the function docstring for strscpy().
  *
- * Return: The number of characters copied (not including the trailing
- * %NUL) or -E2BIG if the destination buffer wasn't big enough.
+ * Returns:
+ * * The number of characters copied (not including the trailing %NUL)
+ * * -E2BIG if count is 0 or @src was truncated.
  */
 ssize_t strscpy_pad(char *dest, const char *src, size_t count)
 {
-- 
2.15.0



[PATCH V2 0/2] string: Add stracpy and stracpy_pad

2019-07-23 Thread Joe Perches
Add more string copy mechanisms to help avoid defects

Joe Perches (2):
  string: Add stracpy and stracpy_pad mechanisms
  kernel-doc: core-api: Include string.h into core-api

 Documentation/core-api/kernel-api.rst |  3 +++
 include/linux/string.h                | 50 ++++++++++++++++++++++++++++++++--
 lib/string.c                          | 10 ++++++----
 3 files changed, 57 insertions(+), 6 deletions(-)

-- 
2.15.0



Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Joel Fernandes
On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> Hi Joel,
> 
> On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> 
> cumbersome: That's the fair tradeoff between idle page tracking and
> clear_refs, because idle page tracking can check a page even when it is
> not mapped.

It is a fair tradeoff, but it could be made simpler. The userspace code got
reduced by a good amount as well.

> error-prone: What's the error?

In normal Android usage, we see that pages sometimes appear not to be idle
even when they really are. Reproducing this is unpredictable and happens at
random occasions. With this new interface, we see it happen much less often.

> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> 
> Is this timing issue what you mean by error?
> Why do you need to be accurate? IOW, accuracy is always good, but what's
> the scale of the accuracy?

There is a time window between looking up pagemap and checking if page is
idle. Anyway, see below for the primary goals as you asked:

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> 
> Ah, so the primary goal is to provide a convenience interface and it would
> help accuracy, too. IOW, accuracy is not your main goal?

There are a couple of primary goals: security, convenience, and also solving
the accuracy/reliability problem we are seeing. Do keep in mind that looking
up the PFN has security implications. The PFN field in pagemap is zeroed if
the user does not have CAP_SYS_ADMIN.

> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pinpoints code paths which allocate and leave memory
> > idle for long periods of time.
> 
> So the goal is to detect idle pages with idle memory tracking?

Isn't that what idle memory tracking does?

> It couldn't work well because such idle pages could finally be swapped out
> and lose all the flags of the page descriptor, which is the working
> mechanism of idle page tracking. It should have been named "workingset
> page tracking", not "idle page tracking".

The heap profiler that uses page-idle tracking is not to measure working set,
but to look for pages that are idle for long periods of time.

Thanks for bringing up the swapping corner case. Perhaps we can improve
the heap profiler to detect this by looking at bits 0-4 in pagemap. While it
is true that we would lose access information during that window, there is a
high likelihood that the page was not accessed, which is why it was swapped
out. Thoughts?

thanks,

 - Joel



> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> > 
> > Cc: vdavydov@gmail.com
> > Cc: Brendan Gregg 
> > Cc: kernel-t...@android.com
> > Signed-off-by: Joel Fernandes (Google) 
> > 
> > ---
> > Internal review -> v1:
> > Fixes from Suren.
> > Corrections to change log, docs (Florian, Sandeep)
> > 
> >  fs/proc/base.c            |   3 +
> >  fs/proc/internal.h        |   1 +
> >  fs/proc/task_mmu.c        |  57 +++
> >  include/linux/page_idle.h |   4 +
> >  mm/page_idle.c            | 305 +-
> >  5 files changed, 330 insertions(+), 40 deletions(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 77eb628ecc7f..a58dd74606e9 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
> > REG("smaps",  S_IRUGO, proc_pid_smaps_operations),

Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches

2019-07-23 Thread Waiman Long
On 7/22/19 8:46 AM, peter enderborg wrote:
> On 7/2/19 8:37 PM, Waiman Long wrote:
>> Currently, a value of '1' is written to /sys/kernel/slab/<cache>/shrink
>> file to shrink the slab by flushing all the per-cpu slabs and free
>> slabs in partial lists. This applies only to the root caches, though.
>>
>> Extends this capability by shrinking all the child memcg caches and
>> the root cache when a value of '2' is written to the shrink sysfs file.
>>
>> On a 4-socket 112-core 224-thread x86-64 system after a parallel kernel
>> build, the amount of memory occupied by slabs before shrinking was:
>>
>>  # grep task_struct /proc/slabinfo
>>  task_struct 7114   7296   7744    4    8 : tunables    0    0    0 : slabdata   1824   1824      0
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab:1310444 kB
>>  SReclaimable: 377604 kB
>>  SUnreclaim:   932840 kB
>>
>> After shrinking slabs:
>>
>>  # grep "^S[lRU]" /proc/meminfo
>>  Slab: 695652 kB
>>  SReclaimable: 322796 kB
>>  SUnreclaim:   372856 kB
>>  # grep task_struct /proc/slabinfo
>>  task_struct 2262   2572   7744    4    8 : tunables    0    0    0 : slabdata    643    643      0
>
> What is the time between these measurement points? Should not the shrunk
> memory show up as reclaimable?

In this case, I echoed '2' to all the shrink sysfs files under
/sys/kernel/slab. The purpose of shrinking caches is to reclaim as many
unused slabs as possible from all the caches, irrespective of whether they
are counted as reclaimable or not. We do not reclaim any used objects. That
is why the numbers were reduced in both cases.

Cheers,
Longman


Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Joel Fernandes
On Tue, Jul 23, 2019 at 08:05:25AM +0200, Michal Hocko wrote:
> [Cc linux-api - please always do CC this list when introducing a user
>  visible API]

Sorry, will do.

> On Mon 22-07-19 17:32:04, Joel Fernandes (Google) wrote:
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.
> > 
> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pinpoints code paths which allocate and leave memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> 
> I didn't get to read the actual code but the overall idea makes sense to
> me. I can see this being useful for userspace memory management (along
> with remote MADV_PAGEOUT, MADV_COLD).

Thanks.

> Normally I would object that the cumbersome nature of the existing
> interface can be hidden in userspace, but I do agree that rowhammer has
> made this one close to unusable for anything but a privileged process.

Agreed, this is one of the primary motivations for the patch as you said.

> I do not think you can make any argument about accuracy because
> the information will never be accurate. Sure the race window is smaller
> in principle but you can hardly say anything about how much or whether
> at all.

Sure, fair enough. That is why I wasn't beating the drum too much on the
accuracy point. However, this surprisingly does work quite well.

thanks,

 - Joel



Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Joel Fernandes
On Mon, Jul 22, 2019 at 03:06:39PM -0700, Andrew Morton wrote:
> On Mon, 22 Jul 2019 17:32:04 -0400 "Joel Fernandes (Google)" wrote:
> 
> > The page_idle tracking feature currently requires looking up the pagemap
> > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > This is quite cumbersome and can be error-prone too. If between
> > accessing the per-PID pagemap and the global page_idle bitmap, if
> > something changes with the page then the information is not accurate.
> 
> Well, it's never going to be "accurate" - something could change one
> nanosecond after userspace has read the data...
> 
> Presumably with this approach the data will be "more" accurate.  How
> big a problem has this inaccuracy proven to be in real-world usage?

Has proven to be quite a thorn. But the security issue is the main problem..

> > More over looking up PFN from pagemap in Android devices is not
> > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > the PFN.

..as mentioned here.

I should have emphasized on the security issue more, will do so in the next
revision.

> > This patch adds support to directly interact with page_idle tracking at
> > the PID level by introducing a /proc/<pid>/page_idle file. This
> > eliminates the need for userspace to calculate the mapping of the page.
> > It follows the exact same semantics as the global
> > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > It ended up simplifying userspace code, solving the security issue
> > mentioned and works quite well. SELinux does not need to be turned off
> > since no pagemap look up is needed.
> > 
> > In Android, we are using this for the heap profiler (heapprofd) which
> > profiles and pinpoints code paths which allocate and leave memory
> > idle for long periods of time.
> > 
> > Documentation material:
> > The idle page tracking API for virtual address indexing using virtual page
> > frame numbers (VFN) is located at /proc/<pid>/page_idle. It is a bitmap
> > that follows the same semantics as /sys/kernel/mm/page_idle/bitmap
> > except that it uses virtual instead of physical frame numbers.
> > 
> > This idle page tracking API can be simpler to use than physical address
> > indexing, since the pagemap for a process does not need to be looked up
> > to mark or read a page's idle bit. It is also more accurate than
> > physical address indexing since in physical address indexing, address
> > space changes can occur between reading the pagemap and reading the
> > bitmap. In virtual address indexing, the process's mmap_sem is held for
> > the duration of the access.
> > 
> > ...
> >
> > --- a/mm/page_idle.c
> > +++ b/mm/page_idle.c
> > @@ -11,6 +11,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #define BITMAP_CHUNK_SIZE  sizeof(u64)
> >  #define BITMAP_CHUNK_BITS  (BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
> > @@ -28,15 +29,12 @@
> >   *
> >   * This function tries to get a user memory page by pfn as described above.
> >   */
> 
> Above comment needs updating or moving?
> 
> > -static struct page *page_idle_get_page(unsigned long pfn)
> > +static struct page *page_idle_get_page(struct page *page_in)
> >  {
> > struct page *page;
> > pg_data_t *pgdat;
> >  
> > -   if (!pfn_valid(pfn))
> > -   return NULL;
> > -
> > -   page = pfn_to_page(pfn);
> > +   page = page_in;
> > if (!page || !PageLRU(page) ||
> > !get_page_unless_zero(page))
> > return NULL;
> >
> > ...
> >
> > +static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
> > +   unsigned long *start, unsigned long *end)
> > +{
> > +   unsigned long max_frame;
> > +
> > +   /* If an mm is not given, assume we want physical frames */
> > +   max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
> > +
> > +   if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
> > +   return -EINVAL;
> > +
> > +   *start = pos * BITS_PER_BYTE;
> > +   if (*start >= max_frame)
> > +   return -ENXIO;
> 
> Is said to mean "The system tried to use the device represented by a
> file you specified, and it couldn't find the device.  This can mean that
> the device file was installed incorrectly, or that the physical device
> is missing or not correctly attached to the computer."
> 
> This doesn't seem appropriate in this usage and is hence possibly
> misleading.  Someone whose application fails with ENXIO will be
> scratching their heads.

This actually keeps it consistent with the current code. I refactored that
code a bit and I'm reusing parts of it to keep the line count down. See
page_idle_bitmap_write in current upstream, where it returns -ENXIO.

However note that I am actually returning 0 if page_idle_bitmap_write()
returns -ENXIO:

+   ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+   if (ret == -ENXIO)
+ 

Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table

2019-07-23 Thread Sheriff Esseson
On Tue, Jul 23, 2019 at 07:42:18AM -0600, Jonathan Corbet wrote:
> On Tue, 23 Jul 2019 12:48:13 +0100
> Sheriff Esseson  wrote:
> 
> > the "Removed Sysctls" section is a table - bring it alive with ReST.
> > 
> > Signed-off-by: Sheriff Esseson 
> 
> So this appears to be identical to the patch you sent three days ago; is
> there a reason why you are sending it again now?
> 
> Thanks,
> 
> jon

Sorry, I thought the patch went unnoticed during the merge window - I could
not find a response.


[PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

2019-07-23 Thread Dave Chiluk
It has been observed, that highly-threaded, non-cpu-bound applications
running under cpu.cfs_quota_us constraints can hit a high percentage of
periods throttled while simultaneously not consuming the allocated
amount of quota. This use case is typical of user-interactive non-cpu
bound applications, such as those running in kubernetes or mesos when
run on multiple cpu cores.

This has been root caused to cpu-local run queue being allocated per cpu
bandwidth slices, and then not fully using that slice within the period.
At which point the slice and quota expires. This expiration of unused
slice results in applications not being able to utilize the quota for
which they are allocated.

The non-expiration of per-cpu slices was recently fixed by
'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
condition")'. Prior to that it appears that this had been broken since
at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
added the following conditional which resulted in slices never being
expired.

if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
/* extend local deadline, drift is bounded above by 2 ticks */
cfs_rq->runtime_expires += TICK_NSEC;

Because this was broken for nearly 5 years, and has recently been fixed
and is now being noticed by many users running kubernetes
(https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
that the mechanisms around expiring runtime should be removed
altogether.

This allows quota already allocated to per-cpu run-queues to live longer
than the period boundary. This allows threads on runqueues that do not
use much CPU to continue to use their remaining slice over a longer
period of time than cpu.cfs_period_us. However, this helps prevent the
above condition of hitting throttling while also not fully utilizing
your cpu quota.

This theoretically allows a machine to use slightly more than its
allotted quota in some periods. This overflow would be bounded by the
remaining quota left on each per-cpu runqueue. This is typically no
more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
change nothing, as they should theoretically fully utilize all of their
quota in each period. For user-interactive tasks as described above this
provides a much better user/application experience as their cpu
utilization will more closely match the amount they requested when they
hit throttling. This means that cpu limits no longer strictly apply per
period for non-cpu bound applications, but that they are still accurate
over longer timeframes.

This greatly improves performance of high-thread-count, non-cpu bound
applications with low cfs_quota_us allocation on high-core-count
machines. In the case of an artificial testcase (10ms/100ms of quota on
80 CPU machine), this commit resulted in almost 30x performance
improvement, while still maintaining correct cpu quota restrictions.
That testcase is available at https://github.com/indeedeng/fibtest.

Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk 
Reviewed-by: Ben Segall 
---
 Documentation/scheduler/sched-bwc.rst | 74 ---
 kernel/sched/fair.c                   | 72 --
 kernel/sched/sched.h                  |  4 --
 3 files changed, 67 insertions(+), 83 deletions(-)

diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
index 3a90642..9801d6b 100644
--- a/Documentation/scheduler/sched-bwc.rst
+++ b/Documentation/scheduler/sched-bwc.rst
@@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension 
which allows the
 specification of the maximum CPU bandwidth available to a group or hierarchy.
 
 The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed with quota units
-above at each period boundary.  As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis.  The amount transferred
+each given "period" (microseconds), a task group is allocated up to "quota"
+microseconds of CPU time. That quota is assigned to per-cpu run queues in
+slices as threads in the cgroup become runnable. Once all quota has been
+assigned any additional requests for quota will result in those threads being
+throttled. Throttled threads will not be able to run again until the next
+period when the quota is replenished.
+
+A group's unassigned quota is globally tracked, being refreshed back to
+cfs_quota units at each period

[PATCH v6 0/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

2019-07-23 Thread Dave Chiluk
Changelog v6
- Added back missing call to lsub_positive(&cfs_b->runtime, runtime);
- Added Reviewed-by: Ben Segall 
- Fix some grammar in the Documentation, and change some wording.
- Updated documentation due to the .rst change

Changelog v5
- Based on this comment from Ben Segall on v4:
> If the cost of taking this global lock across all cpus without a
> ratelimit was somehow not a problem, I'd much prefer to just set
> min_cfs_rq_runtime = 0. (Assuming it is, I definitely prefer the "lie
> and sorta have 2x period 2x runtime" solution of removing expiration)
I'm resubmitting my v3 patchset, with the requested changes.
- Updated Commit log given review comments
- Update sched-bwc.txt given my new understanding of the slack timer.

Changelog v4
- Rewrote patchset around the concept of returning all of runtime_remaining
when cfs_b nears the end of available quota.

Changelog v3
- Reworked documentation to better describe behavior of slice expiration per
feedback from Peter Oskolkov

Changelog v2
- Fixed some checkpatch errors in the commit message.


Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

2019-07-23 Thread Phil Auld
Hi Dave,

On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
> 
> This has been root caused to cpu-local run queue being allocated per cpu
> bandwidth slices, and then not fully using that slice within the period.
> At which point the slice and quota expires. This expiration of unused
> slice results in applications not being able to utilize the quota for
> which they are allocated.
> 
> The non-expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this had been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
> 
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>   /* extend local deadline, drift is bounded above by 2 ticks */
>   cfs_rq->runtime_expires += TICK_NSEC;
> 
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
> 
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevent the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
> 
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueue. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
> 
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
> 
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk 
> Reviewed-by: Ben Segall 

This still works for me. The documentation reads pretty well, too. Good job.

Feel free to add my Acked-by: or Reviewed-by: Phil Auld .

I'll run it through some more tests when I have time. The code is the same
as the earlier one I tested from what I can see.

Cheers,
Phil

> ---
>  Documentation/scheduler/sched-bwc.rst | 74 ---
>  kernel/sched/fair.c                   | 72 --
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 67 insertions(+), 83 deletions(-)
> 
> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> index 3a90642..9801d6b 100644
> --- a/Documentation/scheduler/sched-bwc.rst
> +++ b/Documentation/scheduler/sched-bwc.rst
> @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
>  specification of the maximum CPU bandwidth available to a group or hierarchy.
>  
>  The bandwidth allowed for a group is specified using a quota and period. Within
> -each given "period" (microseconds), a group is allowed to consume only up to
> -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
> -group exceeds this limit (for that period), the tasks belonging to its
> -hierarchy will be throttled and are not allowed to run again until the next
> -period.
> -
> -A group's unused runtime is globally tracked, being refreshed with quota units
> -above at each period boundary.  As threads consume this bandwidth it is
> -transferred to cpu-local "silos" on a demand basis.  The amount transferred
> +

Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings

2019-07-23 Thread Rob Herring
On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan  wrote:
>
> Add device-links after the devices are created (but before they are
> probed) by looking at common DT bindings like clocks and
> interconnects.

The structure now looks a lot better to me. A few minor things below.

>
> Automatically adding device-links for functional dependencies at the
> framework level provides the following benefits:
>
> - Optimizes device probe order and avoids the useless work of
>   attempting probes of devices that will not probe successfully
>   (because their suppliers aren't present or haven't probed yet).
>
>   For example, in a commonly available mobile SoC, registering just
>   one consumer device's driver at an initcall level earlier than the
>   supplier device's driver causes 11 failed probe attempts before the
>   consumer device probes successfully. This was with a kernel with all
>   the drivers statically compiled in. This problem gets a lot worse if
>   all the drivers are loaded as modules without direct symbol
>   dependencies.
>
> - Supplier devices like clock providers, interconnect providers, etc
>   need to keep the resources they provide active and at a particular
>   state(s) during boot up even if their current set of consumers don't
>   request the resource to be active. This is because the rest of the
>   consumers might not have probed yet and turning off the resource
>   before all the consumers have probed could lead to a hang or
>   undesired user experience.
>
>   Some frameworks (Eg: regulator) handle this today by turning off
>   "unused" resources at late_initcall_sync and hoping all the devices
>   have probed by then. This is not a valid assumption for systems with
>   loadable modules. Other frameworks (Eg: clock) just don't handle
>   this due to the lack of a clear signal for when they can turn off
>   resources. This leads to downstream hacks to handle cases like this
>   that can easily be solved in the upstream kernel.
>
>   By linking devices before they are probed, we give suppliers a clear
>   count of the number of dependent consumers. Once all of the
>   consumers are active, the suppliers can turn off the unused
>   resources without making assumptions about the number of consumers.
>
> By default we just add device-links to track "driver presence" (probe
> succeeded) of the supplier device. If any other functionality provided
> by device-links are needed, it is left to the consumer/supplier
> devices to change the link when they probe.
>
> Signed-off-by: Saravana Kannan 
> ---
>  .../admin-guide/kernel-parameters.txt |   5 +
>  drivers/of/platform.c | 158 ++
>  2 files changed, 163 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 138f6664b2e2..109b4310844f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3141,6 +3141,11 @@
> This can be set from sysctl after boot.
> See Documentation/sysctl/vm.txt for details.
>
> +   of_devlink  [KNL] Make device links from common DT bindings. Useful
> +   for optimizing probe order and making sure resources
> +   aren't turned off before the consumer devices have
> +   probed.
> +
> ohci1394_dma=early  [HW] enable debugging via the ohci1394 driver.
> See Documentation/debugging-via-ohci1394.txt for more
> info.
> diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> index 04ad312fd85b..88a2086e26fa 100644
> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -509,6 +509,163 @@ int of_platform_default_populate(struct device_node 
> *root,
>  }
>  EXPORT_SYMBOL_GPL(of_platform_default_populate);
>
> +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> +{
> +   of_node_get(sup);
> +   /*
> +* Don't allow linking a device node as a consumer of one of its
> +* descendant nodes. By definition, a child node can't be a functional
> +* dependency for the parent node.
> +*/
> +   while (sup) {
> +   if (sup == con) {
> +   of_node_put(sup);
> +   return false;
> +   }
> +   sup = of_get_next_parent(sup);
> +   }
> +   return true;
> +}
> +
> +static int of_link_to_phandle(struct device *dev, struct device_node *sup_np)
> +{
> +   struct platform_device *sup_dev;
> +   u32 dl_flags = DL_FLAG_AUTOPROBE_CONSUMER;
> +   int ret = 0;
> +
> +   /*
> +* Since we are trying to create device links, we need to find
> +* the actual device node that owns this supplier phandle.
> +* Often times it's the same node, but sometimes it can be one
> +* of the parents. S

Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings

2019-07-23 Thread Saravana Kannan
On Tue, Jul 23, 2019 at 11:06 AM Rob Herring  wrote:
>
> On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan  wrote:
> >
> > Add device-links after the devices are created (but before they are
> > probed) by looking at common DT bindings like clocks and
> > interconnects.
>
> The structure now looks a lot better to me. A few minor things below.

Thanks.

> >
> > Automatically adding device-links for functional dependencies at the
> > framework level provides the following benefits:
> >
> > - Optimizes device probe order and avoids the useless work of
> >   attempting probes of devices that will not probe successfully
> >   (because their suppliers aren't present or haven't probed yet).
> >
> >   For example, in a commonly available mobile SoC, registering just
> >   one consumer device's driver at an initcall level earlier than the
> >   supplier device's driver causes 11 failed probe attempts before the
> >   consumer device probes successfully. This was with a kernel with all
> >   the drivers statically compiled in. This problem gets a lot worse if
> >   all the drivers are loaded as modules without direct symbol
> >   dependencies.
> >
> > - Supplier devices like clock providers, interconnect providers, etc
> >   need to keep the resources they provide active and at a particular
> >   state(s) during boot up even if their current set of consumers don't
> >   request the resource to be active. This is because the rest of the
> >   consumers might not have probed yet and turning off the resource
> >   before all the consumers have probed could lead to a hang or
> >   undesired user experience.
> >
> >   Some frameworks (Eg: regulator) handle this today by turning off
> >   "unused" resources at late_initcall_sync and hoping all the devices
> >   have probed by then. This is not a valid assumption for systems with
> >   loadable modules. Other frameworks (Eg: clock) just don't handle
> >   this due to the lack of a clear signal for when they can turn off
> >   resources. This leads to downstream hacks to handle cases like this
> >   that can easily be solved in the upstream kernel.
> >
> >   By linking devices before they are probed, we give suppliers a clear
> >   count of the number of dependent consumers. Once all of the
> >   consumers are active, the suppliers can turn off the unused
> >   resources without making assumptions about the number of consumers.
> >
> > By default we just add device-links to track "driver presence" (probe
> > succeeded) of the supplier device. If any other functionality provided
> > by device-links are needed, it is left to the consumer/supplier
> > devices to change the link when they probe.
> >
> > Signed-off-by: Saravana Kannan 
> > ---
> >  .../admin-guide/kernel-parameters.txt |   5 +
> >  drivers/of/platform.c | 158 ++
> >  2 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > b/Documentation/admin-guide/kernel-parameters.txt
> > index 138f6664b2e2..109b4310844f 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3141,6 +3141,11 @@
> > This can be set from sysctl after boot.
> > See Documentation/sysctl/vm.txt for details.
> >
> > +   of_devlink  [KNL] Make device links from common DT bindings. 
> > Useful
> > +   for optimizing probe order and making sure resources
> > +   aren't turned off before the consumer devices have
> > +   probed.
> > +
> > ohci1394_dma=early  [HW] enable debugging via the ohci1394 
> > driver.
> > See Documentation/debugging-via-ohci1394.txt for 
> > more
> > info.
> > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > index 04ad312fd85b..88a2086e26fa 100644
> > --- a/drivers/of/platform.c
> > +++ b/drivers/of/platform.c
> > @@ -509,6 +509,163 @@ int of_platform_default_populate(struct device_node 
> > *root,
> >  }
> >  EXPORT_SYMBOL_GPL(of_platform_default_populate);
> >
> > +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> > +{
> > +   of_node_get(sup);
> > +   /*
> > +* Don't allow linking a device node as a consumer of one of its
> > +* descendant nodes. By definition, a child node can't be a functional
> > +* dependency for the parent node.
> > +*/
> > +   while (sup) {
> > +   if (sup == con) {
> > +   of_node_put(sup);
> > +   return false;
> > +   }
> > +   sup = of_get_next_parent(sup);
> > +   }
> > +   return true;
> > +}
> > +
> > +static int of_link_to_phandle(struct device *dev, struct device_node 
> > *sup_np)
> > +{
> > +   struct platform_device *sup_dev;
> > +   u32 dl_flags = DL_FLAG_AUTOPROBE_CONS

Re: [PATCH 2/2] kernel-doc: core-api: Include string.h into core-api

2019-07-23 Thread Kees Cook
On Mon, Jul 22, 2019 at 05:38:16PM -0700, Joe Perches wrote:
> core-api should show all the various string functions including the
> newly added stracpy and stracpy_pad.
> 
> Miscellanea:
> 
> o Update the Returns: value for strscpy
> o fix a defect with %NUL)
> 
> Signed-off-by: Joe Perches 

Reviewed-by: Kees Cook 

-Kees

> ---
>  Documentation/core-api/kernel-api.rst |  3 +++
>  include/linux/string.h|  5 +++--
>  lib/string.c  | 10 ++
>  3 files changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/core-api/kernel-api.rst 
> b/Documentation/core-api/kernel-api.rst
> index 08af5caf036d..f77de49b1d51 100644
> --- a/Documentation/core-api/kernel-api.rst
> +++ b/Documentation/core-api/kernel-api.rst
> @@ -42,6 +42,9 @@ String Manipulation
>  .. kernel-doc:: lib/string.c
> :export:
>  
> +.. kernel-doc:: include/linux/string.h
> +   :internal:
> +
>  .. kernel-doc:: mm/util.c
> :functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user
> vmemdup_user strndup_user memdup_user_nul
> diff --git a/include/linux/string.h b/include/linux/string.h
> index f80b0973f0e5..329188fffc11 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -515,8 +515,9 @@ static inline void memcpy_and_pad(void *dest, size_t 
> dest_len,
>   * But this can lead to bugs due to typos, or if prefix is a pointer
>   * and not a constant. Instead use str_has_prefix().
>   *
> - * Returns: 0 if @str does not start with @prefix
> - strlen(@prefix) if @str does start with @prefix
> + * Returns:
> + * * strlen(@prefix) if @str starts with @prefix
> + * * 0 if @str does not start with @prefix
>   */
>  static __always_inline size_t str_has_prefix(const char *str, const char 
> *prefix)
>  {
> diff --git a/lib/string.c b/lib/string.c
> index 461fb620f85f..53582b6dce2a 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -173,8 +173,9 @@ EXPORT_SYMBOL(strlcpy);
>   * doesn't unnecessarily force the tail of the destination buffer to be
>   * zeroed.  If zeroing is desired please use strscpy_pad().
>   *
> - * Return: The number of characters copied (not including the trailing
> - * %NUL) or -E2BIG if the destination buffer wasn't big enough.
> + * Returns:
> + * * The number of characters copied (not including the trailing %NUL)
> + * * -E2BIG if count is 0.
>   */
>  ssize_t strscpy(char *dest, const char *src, size_t count)
>  {
> @@ -253,8 +254,9 @@ EXPORT_SYMBOL(strscpy);
>   * For full explanation of why you may want to consider using the
>   * 'strscpy' functions please see the function docstring for strscpy().
>   *
> - * Return: The number of characters copied (not including the trailing
> - * %NUL) or -E2BIG if the destination buffer wasn't big enough.
> + * Returns:
> + * * The number of characters copied (not including the trailing %NUL)
> + * * -E2BIG if count is 0.
>   */
>  ssize_t strscpy_pad(char *dest, const char *src, size_t count)
>  {
> -- 
> 2.15.0
> 

-- 
Kees Cook


Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

2019-07-23 Thread Dave Chiluk
Thanks for all the help and testing you provided.  It's good to know
these changes have passed at least some scheduler regression tests.
If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let
Peter add it.

Will you be handling the backport into the RHEL 8 kernels?  I'll
submit this to Ubuntu and linux-stable once it gets accepted.

Thanks again,


On Tue, Jul 23, 2019 at 12:13 PM Phil Auld  wrote:
>
> Hi Dave,
>
> On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > It has been observed, that highly-threaded, non-cpu-bound applications
> > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > periods throttled while simultaneously not consuming the allocated
> > amount of quota. This use case is typical of user-interactive non-cpu
> > bound applications, such as those running in kubernetes or mesos when
> > run on multiple cpu cores.
> >
> > This has been root caused to cpu-local run queue being allocated per cpu
> > bandwidth slices, and then not fully using that slice within the period.
> > At which point the slice and quota expires. This expiration of unused
> > slice results in applications not being able to utilize the quota for
> > which they are allocated.
> >
> > The non-expiration of per-cpu slices was recently fixed by
> > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > condition")'. Prior to that it appears that this had been broken since
> > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > added the following conditional which resulted in slices never being
> > expired.
> >
> > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> >   /* extend local deadline, drift is bounded above by 2 ticks */
> >   cfs_rq->runtime_expires += TICK_NSEC;
> >
> > Because this was broken for nearly 5 years, and has recently been fixed
> > and is now being noticed by many users running kubernetes
> > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > that the mechanisms around expiring runtime should be removed
> > altogether.
> >
> > This allows quota already allocated to per-cpu run-queues to live longer
> > than the period boundary. This allows threads on runqueues that do not
> > use much CPU to continue to use their remaining slice over a longer
> > period of time than cpu.cfs_period_us. However, this helps prevent the
> > above condition of hitting throttling while also not fully utilizing
> > your cpu quota.
> >
> > This theoretically allows a machine to use slightly more than its
> > allotted quota in some periods. This overflow would be bounded by the
> > remaining quota left on each per-cpu runqueue. This is typically no
> > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > change nothing, as they should theoretically fully utilize all of their
> > quota in each period. For user-interactive tasks as described above this
> > provides a much better user/application experience as their cpu
> > utilization will more closely match the amount they requested when they
> > hit throttling. This means that cpu limits no longer strictly apply per
> > period for non-cpu bound applications, but that they are still accurate
> > over longer timeframes.
> >
> > This greatly improves performance of high-thread-count, non-cpu bound
> > applications with low cfs_quota_us allocation on high-core-count
> > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > 80 CPU machine), this commit resulted in almost 30x performance
> > improvement, while still maintaining correct cpu quota restrictions.
> > That testcase is available at https://github.com/indeedeng/fibtest.
> >
> > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift 
> > condition")
> > Signed-off-by: Dave Chiluk 
> > Reviewed-by: Ben Segall 
>
> This still works for me. The documentation reads pretty well, too. Good job.
>
> Feel free to add my Acked-by: or Reviewed-by: Phil Auld .
>
> I'll run it through some more tests when I have time. The code is the same
> as the earlier one I tested from what I can see.
>
> Cheers,
> Phil
>
> > ---
> >  Documentation/scheduler/sched-bwc.rst | 74 
> > ---
> >  kernel/sched/fair.c   | 72 
> > --
> >  kernel/sched/sched.h  |  4 --
> >  3 files changed, 67 insertions(+), 83 deletions(-)
> >
> > diff --git a/Documentation/scheduler/sched-bwc.rst 
> > b/Documentation/scheduler/sched-bwc.rst
> > index 3a90642..9801d6b 100644
> > --- a/Documentation/scheduler/sched-bwc.rst
> > +++ b/Documentation/scheduler/sched-bwc.rst
> > @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED 
> > extension which allows the
> >  specification of the maximum CPU bandwidth available to a group or 
> > hierarchy.
> >
> >  The bandwidth allowed for a group is specified using

Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings

2019-07-23 Thread Rob Herring
On Tue, Jul 23, 2019 at 2:49 PM Saravana Kannan  wrote:
>
> On Tue, Jul 23, 2019 at 11:06 AM Rob Herring  wrote:
> >
> > On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan  
> > wrote:
> > >
> > > Add device-links after the devices are created (but before they are
> > > probed) by looking at common DT bindings like clocks and
> > > interconnects.
> >
> > The structure now looks a lot better to me. A few minor things below.
>
> Thanks.
>
> > >
> > > Automatically adding device-links for functional dependencies at the
> > > framework level provides the following benefits:
> > >
> > > - Optimizes device probe order and avoids the useless work of
> > >   attempting probes of devices that will not probe successfully
> > >   (because their suppliers aren't present or haven't probed yet).
> > >
> > >   For example, in a commonly available mobile SoC, registering just
> > >   one consumer device's driver at an initcall level earlier than the
> > >   supplier device's driver causes 11 failed probe attempts before the
> > >   consumer device probes successfully. This was with a kernel with all
> > >   the drivers statically compiled in. This problem gets a lot worse if
> > >   all the drivers are loaded as modules without direct symbol
> > >   dependencies.
> > >
> > > - Supplier devices like clock providers, interconnect providers, etc
> > >   need to keep the resources they provide active and at a particular
> > >   state(s) during boot up even if their current set of consumers don't
> > >   request the resource to be active. This is because the rest of the
> > >   consumers might not have probed yet and turning off the resource
> > >   before all the consumers have probed could lead to a hang or
> > >   undesired user experience.
> > >
> > >   Some frameworks (Eg: regulator) handle this today by turning off
> > >   "unused" resources at late_initcall_sync and hoping all the devices
> > >   have probed by then. This is not a valid assumption for systems with
> > >   loadable modules. Other frameworks (Eg: clock) just don't handle
> > >   this due to the lack of a clear signal for when they can turn off
> > >   resources. This leads to downstream hacks to handle cases like this
> > >   that can easily be solved in the upstream kernel.
> > >
> > >   By linking devices before they are probed, we give suppliers a clear
> > >   count of the number of dependent consumers. Once all of the
> > >   consumers are active, the suppliers can turn off the unused
> > >   resources without making assumptions about the number of consumers.
> > >
> > > By default we just add device-links to track "driver presence" (probe
> > > succeeded) of the supplier device. If any other functionality provided
> > > by device-links are needed, it is left to the consumer/supplier
> > > devices to change the link when they probe.
> > >
> > > Signed-off-by: Saravana Kannan 
> > > ---
> > >  .../admin-guide/kernel-parameters.txt |   5 +
> > >  drivers/of/platform.c | 158 ++
> > >  2 files changed, 163 insertions(+)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > > b/Documentation/admin-guide/kernel-parameters.txt
> > > index 138f6664b2e2..109b4310844f 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -3141,6 +3141,11 @@
> > > This can be set from sysctl after boot.
> > > See Documentation/sysctl/vm.txt for details.
> > >
> > > +   of_devlink  [KNL] Make device links from common DT bindings. Useful
> > > +   for optimizing probe order and making sure resources
> > > +   aren't turned off before the consumer devices have
> > > +   probed.
> > > +
> > > ohci1394_dma=early  [HW] enable debugging via the ohci1394 
> > > driver.
> > > See Documentation/debugging-via-ohci1394.txt for 
> > > more
> > > info.
> > > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > > index 04ad312fd85b..88a2086e26fa 100644
> > > --- a/drivers/of/platform.c
> > > +++ b/drivers/of/platform.c
> > > @@ -509,6 +509,163 @@ int of_platform_default_populate(struct device_node 
> > > *root,
> > >  }
> > >  EXPORT_SYMBOL_GPL(of_platform_default_populate);
> > >
> > > +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> > > +{
> > > +   of_node_get(sup);
> > > +   /*
> > > +* Don't allow linking a device node as a consumer of one of its
> > > +* descendant nodes. By definition, a child node can't be a functional
> > > +* dependency for the parent node.
> > > +*/
> > > +   while (sup) {
> > > +   if (sup == con) {
> > > +   of_node_put(sup);
> > > +   return false;
> > > +   }
> > > +   

Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table

2019-07-23 Thread Dave Chinner
On Tue, Jul 23, 2019 at 03:52:01PM +0100, Sheriff Esseson wrote:
> On Tue, Jul 23, 2019 at 07:42:18AM -0600, Jonathan Corbet wrote:
> > On Tue, 23 Jul 2019 12:48:13 +0100
> > Sheriff Esseson  wrote:
> > 
> > > the "Removed Sysctls" section is a table - bring it alive with ReST.
> > > 
> > > Signed-off-by: Sheriff Esseson 
> > 
> > So this appears to be identical to the patch you sent three days ago; is
> > there a reason why you are sending it again now?
> > 
> > Thanks,
> > 
> > jon
> 
Sorry, I thought the patch went unnoticed during the merge window - I could
not find a response.

The correct thing to do in that case is to reply to the original
patch and ask if it has been looked at. The usual way of doing this
is quoting the commit message and replying with a "Ping?" comment
to bump it back to the top of everyone's mail stacks.

But, again, 3 days is not a long time, people tend to be extremely
busy and might take a few days to get to reviewing non-critical
changes, and people may not even review patches during the merge
window. I'd suggest waiting a week before pinging a patch you've
sent if there's been no response.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] Documentation/security-bugs: provide more information about linux-distros

2019-07-23 Thread Kees Cook
On Fri, Jul 19, 2019 at 10:42:15AM +0200, Solar Designer wrote:
> - The reporter having been directed to post from elsewhere (and I
> suspect this documentation file) without being aware of list policy.

Perhaps specify "linux-distros@" without a domain, so it's more clear?
Or re-split the Wiki into two pages to avoid confusion?

> - The reporter not mentioning (and sometimes not replying even when
> asked) whether they're also coordinating with security@k.o or whether
> they want someone on linux-distros to help coordinate with security@k.o.
> (Maybe this is something we want to write about here.)

Yeah, that seems useful to include in both places.

> - The Linux kernel bug having been introduced too recently to be of much
> interest to distros.

Right; that'd be good to add as well. I see a lot of panic on twitter,
for example, about bugs that only ever existed in -rc releases.

> > Sending to the distros@ list risks exposing Linux-only flaws to non-Linux
> > distros.
> 
> Right.
> 
> > This has caused leaks in the past
> 
> Do you mean leaks to *BSD security teams or to the public?  I'm not
> aware of past leaks to the public via the non-Linux distros present on
> the distros@ list.  Are you?

I don't know the origin of the leaks, but it only happened when distros@
was used instead of linux-distros@. I think this happened with DirtyCOW,
specifically.

-- 
Kees Cook


[PATCH v4 1/3] fs: Reserve flag for casefolding

2019-07-23 Thread Daniel Rosenberg
In preparation for including the casefold feature within f2fs, elevate
the EXT4_CASEFOLD_FL flag to FS_CASEFOLD_FL.

Signed-off-by: Daniel Rosenberg 
---
 include/uapi/linux/fs.h   | 1 +
 tools/include/uapi/linux/fs.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 59c71fa8c553a..2a616aa3f6862 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -311,6 +311,7 @@ struct fscrypt_key {
 #define FS_NOCOW_FL		0x00800000 /* Do not cow file */
 #define FS_INLINE_DATA_FL	0x10000000 /* Reserved for ext4 */
 #define FS_PROJINHERIT_FL	0x20000000 /* Create with parents projid */
+#define FS_CASEFOLD_FL		0x40000000 /* Folder is case insensitive */
 #define FS_RESERVED_FL		0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE	0x0003DFFF /* User visible flags */
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index 59c71fa8c553a..2a616aa3f6862 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -311,6 +311,7 @@ struct fscrypt_key {
 #define FS_NOCOW_FL		0x00800000 /* Do not cow file */
 #define FS_INLINE_DATA_FL	0x10000000 /* Reserved for ext4 */
 #define FS_PROJINHERIT_FL	0x20000000 /* Create with parents projid */
+#define FS_CASEFOLD_FL		0x40000000 /* Folder is case insensitive */
 #define FS_RESERVED_FL		0x80000000 /* reserved for ext2 lib */
 
 #define FS_FL_USER_VISIBLE	0x0003DFFF /* User visible flags */
-- 
2.22.0.657.g960e92d24f-goog



[PATCH v4 0/3] Casefolding in F2FS

2019-07-23 Thread Daniel Rosenberg
These patches are largely based on the casefolding patches for ext4

v4: Added FS_CASEFOLD_FL flag, added documentation that escaped the last
format-patch, moved setting dentry ops to f2fs_setup_casefold
v3: Addressed feedback, apart from F2FS_CASEFOLD_FL/FS_CASEFOLD_FL
Added sysfs file "encoding" to see the encoding set on a filesystem
v2: Rebased patches against master, changed f2fs_msg to f2fs_info/f2fs_err

Daniel Rosenberg (3):
  fs: Reserve flag for casefolding
  f2fs: include charset encoding information in the superblock
  f2fs: Support case-insensitive file name lookups

 Documentation/ABI/testing/sysfs-fs-f2fs |   7 ++
 Documentation/filesystems/f2fs.txt  |   3 +
 fs/f2fs/dir.c   | 126 ++--
 fs/f2fs/f2fs.h  |  21 +++-
 fs/f2fs/file.c  |  16 ++-
 fs/f2fs/hash.c  |  35 ++-
 fs/f2fs/inline.c|   4 +-
 fs/f2fs/inode.c |   4 +-
 fs/f2fs/namei.c |  21 
 fs/f2fs/super.c |  96 ++
 fs/f2fs/sysfs.c |  23 +
 include/linux/f2fs_fs.h |   9 +-
 include/uapi/linux/fs.h |   1 +
 tools/include/uapi/linux/fs.h   |   1 +
 14 files changed, 347 insertions(+), 20 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog



[PATCH v4 2/3] f2fs: include charset encoding information in the superblock

2019-07-23 Thread Daniel Rosenberg
Add charset encoding to f2fs to support casefolding. It is modeled after
the same feature introduced in commit c83ad55eaa91 ("ext4: include charset
encoding information in the superblock")

Currently this is not compatible with encryption, similar to the current
ext4 implementation. This will change in the future.

From the ext4 patch:
"""
The s_encoding field stores a magic number indicating the encoding
format and version used globally by file and directory names in the
filesystem.  The s_encoding_flags defines policies for using the charset
encoding, like how to handle invalid sequences.  The magic number is
mapped to the exact charset table, but the mapping is specific to ext4.
Since we don't have any commitment to support old encodings, the only
encoding I am supporting right now is utf8-12.1.0.

The current implementation prevents the user from enabling encoding and
per-directory encryption on the same filesystem at the same time.  The
incompatibility between these features lies in how we do efficient
directory searches when we cannot be sure the encryption of the user
provided fname will match the actual hash stored in the disk without
decrypting every directory entry, because of normalization cases.  My
quickest solution is to simply block the concurrent use of these
features for now, and enable it later, once we have a better solution.
"""

Signed-off-by: Daniel Rosenberg 
---
 Documentation/ABI/testing/sysfs-fs-f2fs |  7 ++
 Documentation/filesystems/f2fs.txt  |  3 +
 fs/f2fs/f2fs.h  |  6 ++
 fs/f2fs/super.c | 95 +
 fs/f2fs/sysfs.c | 23 ++
 include/linux/f2fs_fs.h |  9 ++-
 6 files changed, 142 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs 
b/Documentation/ABI/testing/sysfs-fs-f2fs
index dca326e0ee3e1..7ab2b1b5e255f 100644
--- a/Documentation/ABI/testing/sysfs-fs-f2fs
+++ b/Documentation/ABI/testing/sysfs-fs-f2fs
@@ -251,3 +251,10 @@ Description:
 		If checkpoint=disable, it displays the number of blocks that are unusable.
 		If checkpoint=enable it displays the number of blocks that would be unusable
 		if checkpoint=disable were to be set.
+
+What:		/sys/fs/f2fs//encoding
+Date:		July 2019
+Contact:	"Daniel Rosenberg" 
+Description:
+		Displays name and version of the encoding set for the filesystem.
+		If no encoding is set, displays (none)
diff --git a/Documentation/filesystems/f2fs.txt 
b/Documentation/filesystems/f2fs.txt
index 496fa28b24925..5fa38ab373cab 100644
--- a/Documentation/filesystems/f2fs.txt
+++ b/Documentation/filesystems/f2fs.txt
@@ -413,6 +413,9 @@ Files in /sys/fs/f2fs/
   that would be unusable if checkpoint=disable were
   to be set.
 
+encoding This shows the encoding used for casefolding.
+  If casefolding is not enabled, returns (none)
+
 

 USAGE
 

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 17382da7f0bd9..c6c7904572d0d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -153,6 +153,7 @@ struct f2fs_mount_info {
 #define F2FS_FEATURE_LOST_FOUND0x0200
 #define F2FS_FEATURE_VERITY0x0400  /* reserved */
 #define F2FS_FEATURE_SB_CHKSUM 0x0800
+#define F2FS_FEATURE_CASEFOLD  0x1000
 
 #define __F2FS_HAS_FEATURE(raw_super, mask)\
((raw_super->feature & cpu_to_le32(mask)) != 0)
@@ -1169,6 +1170,10 @@ struct f2fs_sb_info {
int valid_super_block;  /* valid super block no */
unsigned long s_flag;   /* flags for sbi */
struct mutex writepages;/* mutex for writepages() */
+#ifdef CONFIG_UNICODE
+   struct unicode_map *s_encoding;
+   __u16 s_encoding_flags;
+#endif
 
 #ifdef CONFIG_BLK_DEV_ZONED
unsigned int blocks_per_blkz;   /* F2FS blocks per zone */
@@ -3562,6 +3567,7 @@ F2FS_FEATURE_FUNCS(quota_ino, QUOTA_INO);
 F2FS_FEATURE_FUNCS(inode_crtime, INODE_CRTIME);
 F2FS_FEATURE_FUNCS(lost_found, LOST_FOUND);
 F2FS_FEATURE_FUNCS(sb_chksum, SB_CHKSUM);
+F2FS_FEATURE_FUNCS(casefold, CASEFOLD);
 
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline bool f2fs_blkz_is_seq(struct f2fs_sb_info *sbi, int devi,
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 6de6cda440315..068db78a1e03e 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "f2fs.h"
 #include "node.h"
@@ -222,6 +223,36 @@ void f2fs_printk(struct f2fs_sb_info *sbi, const char 
*fmt, ...)
va_end(args);
 }
 
+#ifdef CONFIG_UNICODE
+static const struct f2fs_sb_

[PATCH v4 3/3] f2fs: Support case-insensitive file name lookups

2019-07-23 Thread Daniel Rosenberg
Modeled after commit b886ee3e778e ("ext4: Support case-insensitive file
name lookups")

"""
This patch implements the actual support for case-insensitive file name
lookups in f2fs, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the casefold feature set is able to configure
directories with the +F (F2FS_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string.  This operation is called a
case-insensitive file name lookup.

The feature is configured as an inode attribute applied to directories
and inherited by its children.  This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.

* dcache handling:

For a +F directory, F2FS only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().

d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries.  This is bad for performance but requires some leveraging of
the vfs layer to fix.  We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.

DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware.  The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.

* Dealing with invalid sequences:

By default, when an invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file.  This means that case-insensitive
file name lookup will not work only for that file.  An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding.  When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.

* Normalization algorithm:

The UTF-8 algorithms used to compare strings in f2fs is implemented
in fs/unicode, and is based on a previous version developed by
SGI.  It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method for F2FS because:

  - It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
  - It doesn't eliminate important semantic meaning like
compatibility decompositions.

Although:

- This implementation is not completely linguistically accurate, because
different languages have conflicting rules.  Handling those would require
specializing the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
"""

Signed-off-by: Daniel Rosenberg 
---
 fs/f2fs/dir.c| 126 +++
 fs/f2fs/f2fs.h   |  15 --
 fs/f2fs/file.c   |  16 +-
 fs/f2fs/hash.c   |  35 -
 fs/f2fs/inline.c |   4 +-
 fs/f2fs/inode.c  |   4 +-
 fs/f2fs/namei.c  |  21 
 fs/f2fs/super.c  |   1 +
 8 files changed, 203 insertions(+), 19 deletions(-)

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 85a1528f319f2..2913483473f30 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "f2fs.h"
 #include "node.h"
 #include "acl.h"
@@ -81,7 +82,8 @@ static unsigned long dir_block_index(unsigned int level,
return bidx;
 }
 
-static struct f2fs_dir_entry *find_in_block(struct page *dentry_page,
+static struct f2fs_dir_entry *find_in_block(struct inode *dir,
+  

Re: [PATCH v6 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

2019-07-23 Thread Phil Auld
On Tue, Jul 23, 2019 at 05:12:18PM -0500 Dave Chiluk wrote:
> Thanks for all the help and testing you provided.  It's good to know
> these changes have passed at least some scheduler regression tests.
> If it comes to a v7 I'll add the Reviewed-by, otherwise I'll just let
> Peter add it.
>

Sounds good.

> Will you be handling the backport into the RHEL 8 kernels?  I'll
> submit this to Ubuntu and linux-stable once it gets accepted.
>

Yep.

> Thanks again,

Thank you :)


Cheers,
Phil

> 
> 
> On Tue, Jul 23, 2019 at 12:13 PM Phil Auld  wrote:
> >
> > Hi Dave,
> >
> > On Tue, Jul 23, 2019 at 11:44:26AM -0500 Dave Chiluk wrote:
> > > It has been observed, that highly-threaded, non-cpu-bound applications
> > > running under cpu.cfs_quota_us constraints can hit a high percentage of
> > > periods throttled while simultaneously not consuming the allocated
> > > amount of quota. This use case is typical of user-interactive non-cpu
> > > bound applications, such as those running in kubernetes or mesos when
> > > run on multiple cpu cores.
> > >
> > > This has been root caused to cpu-local run queue being allocated per cpu
> > > bandwidth slices, and then not fully using that slice within the period.
> > > At which point the slice and quota expires. This expiration of unused
> > > slice results in applications not being able to utilize the quota for
> > > which they are allocated.
> > >
> > > The non-expiration of per-cpu slices was recently fixed by
> > > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> > > condition")'. Prior to that it appears that this had been broken since
> > > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> > > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> > > added the following conditional which resulted in slices never being
> > > expired.
> > >
> > > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> > >   /* extend local deadline, drift is bounded above by 2 ticks */
> > >   cfs_rq->runtime_expires += TICK_NSEC;
> > >
> > > Because this was broken for nearly 5 years, and has recently been fixed
> > > and is now being noticed by many users running kubernetes
> > > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> > > that the mechanisms around expiring runtime should be removed
> > > altogether.
> > >
> > > This allows quota already allocated to per-cpu run-queues to live longer
> > > than the period boundary. This allows threads on runqueues that do not
> > > use much CPU to continue to use their remaining slice over a longer
> > > period of time than cpu.cfs_period_us. However, this helps prevent the
> > > above condition of hitting throttling while also not fully utilizing
> > > your cpu quota.
> > >
> > > This theoretically allows a machine to use slightly more than its
> > > allotted quota in some periods. This overflow would be bounded by the
> > > remaining quota left on each per-cpu runqueue. This is typically no
> > > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> > > change nothing, as they should theoretically fully utilize all of their
> > > quota in each period. For user-interactive tasks as described above this
> > > provides a much better user/application experience as their cpu
> > > utilization will more closely match the amount they requested when they
> > > hit throttling. This means that cpu limits no longer strictly apply per
> > > period for non-cpu bound applications, but that they are still accurate
> > > over longer timeframes.
> > >
> > > This greatly improves performance of high-thread-count, non-cpu bound
> > > applications with low cfs_quota_us allocation on high-core-count
> > > machines. In the case of an artificial testcase (10ms/100ms of quota on
> > > 80 CPU machine), this commit resulted in almost 30x performance
> > > improvement, while still maintaining correct cpu quota restrictions.
> > > That testcase is available at https://github.com/indeedeng/fibtest.
> > >
> > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift 
> > > condition")
> > > Signed-off-by: Dave Chiluk 
> > > Reviewed-by: Ben Segall 
> >
> > This still works for me. The documentation reads pretty well, too. Good job.
> >
> > Feel free to add my Acked-by: or Reviewed-by: Phil Auld .
> >
> > I'll run it through some more tests when I have time. The code is the same
> > as the earlier one I tested from what I can see.
> >
> > Cheers,
> > Phil
> >
> > > ---
> > >  Documentation/scheduler/sched-bwc.rst | 74 ---
> > >  kernel/sched/fair.c   | 72 --
> > >  kernel/sched/sched.h  |  4 --
> > >  3 files changed, 67 insertions(+), 83 deletions(-)
> > >
> > > diff --git a/Documentation/scheduler/sched-bwc.rst 
> > > b/Documentation/scheduler/sched-bwc.rst
> > > index 3a90642..9801d6b 100644
> > > --- a/Documentation/scheduler/sched-

Re: [PATCH v6 3/7] of/platform: Add functional dependency link from DT bindings

2019-07-23 Thread Saravana Kannan
On Tue, Jul 23, 2019 at 3:18 PM Rob Herring  wrote:
>
> On Tue, Jul 23, 2019 at 2:49 PM Saravana Kannan  wrote:
> >
> > On Tue, Jul 23, 2019 at 11:06 AM Rob Herring  wrote:
> > >
> > > On Sat, Jul 20, 2019 at 12:17 AM Saravana Kannan  
> > > wrote:
> > > >
> > > > Add device-links after the devices are created (but before they are
> > > > probed) by looking at common DT bindings like clocks and
> > > > interconnects.
> > >
> > > The structure now looks a lot better to me. A few minor things below.
> >
> > Thanks.
> >
> > > >
> > > > Automatically adding device-links for functional dependencies at the
> > > > framework level provides the following benefits:
> > > >
> > > > - Optimizes device probe order and avoids the useless work of
> > > >   attempting probes of devices that will not probe successfully
> > > >   (because their suppliers aren't present or haven't probed yet).
> > > >
> > > >   For example, in a commonly available mobile SoC, registering just
> > > >   one consumer device's driver at an initcall level earlier than the
> > > >   supplier device's driver causes 11 failed probe attempts before the
> > > >   consumer device probes successfully. This was with a kernel with all
> > > >   the drivers statically compiled in. This problem gets a lot worse if
> > > >   all the drivers are loaded as modules without direct symbol
> > > >   dependencies.
> > > >
> > > > - Supplier devices like clock providers, interconnect providers, etc
> > > >   need to keep the resources they provide active and at a particular
> > > >   state(s) during boot up even if their current set of consumers don't
> > > >   request the resource to be active. This is because the rest of the
> > > >   consumers might not have probed yet and turning off the resource
> > > >   before all the consumers have probed could lead to a hang or
> > > >   undesired user experience.
> > > >
> > > >   Some frameworks (Eg: regulator) handle this today by turning off
> > > >   "unused" resources at late_initcall_sync and hoping all the devices
> > > >   have probed by then. This is not a valid assumption for systems with
> > > >   loadable modules. Other frameworks (Eg: clock) just don't handle
> > > >   this due to the lack of a clear signal for when they can turn off
> > > >   resources. This leads to downstream hacks to handle cases like this
> > > >   that can easily be solved in the upstream kernel.
> > > >
> > > >   By linking devices before they are probed, we give suppliers a clear
> > > >   count of the number of dependent consumers. Once all of the
> > > >   consumers are active, the suppliers can turn off the unused
> > > >   resources without making assumptions about the number of consumers.
> > > >
> > > > By default we just add device-links to track "driver presence" (probe
> > > > succeeded) of the supplier device. If any other functionality provided
> > > > by device-links are needed, it is left to the consumer/supplier
> > > > devices to change the link when they probe.
> > > >
> > > > Signed-off-by: Saravana Kannan 
> > > > ---
> > > >  .../admin-guide/kernel-parameters.txt |   5 +
> > > >  drivers/of/platform.c | 158 ++
> > > >  2 files changed, 163 insertions(+)
> > > >
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> > > > b/Documentation/admin-guide/kernel-parameters.txt
> > > > index 138f6664b2e2..109b4310844f 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -3141,6 +3141,11 @@
> > > > This can be set from sysctl after boot.
> > > > See Documentation/sysctl/vm.txt for details.
> > > >
> > > > +   of_devlink  [KNL] Make device links from common DT 
> > > > bindings. Useful
> > > > +   for optimizing probe order and making sure 
> > > > resources
> > > > +   aren't turned off before the consumer devices 
> > > > have
> > > > +   probed.
> > > > +
> > > > ohci1394_dma=early  [HW] enable debugging via the ohci1394 
> > > > driver.
> > > > See Documentation/debugging-via-ohci1394.txt 
> > > > for more
> > > > info.
> > > > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > > > index 04ad312fd85b..88a2086e26fa 100644
> > > > --- a/drivers/of/platform.c
> > > > +++ b/drivers/of/platform.c
> > > > @@ -509,6 +509,163 @@ int of_platform_default_populate(struct 
> > > > device_node *root,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(of_platform_default_populate);
> > > >
> > > > +bool of_link_is_valid(struct device_node *con, struct device_node *sup)
> > > > +{
> > > > +   of_node_get(sup);
> > > > +   /*
> > > > +* Don't allow linking a device node as a consumer of one of its
> > > > +* descendant nodes. By definition, a child node can't be a 
> > > > functional
> > > 

[PATCH v7 3/7] of/platform: Add functional dependency link from DT bindings

2019-07-23 Thread Saravana Kannan
Add device-links after the devices are created (but before they are
probed) by looking at common DT bindings like clocks and
interconnects.

Automatically adding device-links for functional dependencies at the
framework level provides the following benefits:

- Optimizes device probe order and avoids the useless work of
  attempting probes of devices that will not probe successfully
  (because their suppliers aren't present or haven't probed yet).

  For example, in a commonly available mobile SoC, registering just
  one consumer device's driver at an initcall level earlier than the
  supplier device's driver causes 11 failed probe attempts before the
  consumer device probes successfully. This was with a kernel with all
  the drivers statically compiled in. This problem gets a lot worse if
  all the drivers are loaded as modules without direct symbol
  dependencies.

- Supplier devices like clock providers, interconnect providers, etc
  need to keep the resources they provide active and at a particular
  state(s) during boot up even if their current set of consumers don't
  request the resource to be active. This is because the rest of the
  consumers might not have probed yet and turning off the resource
  before all the consumers have probed could lead to a hang or
  undesired user experience.

  Some frameworks (Eg: regulator) handle this today by turning off
  "unused" resources at late_initcall_sync and hoping all the devices
  have probed by then. This is not a valid assumption for systems with
  loadable modules. Other frameworks (Eg: clock) just don't handle
  this due to the lack of a clear signal for when they can turn off
  resources. This leads to downstream hacks to handle cases like this
  that can easily be solved in the upstream kernel.

  By linking devices before they are probed, we give suppliers a clear
  count of the number of dependent consumers. Once all of the
  consumers are active, the suppliers can turn off the unused
  resources without making assumptions about the number of consumers.

By default we just add device-links to track "driver presence" (probe
succeeded) of the supplier device. If any other functionality provided
by device-links are needed, it is left to the consumer/supplier
devices to change the link when they probe.

Signed-off-by: Saravana Kannan 
---
 .../admin-guide/kernel-parameters.txt |   5 +
 drivers/of/platform.c | 165 ++
 2 files changed, 170 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 46b826fcb5ad..12937349d79d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3170,6 +3170,11 @@
This can be set from sysctl after boot.
See Documentation/admin-guide/sysctl/vm.rst for details.
 
+   of_devlink  [KNL] Make device links from common DT bindings. Useful
+   for optimizing probe order and making sure resources
+   aren't turned off before the consumer devices have
+   probed.
+
ohci1394_dma=early  [HW] enable debugging via the ohci1394 driver.
See Documentation/debugging-via-ohci1394.txt for more
info.
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 7801e25e6895..4344419a26fc 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -508,6 +508,170 @@ int of_platform_default_populate(struct device_node *root,
 }
 EXPORT_SYMBOL_GPL(of_platform_default_populate);
 
+bool of_link_is_valid(struct device_node *con, struct device_node *sup)
+{
+   of_node_get(sup);
+   /*
+* Don't allow linking a device node as a consumer of one of its
+* descendant nodes. By definition, a child node can't be a functional
+* dependency for the parent node.
+*/
+   while (sup) {
+   if (sup == con) {
+   of_node_put(sup);
+   return false;
+   }
+   sup = of_get_next_parent(sup);
+   }
+   return true;
+}
+
+static int of_link_to_phandle(struct device *dev, struct device_node *sup_np)
+{
+   struct platform_device *sup_dev;
+   u32 dl_flags = DL_FLAG_AUTOPROBE_CONSUMER;
+   int ret = 0;
+
+   /*
+* Since we are trying to create device links, we need to find
+* the actual device node that owns this supplier phandle.
+* Often times it's the same node, but sometimes it can be one
+* of the parents. So walk up the parent till you find a
+* device.
+*/
+   while (sup_np && !of_find_property(sup_np, "compatible", NULL))
+   sup_np = of_get_next_parent(sup_np);
+   if (!sup_np)
+   return 0;
+
+   if (!of_link_is_valid(dev->of_node, sup_np)) {
+   of_node_p

RE: [PATCH V15 1/5] dt-bindings: fsl: scu: add thermal binding

2019-07-23 Thread Anson Huang
Ping...

> Hi, Daniel/Rui/Eduardo
>   Could you please take a look at this patch series?
> 
> Anson
> 
> > From: Anson Huang 
> >
> > NXP i.MX8QXP is an ARMv8 SoC with a Cortex-M4 core inside as system
> > controller. The system controller is in charge of system power, clock,
> > thermal sensor management, etc. The Linux kernel has to communicate
> > with the system controller via MU (message unit) IPC to get temperatures
> > from the thermal sensors. This patch adds a binding doc for the i.MX
> > system controller thermal driver.
> >
> > Signed-off-by: Anson Huang 
> > Reviewed-by: Rob Herring 
> > Reviewed-by: Dong Aisheng 
> > ---
> > No change.
> > ---
> >  .../devicetree/bindings/arm/freescale/fsl,scu.txt| 16
> 
> >  1 file changed, 16 insertions(+)
> >
> > diff --git
> > a/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt
> > b/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt
> > index a575e42..fc3844e 100644
> > --- a/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt
> > +++ b/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt
> > @@ -155,6 +155,17 @@ Required properties:
> >  Optional properties:
> >  - timeout-sec: contains the watchdog timeout in seconds.
> >
> > +Thermal bindings based on SCU Message Protocol
> > +
> > +
> > +Required properties:
> > +- compatible:  Should be :
> > + "fsl,imx8qxp-sc-thermal"
> > +   followed by "fsl,imx-sc-thermal";
> > +
> > +- #thermal-sensor-cells:   See
> > Documentation/devicetree/bindings/thermal/thermal.txt
> > +   for a description.
> > +
> >  Example (imx8qxp):
> >  -
> >  aliases {
> > @@ -222,6 +233,11 @@ firmware {
> > compatible = "fsl,imx8qxp-sc-wdt", "fsl,imx-sc-wdt";
> > timeout-sec = <60>;
> > };
> > +
> > +   tsens: thermal-sensor {
> > +   compatible = "fsl,imx8qxp-sc-thermal", "fsl,imx-sc-
> > thermal";
> > +   #thermal-sensor-cells = <1>;
> > +   };
> > };
> >  };
> >
> > --
> > 2.7.4



Re: [PATCH v1 1/2] mm/page_idle: Add support for per-pid page_idle using virtual indexing

2019-07-23 Thread Minchan Kim
On Tue, Jul 23, 2019 at 10:20:49AM -0400, Joel Fernandes wrote:
> On Tue, Jul 23, 2019 at 03:13:58PM +0900, Minchan Kim wrote:
> > Hi Joel,
> > 
> > On Mon, Jul 22, 2019 at 05:32:04PM -0400, Joel Fernandes (Google) wrote:
> > > The page_idle tracking feature currently requires looking up the pagemap
> > > for a process followed by interacting with /sys/kernel/mm/page_idle.
> > > This is quite cumbersome and can be error-prone too. If between
> > 
> > cumbersome: That's the fair tradeoff between idle page tracking and
> > clear_refs, because idle page tracking can check a page even when it
> > is not mapped.
> 
> It is fair tradeoff, but could be made simpler. The userspace code got
> reduced by a good amount as well.
> 
> > error-prone: What's the error?
> 
> We see in normal Android usage that, some of the time, pages appear not to
> be idle even when they really are idle. Reproducing this is a bit
> unpredictable and happens at random occasions. With this new interface, we
> see this happen much less often.

I don't know how you tested. That could be caused by swap-out, by shared
pages touched by other processes, or by some kernel operation that does not
preserve the accessed bit.
Please investigate the root cause further. That would be an important
point for justifying the patch's motivation.

> 
> > > accessing the per-PID pagemap and the global page_idle bitmap, if
> > > something changes with the page then the information is not accurate.
> > 
> > What you mean with error is this timing issue?
> > Why do you need to be accurate? IOW, accuracy is always good, but what's
> > the required scale of the accuracy?
> 
> There is a time window between looking up pagemap and checking if page is
> idle. Anyway, see below for the primary goals as you asked:
> 
> > > More over looking up PFN from pagemap in Android devices is not
> > > supported by unprivileged process and requires SYS_ADMIN and gives 0 for
> > > the PFN.
> > > 
> > > This patch adds support to directly interact with page_idle tracking at
> > > the PID level by introducing a /proc//page_idle file. This
> > > eliminates the need for userspace to calculate the mapping of the page.
> > > It follows the exact same semantics as the global
> > > /sys/kernel/mm/page_idle, however it is easier to use for some usecases
> > > where looking up PFN is not needed and also does not require SYS_ADMIN.
> > 
> > Ah, so the primary goal is to provide a convenience interface, and it would
> > help accuracy, too. IOW, accuracy is not your main goal?
> 
> There are a couple of primary goals: security, convenience, and also solving
> the accuracy/reliability problem we are seeing. Do keep in mind that looking
> up the PFN has security implications. The PFN field in pagemap is zeroed if
> the user does not have CAP_SYS_ADMIN.

Maybe you don't need the PFN, then?

> 
> > > In Android, we are using this for the heap profiler (heapprofd) which
> > > profiles and pin points code paths which allocates and leaves memory
> > > idle for long periods of time.
> > 
> > So the goal is to detect idle pages with idle memory tracking?
> 
> Isn't that what idle memory tracking does?

To me, it's rather misleading. Please read the motivation section in the
document. The feature is good for detecting workingset pages, not idle pages,
because workingset pages are never freed or swapped out, and we can even
count newly allocated pages.

Motivation
==

The idle page tracking feature allows to track which memory pages are being
accessed by a workload and which are idle. This information can be useful for
estimating the workload's working set size, which, in turn, can be taken into
account when configuring the workload parameters, setting memory cgroup limits,
or deciding where to place the workload within a compute cluster.

> 
> > It couldn't work well, because such idle pages could eventually be swapped
> > out and lose all the page descriptor flags that the idle page tracking
> > mechanism relies on. It should have been named "workingset page tracking",
> > not "idle page tracking".
> 
> The heap profiler that uses page-idle tracking is not to measure working set,
> but to look for pages that are idle for long periods of time.

That's an important part. Please include it in the description so that people
understand the use case. As I said above, if the aim is finding
idle pages during the period, the current idle page tracking feature is,
ironically, not a good fit.

> 
> Thanks for bringing up the swapping corner case..  Perhaps we can improve
> the heap profiler to detect this by looking at bits 0-4 in pagemap. While it

Yep, that could work, but wouldn't it add back the overhead you want to remove?
Userspace would also have to keep metadata to identify whether a page was
swapped in during the last period or newly swapped in during the new period.

> is true that we would lose access information during the window, there is a
> high likelihood that the page was not accessed which is why it was swapped.
> Thoughts?

It depends on