
[PATCH v2 00/39] Memory allocation profiling

2023-10-24 Thread Suren Baghdasaryan

_PROFILING - enables memory allocation profiling.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - enables memory allocation
profiling by default.
CONFIG_MEM_ALLOC_PROFILING_DEBUG - enables memory allocation profiling
validation.
Note: CONFIG_MEM_ALLOC_PROFILING enables CONFIG_PAGE_EXTENSION to store
code tag reference in the page_ext object.

/proc/sys/vm/mem_profiling sysctl is provided to enable/disable the
functionality and avoid the performance overhead.

Overhead
To measure the overhead we compare the following configurations:
(1) Baseline
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING &
!CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING &
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING &
!CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT & /proc/sys/vm/mem_profiling=1)
(5) Memcg (CONFIG_MEMCG_KMEM)
(6) Enabled by default with memcg (CONFIG_MEM_ALLOC_PROFILING &
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT & CONFIG_MEMCG_KMEM)

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize noise. Below is a performance
comparison between the baseline kernel, profiling when enabled, profiling
when disabled and (for comparison purposes) baseline with
CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:

                        kmalloc                 pgalloc
(1 baseline)            12.041s                 49.190s
(2 default disabled)    14.970s (+24.33%)       49.684s (+1.00%)
(3 default enabled)     16.859s (+40.01%)       56.287s (+14.43%)
(4 runtime enabled)     16.983s (+41.04%)       55.760s (+13.36%)
(5 memcg)               33.831s (+180.96%)      51.433s (+4.56%)
(6 enabled & memcg)     39.145s (+225.10%)      56.874s (+15.62%)
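
The percentages in the table are relative to the baseline row. A minimal
sketch of that arithmetic, assuming the timings shown are the only inputs
(the helper name overhead_percent is made up for this illustration):

```c
#include <assert.h>

/*
 * Relative overhead of a measured run over the baseline, in percent.
 * Hypothetical helper; it only restates the arithmetic behind the table.
 */
static double overhead_percent(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (measured - baseline) / baseline * 100.0;
}
```

For example, overhead_percent(12.041, 14.970) gives roughly 24.33, matching
the kmalloc figure for configuration (2). Some rows may differ in the last
digit because the published timings are already rounded.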

Memory overhead:
Kernel size:

    text      data      bss       dec       hex
(1) 32638461  18286426  18325508  69250395  420ad5b
(2) 32710110  18646586  18071556  69428252  423641c
(3) 32706918  18646586  18071556  69425060  42357a4
(4) 32709664  18646586  18071556  69427806  423625e
(5) 32715893  18345334  18239492  69300719  42171ef
(6) 32786068  18701958  17993732  69481758  424351e

Memory consumption on a 56 core Intel CPU with 125GB of memory running
Fedora:
Code tags:      192 kB
PageExts:    262144 kB (256MB)
SlabExts:      9876 kB (9.6MB)
PcpuExts:       512 kB (0.5MB)

Total overhead is 0.2% of total memory.
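
The 0.2% figure can be checked from the four components listed above. A
small sketch, assuming that "125GB" means 125 GiB and that 1 kB here means
1024 bytes (both are assumptions, as are the helper names):

```c
#include <assert.h>

/* Sum of the reported components in kB: code tags, PageExts, SlabExts, PcpuExts. */
static unsigned long total_overhead_kb(void)
{
	return 192UL + 262144UL + 9876UL + 512UL;
}

/* Overhead as a percentage of total RAM, with RAM given in GiB (assumed unit). */
static double overhead_percent_of_ram(unsigned long overhead_kb,
				      unsigned long ram_gib)
{
	unsigned long ram_kb = ram_gib * 1024UL * 1024UL;

	assert(ram_kb > 0);
	return (double)overhead_kb / (double)ram_kb * 100.0;
}
```

With these assumptions the components add up to 272724 kB, and
overhead_percent_of_ram(total_overhead_kb(), 125) comes out at about 0.21,
consistent with the rounded 0.2% above.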

[1] https://lore.kernel.org/all/20230501165450.15352-1-sur...@google.com/
[2] https://docs.google.com/presentation/d/1zQnuMbEfcq9lHUXgJRUZsd1McRAkr3Xq6Wk693YA0To/edit?usp=sharing
[3] https://lore.kernel.org/all/20220830214919.53220-1-sur...@google.com/

[example 1]:
typedef struct codetag {
  const char* file;
  int line;
  int counter;
} codetag;

void my_trampoline(func_ptr func, ...) {
  static codetag callsite_data __section("alloc_tags") =
{ __callsite_FILE, __callsite_LINE, 0 };
  callsite_data.counter++;
  func(...);
}

__callsite_wrapper(my_trampoline)
__attribute__ ((always_inline))
static inline void foo1(void) {
  printf("foo1 function\n");
}

__callsite_wrapper(my_trampoline)
__attribute__ ((always_inline))
static inline void foo2(void) {
  printf("foo2 function\n");
}

void bar(void) {
  foo1();
}

int main(int argc, char** argv) {
  foo1();
  foo2();
  bar();
  return 0;
}

Kent Overstreet (16):
  lib/string_helpers: Add flags param to string_get_size()
  scripts/kallysms: Always include __start and __stop symbols
  fs: Convert alloc_inode_sb() to a macro
  nodemask: Split out include/linux/nodemask_types.h
  prandom: Remove unused include
  change alloc_pages name in ivpu_bo_ops to avoid conflicts
  mm/slub: Mark slab_free_freelist_hook() __always_inline
  mempool: Hook up to memory allocation profiling
  xfs: Memory allocation profiling fixups
  timekeeping: Fix a circular include dependency
  mm: percpu: Introduce pcpuobj_ext
  mm: percpu: Add codetag reference into pcpuobj_ext
  arm64: Fix circular header dependency
  mm: vmalloc: Enable memory allocation profiling
  rhashtable: Plumb through alloc tag
  MAINTAINERS: Add entries for code tagging and memory allocation
    profiling

Suren Baghdasaryan (23):
  mm: enumerate all gfp flags
  mm: introduce slabobj_ext to support slab object extensions
  mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
    creation
  mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
  mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
    objects
  slab: objext: introduce objext_flags as extension to
    page_memcg_data_flags
  lib: code

[PATCH v2 01/39] lib/string_helpers: Add flags param to string_get_size()

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

The new flags parameter allows controlling
 - Whether or not the units suffix is separated by a space, for
   compatibility with sort -h
 - Whether or not to append a B suffix - we're not always printing
   bytes.
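
To illustrate how a single flags word can carry the choices listed above,
here is a userspace sketch. It is not the kernel implementation: the names
sketch_get_size and SKETCH_SIZE_* are invented for this example (only the
base-2 flag mirrors STRING_SIZE_BASE2 from the diff), and it truncates
instead of rounding.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag names; not taken from the kernel headers. */
enum sketch_size_flags {
	SKETCH_SIZE_BASE2   = 1 << 0,	/* 1024-based units instead of 1000 */
	SKETCH_SIZE_NOSPACE = 1 << 1,	/* no space before the unit suffix */
	SKETCH_SIZE_NOBYTES = 1 << 2,	/* do not append the trailing "B" */
};

static int sketch_get_size(unsigned long long size, unsigned int flags,
			   char *buf, size_t len)
{
	static const char *const units10[] = { "", "k", "M", "G", "T" };
	static const char *const units2[]  = { "", "Ki", "Mi", "Gi", "Ti" };
	const unsigned int base = (flags & SKETCH_SIZE_BASE2) ? 1024 : 1000;
	const char *const *units =
		(flags & SKETCH_SIZE_BASE2) ? units2 : units10;
	size_t i = 0;

	assert(buf != NULL && len > 0);

	/* Truncating division keeps the sketch short; it does not round. */
	while (size >= base && i < 4) {
		size /= base;
		i++;
	}

	return snprintf(buf, len, "%llu%s%s%s", size,
			(flags & SKETCH_SIZE_NOSPACE) ? "" : " ",
			units[i],
			(flags & SKETCH_SIZE_NOBYTES) ? "" : "B");
}
```

Calling sketch_get_size(4096, SKETCH_SIZE_BASE2, ...) fills the buffer with
"4 KiB"; adding SKETCH_SIZE_NOSPACE | SKETCH_SIZE_NOBYTES yields "4Ki".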

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Cc: Andy Shevchenko 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "Noralf Trønnes" 
Cc: Jens Axboe 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c      |  2 +-
 drivers/block/virtio_blk.c                    |  4 ++--
 drivers/gpu/drm/gud/gud_drv.c                 |  2 +-
 drivers/mmc/core/block.c                      |  4 ++--
 drivers/mtd/spi-nor/debugfs.c                 |  6 ++---
 .../ethernet/chelsio/cxgb4/cxgb4_debugfs.c    |  4 ++--
 drivers/scsi/sd.c                             |  8 +++
 include/linux/string_helpers.h                | 13 +-
 lib/string_helpers.c                          | 24 +--
 lib/test-string_helpers.c                     |  4 ++--
 mm/hugetlb.c                                  |  8 +++
 11 files changed, 44 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index c6a4ac766b2b..27aa5a083ff0 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -260,7 +260,7 @@ print_mapping(unsigned long start, unsigned long end, unsigned long size, bool e
if (end <= start)
return;
 
-   string_get_size(size, 1, STRING_UNITS_2, buf, sizeof(buf));
+   string_get_size(size, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
 
pr_info("Mapped 0x%016lx-0x%016lx with %s pages%s\n", start, end, buf,
exec ? " (exec)" : "");
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1fe011676d07..59140424d755 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -986,9 +986,9 @@ static void virtblk_update_capacity(struct virtio_blk *vblk, bool resize)
nblocks = DIV_ROUND_UP_ULL(capacity, queue_logical_block_size(q) >> 9);
 
string_get_size(nblocks, queue_logical_block_size(q),
-   STRING_UNITS_2, cap_str_2, sizeof(cap_str_2));
+   STRING_SIZE_BASE2, cap_str_2, sizeof(cap_str_2));
string_get_size(nblocks, queue_logical_block_size(q),
-   STRING_UNITS_10, cap_str_10, sizeof(cap_str_10));
+   0, cap_str_10, sizeof(cap_str_10));
 
dev_notice(&vdev->dev,
   "[%s] %s%llu %d-byte logical blocks (%s/%s)\n",
diff --git a/drivers/gpu/drm/gud/gud_drv.c b/drivers/gpu/drm/gud/gud_drv.c
index 9d7bf8ee45f1..6b1748e1f666 100644
--- a/drivers/gpu/drm/gud/gud_drv.c
+++ b/drivers/gpu/drm/gud/gud_drv.c
@@ -329,7 +329,7 @@ static int gud_stats_debugfs(struct seq_file *m, void *data)
struct gud_device *gdrm = to_gud_device(entry->dev);
char buf[10];
 
-   string_get_size(gdrm->bulk_len, 1, STRING_UNITS_2, buf, sizeof(buf));
+   string_get_size(gdrm->bulk_len, 1, STRING_SIZE_BASE2, buf, sizeof(buf));
seq_printf(m, "Max buffer size: %s\n", buf);
seq_printf(m, "Number of errors:  %u\n", gdrm->stats_num_errors);
 
diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 3a8f27c3e310..411dc8137f7c 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2511,7 +2511,7 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
 
blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);
 
-   string_get_size((u64)size, 512, STRING_UNITS_2,
+   string_get_size((u64)size, 512, STRING_SIZE_BASE2,
cap_str, sizeof(cap_str));
pr_info("%s: %s %s %s%s\n",
md->disk->disk_name, mmc_card_id(card), mmc_card_name(card),
@@ -2707,7 +2707,7 @@ static int mmc_blk_alloc_rpmb_part(struct mmc_card *card,
 
list_add(&rpmb->node, &md->rpmbs);
 
-   string_get_size((u64)size, 512, STRING_UNITS_2,
+   string_get_size((u64)size, 512, STRING_SIZE_BASE2,
cap_str, sizeof(cap_str));
 
pr_info("%s: %s %s %s, chardev (%d:%d)\n",
diff --git a/drivers/mtd/spi-nor/debugfs.c b/drivers/mtd/spi-nor/debugfs.c
index 6e163cb5b478..a1b61938fee2 100644
--- a/drivers/mtd/spi-nor/debugfs.c
+++ b/drivers/mtd/spi-nor/debugfs.c
@@ -85,7 +85,7 @@ static int spi_nor_params_show(struct seq_file *s, void *data)
 
seq_printf(s, "name\t\t%s\n", info->name);
seq_printf(s, "id\t\t%*ph\n", SPI_NOR_MAX_ID_LEN, nor->id);
-   string_get_size(params->size, 1, STRING_UNITS_2, buf, sizeof(buf));
+   string_get_size(params->size, 1, STR

[PATCH v2 02/39] scripts/kallysms: Always include __start and __stop symbols

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

These symbols are used to denote section boundaries: by always including
them we can unify loading sections from modules with loading built-in
sections, which leads to some significant cleanup.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 scripts/kallsyms.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 653b92f6d4c8..47978efe4797 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -204,6 +204,11 @@ static int symbol_in_range(const struct sym_entry *s,
return 0;
 }
 
+static bool string_starts_with(const char *s, const char *prefix)
+{
+   return strncmp(s, prefix, strlen(prefix)) == 0;
+}
+
 static int symbol_valid(const struct sym_entry *s)
 {
const char *name = sym_name(s);
@@ -211,6 +216,14 @@ static int symbol_valid(const struct sym_entry *s)
/* if --all-symbols is not specified, then symbols outside the text
 * and inittext sections are discarded */
if (!all_symbols) {
+   /*
+* Symbols starting with __start and __stop are used to denote
+* section boundaries, and should always be included:
+*/
+   if (string_starts_with(name, "__start_") ||
+   string_starts_with(name, "__stop_"))
+   return 1;
+
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
return 0;
-- 
2.42.0.758.gaed0368e0e-goog
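
The prefix test added by this patch is easy to exercise in userspace. The
sketch below copies string_starts_with() from the diff and wraps the two
checks in a helper; is_section_boundary_symbol is a name made up for this
illustration and does not exist in scripts/kallsyms.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Same logic as the helper added in the diff above. */
static bool string_starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Hypothetical wrapper: true for symbols that mark section boundaries. */
static bool is_section_boundary_symbol(const char *name)
{
	assert(name != NULL);
	return string_starts_with(name, "__start_") ||
	       string_starts_with(name, "__stop_");
}
```

Note that the trailing underscore is part of the prefix, so a bare
"__start" is not treated as a boundary symbol.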

[PATCH v2 03/39] fs: Convert alloc_inode_sb() to a macro

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

We're introducing alloc tagging, which tracks memory allocations by
callsite. Converting alloc_inode_sb() to a macro means allocations will
be tracked by its caller, which is a bit more useful.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Cc: Alexander Viro 
---
 include/linux/fs.h | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a40823c3c67..c545b1839e96 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2862,11 +2862,7 @@ int setattr_should_drop_sgid(struct mnt_idmap *idmap,
  * This must be used for allocating filesystems specific inodes to set
  * up the inode reclaim context correctly.
  */
-static inline void *
-alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
-{
-   return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
-}
+#define alloc_inode_sb(_sb, _cache, _gfp) kmem_cache_alloc_lru(_cache, &_sb->s_inode_lru, _gfp)
 
 extern void __insert_inode_hash(struct inode *, unsigned long hashval);
 static inline void insert_inode_hash(struct inode *inode)
-- 
2.42.0.758.gaed0368e0e-goog
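
Why a macro attributes allocations to the caller while an inline function
does not can be shown with __LINE__ as a stand-in for a per-callsite tag.
This is only an analogy under that assumption; all names below are invented
and none of this is the alloc tagging code itself.

```c
#include <assert.h>

static int last_line;	/* source line recorded by the latest "allocation" */

/* Stand-in for creating a per-callsite tag: remember where we expanded. */
#define record_callsite()	(last_line = __LINE__)

/* Function wrapper: __LINE__ expands once, inside this helper. */
static inline int alloc_via_function(void)
{
	return record_callsite();
}

/* Macro wrapper: __LINE__ expands separately at every use. */
#define alloc_via_macro()	record_callsite()

/* Two distinct callers of each wrapper, deliberately on separate lines. */
static int caller_a_function(void) { return alloc_via_function(); }
static int caller_b_function(void) { return alloc_via_function(); }
static int caller_a_macro(void) { return alloc_via_macro(); }
static int caller_b_macro(void) { return alloc_via_macro(); }
```

Both function-based callers report the same line (the one inside the
helper), whereas the two macro-based callers report different lines, which
is the per-caller attribution the commit message is after.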

[PATCH v2 04/39] nodemask: Split out include/linux/nodemask_types.h

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

sched.h, which defines task_struct, needs nodemask_t - but sched.h is a
frequently used header and ideally shouldn't be pulling in any more code
that it needs to.

This splits out nodemask_types.h which has the definition sched.h needs,
which will avoid a circular header dependency in the alloc tagging patch
series, and as a bonus should speed up kernel build times.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
---
 include/linux/nodemask.h       | 2 +-
 include/linux/nodemask_types.h | 9 +
 include/linux/sched.h          | 2 +-
 3 files changed, 11 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/nodemask_types.h

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 8d07116caaf1..b61438313a73 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -93,10 +93,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
 extern nodemask_t _unused_nodemask_arg_;
 
 /**
diff --git a/include/linux/nodemask_types.h b/include/linux/nodemask_types.h
new file mode 100644
index 000000000000..84c2f47c4237
--- /dev/null
+++ b/include/linux/nodemask_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_NODEMASK_TYPES_H
+#define __LINUX_NODEMASK_TYPES_H
+
+#include 
+
+typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
+
+#endif /* __LINUX_NODEMASK_TYPES_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 77f01ac385f7..12a2554a3164 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 05/39] prandom: Remove unused include

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

prandom.h doesn't use percpu.h - this fixes some circular header issues.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/prandom.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index f2ed5b72b3d6..f7f1e5251c67 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -10,7 +10,6 @@
 
 #include 
 #include 
-#include 
 #include 
 
 struct rnd_state {
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 06/39] mm: enumerate all gfp flags

2023-10-24 Thread Suren Baghdasaryan

Introduce GFP bits enumeration to let compiler track the number of used
bits (which depends on the config options) instead of hardcoding them.
That simplifies __GFP_BITS_SHIFT calculation.

Suggested-by: Petr Tesařík 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/gfp_types.h | 90 +++
 1 file changed, 62 insertions(+), 28 deletions(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 6583a58670c5..3fbe624763d9 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -21,44 +21,78 @@ typedef unsigned int __bitwise gfp_t;
  * include/trace/events/mmflags.h and tools/perf/builtin-kmem.c
  */
 
+enum {
+   ___GFP_DMA_BIT,
+   ___GFP_HIGHMEM_BIT,
+   ___GFP_DMA32_BIT,
+   ___GFP_MOVABLE_BIT,
+   ___GFP_RECLAIMABLE_BIT,
+   ___GFP_HIGH_BIT,
+   ___GFP_IO_BIT,
+   ___GFP_FS_BIT,
+   ___GFP_ZERO_BIT,
+   ___GFP_UNUSED_BIT,  /* 0x200u unused */
+   ___GFP_DIRECT_RECLAIM_BIT,
+   ___GFP_KSWAPD_RECLAIM_BIT,
+   ___GFP_WRITE_BIT,
+   ___GFP_NOWARN_BIT,
+   ___GFP_RETRY_MAYFAIL_BIT,
+   ___GFP_NOFAIL_BIT,
+   ___GFP_NORETRY_BIT,
+   ___GFP_MEMALLOC_BIT,
+   ___GFP_COMP_BIT,
+   ___GFP_NOMEMALLOC_BIT,
+   ___GFP_HARDWALL_BIT,
+   ___GFP_THISNODE_BIT,
+   ___GFP_ACCOUNT_BIT,
+   ___GFP_ZEROTAGS_BIT,
+#ifdef CONFIG_KASAN_HW_TAGS
+   ___GFP_SKIP_ZERO_BIT,
+   ___GFP_SKIP_KASAN_BIT,
+#endif
+#ifdef CONFIG_LOCKDEP
+   ___GFP_NOLOCKDEP_BIT,
+#endif
+   ___GFP_LAST_BIT
+};
+
 /* Plain integer GFP bitmasks. Do not use this directly. */
-#define ___GFP_DMA		0x01u
-#define ___GFP_HIGHMEM		0x02u
-#define ___GFP_DMA32		0x04u
-#define ___GFP_MOVABLE		0x08u
-#define ___GFP_RECLAIMABLE	0x10u
-#define ___GFP_HIGH		0x20u
-#define ___GFP_IO		0x40u
-#define ___GFP_FS		0x80u
-#define ___GFP_ZERO		0x100u
+#define ___GFP_DMA		BIT(___GFP_DMA_BIT)
+#define ___GFP_HIGHMEM		BIT(___GFP_HIGHMEM_BIT)
+#define ___GFP_DMA32		BIT(___GFP_DMA32_BIT)
+#define ___GFP_MOVABLE		BIT(___GFP_MOVABLE_BIT)
+#define ___GFP_RECLAIMABLE	BIT(___GFP_RECLAIMABLE_BIT)
+#define ___GFP_HIGH		BIT(___GFP_HIGH_BIT)
+#define ___GFP_IO		BIT(___GFP_IO_BIT)
+#define ___GFP_FS		BIT(___GFP_FS_BIT)
+#define ___GFP_ZERO		BIT(___GFP_ZERO_BIT)
 /* 0x200u unused */
-#define ___GFP_DIRECT_RECLAIM	0x400u
-#define ___GFP_KSWAPD_RECLAIM	0x800u
-#define ___GFP_WRITE		0x1000u
-#define ___GFP_NOWARN		0x2000u
-#define ___GFP_RETRY_MAYFAIL	0x4000u
-#define ___GFP_NOFAIL		0x8000u
-#define ___GFP_NORETRY		0x10000u
-#define ___GFP_MEMALLOC		0x20000u
-#define ___GFP_COMP		0x40000u
-#define ___GFP_NOMEMALLOC	0x80000u
-#define ___GFP_HARDWALL		0x100000u
-#define ___GFP_THISNODE		0x200000u
-#define ___GFP_ACCOUNT		0x400000u
-#define ___GFP_ZEROTAGS		0x800000u
+#define ___GFP_DIRECT_RECLAIM	BIT(___GFP_DIRECT_RECLAIM_BIT)
+#define ___GFP_KSWAPD_RECLAIM	BIT(___GFP_KSWAPD_RECLAIM_BIT)
+#define ___GFP_WRITE		BIT(___GFP_WRITE_BIT)
+#define ___GFP_NOWARN		BIT(___GFP_NOWARN_BIT)
+#define ___GFP_RETRY_MAYFAIL	BIT(___GFP_RETRY_MAYFAIL_BIT)
+#define ___GFP_NOFAIL		BIT(___GFP_NOFAIL_BIT)
+#define ___GFP_NORETRY		BIT(___GFP_NORETRY_BIT)
+#define ___GFP_MEMALLOC		BIT(___GFP_MEMALLOC_BIT)
+#define ___GFP_COMP		BIT(___GFP_COMP_BIT)
+#define ___GFP_NOMEMALLOC	BIT(___GFP_NOMEMALLOC_BIT)
+#define ___GFP_HARDWALL		BIT(___GFP_HARDWALL_BIT)
+#define ___GFP_THISNODE		BIT(___GFP_THISNODE_BIT)
+#define ___GFP_ACCOUNT		BIT(___GFP_ACCOUNT_BIT)
+#define ___GFP_ZEROTAGS		BIT(___GFP_ZEROTAGS_BIT)
 #ifdef CONFIG_KASAN_HW_TAGS
-#define ___GFP_SKIP_ZERO	0x1000000u
-#define ___GFP_SKIP_KASAN	0x2000000u
+#define ___GFP_SKIP_ZERO	BIT(___GFP_SKIP_ZERO_BIT)
+#define ___GFP_SKIP_KASAN	BIT(___GFP_SKIP_KASAN_BIT)
 #else
 #define ___GFP_SKIP_ZERO	0
 #define ___GFP_SKIP_KASAN	0
 #endif
 #ifdef CONFIG_LOCKDEP
-#define ___GFP_NOLOCKDEP	0x4000000u
+#define ___GFP_NOLOCKDEP	BIT(___GFP_NOLOCKDEP_BIT)
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
-/* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -249,7 +283,7 @@ typedef unsigned int __bitwise gfp_t;
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT ___GFP_LAST_BIT
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
-- 
2.42.0.758.gaed0368e0e-goog
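
The idea of letting an enum count the bits in use can be tried in
isolation. The sketch below uses made-up SK_* names and a hypothetical
SKETCH_OPTIONAL_FLAG switch in place of a CONFIG_* option; it is an
illustration of the technique, not a copy of gfp_types.h.

```c
#include <assert.h>

/* Uncomment to simulate enabling an optional config-dependent flag. */
/* #define SKETCH_OPTIONAL_FLAG */

enum {
	SK_FLAG_A_BIT,
	SK_FLAG_B_BIT,
#ifdef SKETCH_OPTIONAL_FLAG
	SK_FLAG_OPT_BIT,
#endif
	SK_FLAG_C_BIT,
	SK_LAST_BIT	/* equals the number of bits in use */
};

#define SK_BIT(n)	(1u << (n))

#define SK_FLAG_A	SK_BIT(SK_FLAG_A_BIT)
#define SK_FLAG_B	SK_BIT(SK_FLAG_B_BIT)
#ifdef SKETCH_OPTIONAL_FLAG
#define SK_FLAG_OPT	SK_BIT(SK_FLAG_OPT_BIT)
#else
#define SK_FLAG_OPT	0
#endif
#define SK_FLAG_C	SK_BIT(SK_FLAG_C_BIT)

/* No hardcoded count: the shift follows the enum automatically. */
#define SK_BITS_SHIFT	SK_LAST_BIT
#define SK_BITS_MASK	((1u << SK_BITS_SHIFT) - 1)

/* Every defined flag fits in the mask, whichever options are enabled. */
_Static_assert((SK_FLAG_A | SK_FLAG_B | SK_FLAG_OPT | SK_FLAG_C) ==
	       SK_BITS_MASK, "mask must cover exactly the bits in use");
```

With the optional flag left disabled, SK_BITS_SHIFT is 3 and SK_FLAG_C is
0x4; enabling it shifts SK_FLAG_C to 0x8 and the mask grows without any
manual renumbering.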

[PATCH v2 07/39] mm: introduce slabobj_ext to support slab object extensions

2023-10-24 Thread Suren Baghdasaryan

Currently slab pages can store only vectors of obj_cgroup pointers in
page->memcg_data. Introduce slabobj_ext structure to allow more data
to be stored for each slab object. Wrap obj_cgroup into slabobj_ext
to support current functionality while allowing to extend slabobj_ext
in the future.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/memcontrol.h |  20 +++--
 include/linux/mm_types.h   |   4 +-
 init/Kconfig               |   4 +
 mm/kfence/core.c           |  14 ++--
 mm/kfence/kfence.h         |   4 +-
 mm/memcontrol.c            |  56 ++
 mm/page_owner.c            |   2 +-
 mm/slab.h                  | 148 +
 mm/slab_common.c           |  47 
 9 files changed, 185 insertions(+), 114 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e4e24da16d2c..4b17ebb7e723 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -346,8 +346,8 @@ struct mem_cgroup {
 extern struct mem_cgroup *root_mem_cgroup;
 
 enum page_memcg_data_flags {
-   /* page->memcg_data is a pointer to an objcgs vector */
-   MEMCG_DATA_OBJCGS = (1UL << 0),
+   /* page->memcg_data is a pointer to an slabobj_ext vector */
+   MEMCG_DATA_OBJEXTS = (1UL << 0),
/* page has been accounted as a non-slab kernel page */
MEMCG_DATA_KMEM = (1UL << 1),
/* the next bit after the last actual flag */
@@ -385,7 +385,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;
 
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
-   VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+   VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
 
return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -406,7 +406,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
unsigned long memcg_data = folio->memcg_data;
 
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
-   VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio);
+   VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
 
return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
@@ -503,7 +503,7 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
 */
unsigned long memcg_data = READ_ONCE(folio->memcg_data);
 
-   if (memcg_data & MEMCG_DATA_OBJCGS)
+   if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;
 
if (memcg_data & MEMCG_DATA_KMEM) {
@@ -549,7 +549,7 @@ static inline struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *ob
 static inline bool folio_memcg_kmem(struct folio *folio)
 {
VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page);
-   VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio);
+   VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJEXTS, folio);
return folio->memcg_data & MEMCG_DATA_KMEM;
 }
 
@@ -1593,6 +1593,14 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 }
 #endif /* CONFIG_MEMCG */
 
+/*
+ * Extended information for slab objects stored as an array in page->memcg_data
+ * if MEMCG_DATA_OBJEXTS is set.
+ */
+struct slabobj_ext {
+   struct obj_cgroup *objcg;
+} __aligned(8);
+
 static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
 {
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 36c5b43999e6..5b55c4752c23 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -180,7 +180,7 @@ struct page {
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;
 
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
 #endif
 
@@ -315,7 +315,7 @@ struct folio {
};
atomic_t _mapcount;
atomic_t _refcount;
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_SLAB_OBJ_EXT
unsigned long memcg_data;
 #endif
/* private: the union with struct page is transitional */
diff --git a/init/Kconfig b/init/Kconfig
index 6d35728b94b2..78a7abe36037 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -937,10 +937,14 @@ config CGROUP_FAVOR_DYNMODS
 
   Say N if unsure.
 
+config SLAB_OBJ_EXT
+   bool
+
 config MEMCG
bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
+   select SLAB_OBJ_EXT
help
  Provides control over the memory footprint of tasks in a cgroup.
 
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 3872528d0963..02b744d2e07d 100644
--- a/mm/kfenc

[PATCH v2 08/39] mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creation

2023-10-24 Thread Suren Baghdasaryan

Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations
when allocating slabobj_ext on a slab.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/gfp_types.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 3fbe624763d9..1c6573d69347 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -52,6 +52,9 @@ enum {
 #endif
 #ifdef CONFIG_LOCKDEP
___GFP_NOLOCKDEP_BIT,
+#endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+   ___GFP_NO_OBJ_EXT_BIT,
 #endif
___GFP_LAST_BIT
 };
@@ -93,6 +96,11 @@ enum {
 #else
 #define ___GFP_NOLOCKDEP   0
 #endif
+#ifdef CONFIG_SLAB_OBJ_EXT
+#define ___GFP_NO_OBJ_EXT   BIT(___GFP_NO_OBJ_EXT_BIT)
+#else
+#define ___GFP_NO_OBJ_EXT   0
+#endif
 
 /*
  * Physical address zone modifiers (see linux/mmzone.h - low four bits)
@@ -133,12 +141,15 @@ enum {
  * node with no fallbacks or placement policy enforcements.
  *
  * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_NO_OBJ_EXT causes slab allocation to have no object extension.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
 #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT  ((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_NO_OBJ_EXT   ((__force gfp_t)___GFP_NO_OBJ_EXT)
 
 /**
  * DOC: Watermark modifiers
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 10/39] mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache objects

2023-10-24 Thread Suren Baghdasaryan

Use __GFP_NO_OBJ_EXT to prevent recursions when allocating slabobj_ext
objects. Also prevent slabobj_ext allocations for kmem_cache objects.

Signed-off-by: Suren Baghdasaryan 
---
 mm/slab.h        | 6 ++
 mm/slab_common.c | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/mm/slab.h b/mm/slab.h
index 5a47125469f1..187acc593397 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -489,6 +489,12 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
if (!need_slab_obj_ext())
return NULL;
 
+   if (s->flags & SLAB_NO_OBJ_EXT)
+   return NULL;
+
+   if (flags & __GFP_NO_OBJ_EXT)
+   return NULL;
+
slab = virt_to_slab(p);
if (!slab_obj_exts(slab) &&
WARN(alloc_slab_obj_exts(slab, s, flags, false),
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 2b42a9d2c11c..446f406d2703 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -222,6 +222,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
void *vec;
 
gfp &= ~OBJCGS_CLEAR_MASK;
+   /* Prevent recursive extension vector allocation */
+   gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
   slab_nid(slab));
if (!vec)
-- 
2.42.0.758.gaed0368e0e-goog
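
The recursion this patch guards against arises because extension records
are themselves allocated from the allocator they extend. A toy userspace
model of the flag-based guard, assuming a much simpler allocator than slab
(sk_alloc, sk_free, struct sk_obj and SK_NO_OBJ_EXT are all invented names):

```c
#include <assert.h>
#include <stdlib.h>

#define SK_NO_OBJ_EXT	0x1u	/* hypothetical "no extension" request flag */

/* Toy object: every object may own one extension record of the same type. */
struct sk_obj {
	struct sk_obj *ext;	/* extension record, or NULL */
};

static unsigned int sk_allocs;	/* total objects handed out so far */

static struct sk_obj *sk_alloc(unsigned int flags)
{
	struct sk_obj *obj = calloc(1, sizeof(*obj));

	if (!obj)
		return NULL;
	sk_allocs++;

	/*
	 * The extension comes from the same allocator. Without the flag,
	 * this call would ask for an extension of the extension, forever.
	 */
	if (!(flags & SK_NO_OBJ_EXT))
		obj->ext = sk_alloc(flags | SK_NO_OBJ_EXT);

	return obj;
}

static void sk_free(struct sk_obj *obj)
{
	if (!obj)
		return;
	sk_free(obj->ext);
	free(obj);
}
```

A plain sk_alloc(0) therefore performs exactly two allocations, the object
and its extension, and the extension itself carries no further extension.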

[PATCH v2 09/39] mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation

2023-10-24 Thread Suren Baghdasaryan

Slab extension objects can't be allocated before slab infrastructure is
initialized. Some caches, like kmem_cache and kmem_cache_node, are created
before slab infrastructure is initialized. Objects from these caches can't
have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these
caches and avoid creating extensions for objects allocated from these
slabs.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/slab.h | 7 +++
 mm/slab.c            | 2 +-
 mm/slub.c            | 5 +++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8228d1276a2f..11ef3d364b2b 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -164,6 +164,13 @@
 #endif
 #define SLAB_TEMPORARY		SLAB_RECLAIM_ACCOUNT	/* Objects are short-lived */
 
+#ifdef CONFIG_SLAB_OBJ_EXT
+/* Slab created using create_boot_cache */
+#define SLAB_NO_OBJ_EXT ((slab_flags_t __force)0x2000U)
+#else
+#define SLAB_NO_OBJ_EXT 0
+#endif
+
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
diff --git a/mm/slab.c b/mm/slab.c
index 9ad3d0f2d1a5..cefcb7499b6c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1232,7 +1232,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
  nr_node_ids * sizeof(struct kmem_cache_node *),
- SLAB_HWCACHE_ALIGN, 0, 0);
+ SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
list_add(&kmem_cache->list, &slab_caches);
slab_state = PARTIAL;
 
diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..d16643492320 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5043,7 +5043,8 @@ void __init kmem_cache_init(void)
node_set(node, slab_nodes);
 
create_boot_cache(kmem_cache_node, "kmem_cache_node",
-   sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN, 0, 0);
+   sizeof(struct kmem_cache_node),
+   SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
 
hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
 
@@ -5053,7 +5054,7 @@ void __init kmem_cache_init(void)
create_boot_cache(kmem_cache, "kmem_cache",
offsetof(struct kmem_cache, node) +
nr_node_ids * sizeof(struct kmem_cache_node *),
-  SLAB_HWCACHE_ALIGN, 0, 0);
+   SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
 
kmem_cache = bootstrap(&boot_kmem_cache);
kmem_cache_node = bootstrap(&boot_kmem_cache_node);
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 11/39] slab: objext: introduce objext_flags as extension to page_memcg_data_flags

2023-10-24 Thread Suren Baghdasaryan

Introduce objext_flags to store additional objext flags unrelated to memcg.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/memcontrol.h | 29 ++---
 mm/slab.h                  |  4 +---
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4b17ebb7e723..f3ede28b6fa6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,7 +354,22 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS  = (1UL << 2),
 };
 
-#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
+#define __FIRST_OBJEXT_FLAG	__NR_MEMCG_DATA_FLAGS
+
+#else /* CONFIG_MEMCG */
+
+#define __FIRST_OBJEXT_FLAG	(1UL << 0)
+
+#endif /* CONFIG_MEMCG */
+
+enum objext_flags {
+   /* the next bit after the last actual flag */
+   __NR_OBJEXTS_FLAGS  = __FIRST_OBJEXT_FLAG,
+};
+
+#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
+
+#ifdef CONFIG_MEMCG
 
 static inline bool folio_memcg_kmem(struct folio *folio);
 
@@ -388,7 +403,7 @@ static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
 
-   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+   return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 }
 
 /*
@@ -409,7 +424,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
 
-   return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+   return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 }
 
 /*
@@ -466,11 +481,11 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;
 
-   objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+   objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}
 
-   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+   return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 }
 
 /*
@@ -509,11 +524,11 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
if (memcg_data & MEMCG_DATA_KMEM) {
struct obj_cgroup *objcg;
 
-   objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+   objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
return obj_cgroup_memcg(objcg);
}
 
-   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+   return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
 }
 
 static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/slab.h b/mm/slab.h
index 187acc593397..60417fd262ea 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -448,10 +448,8 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
slab_page(slab));
VM_BUG_ON_PAGE(obj_exts & MEMCG_DATA_KMEM, slab_page(slab));
 
-   return (struct slabobj_ext *)(obj_exts & ~MEMCG_DATA_FLAGS_MASK);
-#else
-   return (struct slabobj_ext *)obj_exts;
 #endif
+   return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_FLAGS_MASK);
 }
 
 int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
-- 
2.42.0.758.gaed0368e0e-goog
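
The masking in this patch relies on the extension vector being aligned, so
that the low bits of the stored word are free to hold flags. A userspace
sketch of that packing, with invented sk_* names and a two-flag layout that
only loosely follows page_memcg_data_flags:

```c
#include <assert.h>
#include <stdint.h>

/* 8-byte alignment guarantees the three low pointer bits are zero. */
struct sketch_ext {
	void *objcg;
} __attribute__((aligned(8)));

enum {
	SK_DATA_OBJEXTS	= 1UL << 0,	/* word points to an extension vector */
	SK_DATA_KMEM	= 1UL << 1,
	SK_NR_FLAGS	= 1UL << 2,	/* the next bit after the last flag */
};

#define SK_FLAGS_MASK	((uintptr_t)SK_NR_FLAGS - 1)

/* Combine an aligned vector pointer and its flags into one word. */
static uintptr_t sk_pack(struct sketch_ext *vec, uintptr_t flags)
{
	assert(((uintptr_t)vec & SK_FLAGS_MASK) == 0);
	assert((flags & ~SK_FLAGS_MASK) == 0);
	return (uintptr_t)vec | flags;
}

/* Strip the flag bits to recover the vector pointer. */
static struct sketch_ext *sk_vec(uintptr_t data)
{
	return (struct sketch_ext *)(data & ~SK_FLAGS_MASK);
}

/* Keep only the flag bits. */
static uintptr_t sk_flags(uintptr_t data)
{
	return data & SK_FLAGS_MASK;
}
```

Adding a further flag only requires moving SK_NR_FLAGS up by one bit; the
mask and both accessors follow automatically, which is the property the
new OBJEXTS_FLAGS_MASK definition provides.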

[PATCH v2 12/39] lib: code tagging framework

2023-10-24 Thread Suren Baghdasaryan

Add basic infrastructure to support code tagging, which stores common tag
information consisting of the module name, function, file name and line
number. Provide functions to register a new code tag type and navigate
between code tags.
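
The section-as-array idea can be tried out in userspace. The following is a
minimal sketch, not the kernel implementation: it assumes an ELF target
linked with GNU ld or LLD, and every name ending in or containing "sketch"
(the section name, the macro, the helpers) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Same fields as struct codetag in this patch; 32 bytes on 64-bit targets. */
struct codetag_sketch {
    unsigned int flags;
    unsigned int lineno;
    const char *modname;
    const char *function;
    const char *filename;
} __attribute__((aligned(8)));

/*
 * GNU ld and LLD define __start_<name> and __stop_<name> for any section
 * whose name is a valid C identifier.
 */
extern struct codetag_sketch __start_sketch_tags[];
extern struct codetag_sketch __stop_sketch_tags[];

/* Hypothetical helper: emit one tag describing the current code location. */
#define SKETCH_TAG_HERE()                                                 \
    do {                                                                  \
        static struct codetag_sketch _tag                                 \
        __attribute__((used, aligned(8), section("sketch_tags"))) = {     \
            .modname  = "sketch", /* stands in for KBUILD_MODNAME */      \
            .function = __func__,                                         \
            .filename = __FILE__,                                         \
            .lineno   = __LINE__,                                         \
            .flags    = 0,                                                \
        };                                                                \
        (void)_tag;                                                       \
    } while (0)

void tagged_site_a(void) { SKETCH_TAG_HERE(); }
void tagged_site_b(void) { SKETCH_TAG_HERE(); }

/* Walk the section as an array, as the commit message describes. */
size_t count_sketch_tags(void)
{
    size_t n = 0;

    for (struct codetag_sketch *ct = __start_sketch_tags;
         ct < __stop_sketch_tags; ct++)
        n++;
    return n;
}

/* Return the first tag recorded inside the named function, or NULL. */
const struct codetag_sketch *find_sketch_tag(const char *function)
{
    for (struct codetag_sketch *ct = __start_sketch_tags;
         ct < __stop_sketch_tags; ct++)
        if (strcmp(ct->function, function) == 0)
            return ct;
    return NULL;
}
```

The tags exist as static data whether or not the tagged functions ever run,
which is why the iteration helpers can find them without calling
tagged_site_a() or tagged_site_b() first.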

Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/codetag.h |  71 ++
 lib/Kconfig.debug   |   4 +
 lib/Makefile|   1 +
 lib/codetag.c   | 199 
 4 files changed, 275 insertions(+)
 create mode 100644 include/linux/codetag.h
 create mode 100644 lib/codetag.c

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
new file mode 100644
index 000000000000..a9d7adecc2a5
--- /dev/null
+++ b/include/linux/codetag.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * code tagging framework
+ */
+#ifndef _LINUX_CODETAG_H
+#define _LINUX_CODETAG_H
+
+#include 
+
+struct codetag_iterator;
+struct codetag_type;
+struct seq_buf;
+struct module;
+
+/*
+ * An instance of this structure is created in a special ELF section at every
+ * code location being tagged.  At runtime, the special section is treated as
+ * an array of these.
+ */
+struct codetag {
+   unsigned int flags; /* used in later patches */
+   unsigned int lineno;
+   const char *modname;
+   const char *function;
+   const char *filename;
+} __aligned(8);
+
+union codetag_ref {
+   struct codetag *ct;
+};
+
+struct codetag_range {
+   struct codetag *start;
+   struct codetag *stop;
+};
+
+struct codetag_module {
+   struct module *mod;
+   struct codetag_range range;
+};
+
+struct codetag_type_desc {
+   const char *section;
+   size_t tag_size;
+};
+
+struct codetag_iterator {
+   struct codetag_type *cttype;
+   struct codetag_module *cmod;
+   unsigned long mod_id;
+   struct codetag *ct;
+};
+
+#define CODE_TAG_INIT {\
+   .modname= KBUILD_MODNAME,   \
+   .function   = __func__, \
+   .filename   = __FILE__, \
+   .lineno = __LINE__, \
+   .flags  = 0,\
+}
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock);
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype);
+struct codetag *codetag_next_ct(struct codetag_iterator *iter);
+
+void codetag_to_text(struct seq_buf *out, struct codetag *ct);
+
+struct codetag_type *
+codetag_register_type(const struct codetag_type_desc *desc);
+
+#endif /* _LINUX_CODETAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index fa307f93fa2e..2acbef24e93e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -962,6 +962,10 @@ config DEBUG_STACKOVERFLOW
 
  If in doubt, say "N".
 
+config CODE_TAGGING
+   bool
+   select KALLSYMS
+
 source "lib/Kconfig.kasan"
 source "lib/Kconfig.kfence"
 source "lib/Kconfig.kmsan"
diff --git a/lib/Makefile b/lib/Makefile
index 740109b6e2c8..b50212b5b999 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -233,6 +233,7 @@ obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \
of-reconfig-notifier-error-inject.o
 obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
 
+obj-$(CONFIG_CODE_TAGGING) += codetag.o
 lib-$(CONFIG_GENERIC_BUG) += bug.o
 
 obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
diff --git a/lib/codetag.c b/lib/codetag.c
new file mode 100644
index 000000000000..7708f8388e55
--- /dev/null
+++ b/lib/codetag.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct codetag_type {
+   struct list_head link;
+   unsigned int count;
+   struct idr mod_idr;
+   struct rw_semaphore mod_lock; /* protects mod_idr */
+   struct codetag_type_desc desc;
+};
+
+static DEFINE_MUTEX(codetag_lock);
+static LIST_HEAD(codetag_types);
+
+void codetag_lock_module_list(struct codetag_type *cttype, bool lock)
+{
+   if (lock)
+   down_read(&cttype->mod_lock);
+   else
+   up_read(&cttype->mod_lock);
+}
+
+struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
+{
+   struct codetag_iterator iter = {
+   .cttype = cttype,
+   .cmod = NULL,
+   .mod_id = 0,
+   .ct = NULL,
+   };
+
+   return iter;
+}
+
+static inline struct codetag *get_first_module_ct(struct codetag_module *cmod)
+{
+   return cmod->range.start < cmod->range.stop ? cmod->range.start : NULL;
+}
+
+static inline
+struct codetag *get_next_module_ct(struct codetag_iterator *iter)
+{
+   struct codetag *res = (struct codetag *)
+   ((char *)iter->ct + iter->cttype->desc.tag_size);
+
+   return res < ite

[PATCH v2 13/39] lib: code tagging module support

2023-10-24 Thread Suren Baghdasaryan

Add support for code tagging from dynamically loaded modules.
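
The optional per-type hooks added here can be pictured with a small
userspace model. This is only a sketch under simplifying assumptions: there
is no locking and no IDR, and all names with a "sketch" prefix or suffix are
hypothetical rather than taken from the patch.

```c
#include <stddef.h>

/* Stand-in for a module's tag range: only the number of tags is modelled. */
struct range_sketch {
    size_t ntags;
};

/* Optional hooks, mirroring module_load/module_unload in the descriptor. */
struct type_desc_sketch {
    void (*module_load)(struct range_sketch *range);
    void (*module_unload)(struct range_sketch *range);
};

struct type_sketch {
    struct type_desc_sketch desc;
    size_t count; /* total tags across all loaded modules */
};

/* Counters so a caller can observe that the hooks actually ran. */
static int sketch_loads_seen;
static int sketch_unloads_seen;

void sketch_on_load(struct range_sketch *range)
{
    (void)range;
    sketch_loads_seen++;
}

void sketch_on_unload(struct range_sketch *range)
{
    (void)range;
    sketch_unloads_seen++;
}

/* Account for the module's tags, then notify the type if it asked for it. */
void sketch_load_module(struct type_sketch *type, struct range_sketch *range)
{
    type->count += range->ntags;
    if (type->desc.module_load)
        type->desc.module_load(range);
}

/* Notify first, then drop the module's tags from the total. */
void sketch_unload_module(struct type_sketch *type, struct range_sketch *range)
{
    if (type->desc.module_unload)
        type->desc.module_unload(range);
    type->count -= range->ntags;
}
```

A type that leaves both hooks NULL still gets its tag count maintained,
which matches the NULL checks in the patch.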

Signed-off-by: Suren Baghdasaryan 
Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
---
 include/linux/codetag.h | 12 +
 kernel/module/main.c|  4 +++
 lib/codetag.c   | 58 +++--
 3 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index a9d7adecc2a5..386733e89b31 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -42,6 +42,10 @@ struct codetag_module {
 struct codetag_type_desc {
const char *section;
size_t tag_size;
+   void (*module_load)(struct codetag_type *cttype,
+   struct codetag_module *cmod);
+   void (*module_unload)(struct codetag_type *cttype,
+ struct codetag_module *cmod);
 };
 
 struct codetag_iterator {
@@ -68,4 +72,12 @@ void codetag_to_text(struct seq_buf *out, struct codetag *ct);
 struct codetag_type *
 codetag_register_type(const struct codetag_type_desc *desc);
 
+#ifdef CONFIG_CODE_TAGGING
+void codetag_load_module(struct module *mod);
+void codetag_unload_module(struct module *mod);
+#else
+static inline void codetag_load_module(struct module *mod) {}
+static inline void codetag_unload_module(struct module *mod) {}
+#endif
+
 #endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..c0d3f562c7ab 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "internal.h"
@@ -1242,6 +1243,7 @@ static void free_module(struct module *mod)
 {
trace_module_free(mod);
 
+   codetag_unload_module(mod);
mod_sysfs_teardown(mod);
 
/*
@@ -2975,6 +2977,8 @@ static int load_module(struct load_info *info, const char __user *uargs,
/* Get rid of temporary copy. */
free_copy(info, flags);
 
+   codetag_load_module(mod);
+
/* Done! */
trace_module_load(mod);
 
diff --git a/lib/codetag.c b/lib/codetag.c
index 7708f8388e55..4ea57fb37346 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -108,15 +108,20 @@ static inline size_t range_size(const struct codetag_type *cttype,
 static void *get_symbol(struct module *mod, const char *prefix, const char *name)
 {
char buf[64];
+   void *ret;
int res;
 
res = snprintf(buf, sizeof(buf), "%s%s", prefix, name);
if (WARN_ON(res < 1 || res > sizeof(buf)))
return NULL;
 
-   return mod ?
+   preempt_disable();
+   ret = mod ?
(void *)find_kallsyms_symbol_value(mod, buf) :
(void *)kallsyms_lookup_name(buf);
+   preempt_enable();
+
+   return ret;
 }
 
 static struct codetag_range get_section_range(struct module *mod,
@@ -157,8 +162,11 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
 
down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
-   if (err >= 0)
+   if (err >= 0) {
cttype->count += range_size(cttype, &range);
+   if (cttype->desc.module_load)
+   cttype->desc.module_load(cttype, cmod);
+   }
up_write(&cttype->mod_lock);
 
if (err < 0) {
@@ -197,3 +205,49 @@ codetag_register_type(const struct codetag_type_desc *desc)
 
return cttype;
 }
+
+void codetag_load_module(struct module *mod)
+{
+   struct codetag_type *cttype;
+
+   if (!mod)
+   return;
+
+   mutex_lock(&codetag_lock);
+   list_for_each_entry(cttype, &codetag_types, link)
+   codetag_module_init(cttype, mod);
+   mutex_unlock(&codetag_lock);
+}
+
+void codetag_unload_module(struct module *mod)
+{
+   struct codetag_type *cttype;
+
+   if (!mod)
+   return;
+
+   mutex_lock(&codetag_lock);
+   list_for_each_entry(cttype, &codetag_types, link) {
+   struct codetag_module *found = NULL;
+   struct codetag_module *cmod;
+   unsigned long mod_id, tmp;
+
+   down_write(&cttype->mod_lock);
+   idr_for_each_entry_ul(&cttype->mod_idr, cmod, tmp, mod_id) {
+   if (cmod->mod && cmod->mod == mod) {
+   found = cmod;
+   break;
+   }
+   }
+   if (found) {
+   if (cttype->desc.module_unload)
+   cttype->desc.module_unload(cttype, cmod);
+
+   cttype->count -= range_size(cttype, &cmod->range);
+   idr_remove(&cttype->mod_idr, mod_id);
+   kfree(cmod);
+   }
+   up_write(&cttype->mod_lock);
+   }
+   mutex_unlock(&codetag_lock);
+}
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 14/39] lib: prevent module unloading if memory is not freed

2023-10-24 Thread Suren Baghdasaryan

Skip freeing a module's data section if any of its allocation tags still
have a non-zero count. Otherwise, once those allocations are freed,
accessing their code tags would cause a use-after-free (UAF).
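
The decision described above can be modelled in a few lines. This is a
hedged sketch, not the kernel code: the tag counter, the two-valued memory
kind and all function names are hypothetical simplifications.

```c
#include <stdbool.h>
#include <stddef.h>

/* One tag per allocation site; bytes is what is still allocated there. */
struct tag_sketch {
    long bytes;
};

/* Hypothetical reduction of the module memory types to two kinds. */
enum mem_kind_sketch {
    SKETCH_MEM_TEXT,
    SKETCH_MEM_DATA, /* the tags live here */
};

/* A module's tags may be dropped only if nothing it allocated is live. */
bool sketch_module_unused(const struct tag_sketch *tags, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tags[i].bytes != 0)
            return false;
    return true;
}

/*
 * Mirror of the early return added to module_memory_free(): keep the data
 * region when the tags could not be unloaded, free everything else.
 */
bool sketch_should_free(enum mem_kind_sketch kind, bool unload_codetags)
{
    if (!unload_codetags && kind == SKETCH_MEM_DATA)
        return false;
    return true;
}
```

The data region is deliberately leaked in the failing case; that trades a
bounded leak for avoiding the use-after-free.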

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/codetag.h |  6 +++---
 kernel/module/main.c| 23 +++
 lib/codetag.c   | 11 ---
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 386733e89b31..d98e4c8e86f0 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -44,7 +44,7 @@ struct codetag_type_desc {
size_t tag_size;
void (*module_load)(struct codetag_type *cttype,
struct codetag_module *cmod);
-   void (*module_unload)(struct codetag_type *cttype,
+   bool (*module_unload)(struct codetag_type *cttype,
  struct codetag_module *cmod);
 };
 
@@ -74,10 +74,10 @@ codetag_register_type(const struct codetag_type_desc *desc);
 
 #ifdef CONFIG_CODE_TAGGING
 void codetag_load_module(struct module *mod);
-void codetag_unload_module(struct module *mod);
+bool codetag_unload_module(struct module *mod);
 #else
 static inline void codetag_load_module(struct module *mod) {}
-static inline void codetag_unload_module(struct module *mod) {}
+static inline bool codetag_unload_module(struct module *mod) { return true; }
 #endif
 
 #endif /* _LINUX_CODETAG_H */
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c0d3f562c7ab..079f40792ce8 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1211,15 +1211,19 @@ static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
return module_alloc(size);
 }
 
-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(void *ptr, enum mod_mem_type type,
+  bool unload_codetags)
 {
+   if (!unload_codetags && mod_mem_type_is_core_data(type))
+   return;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
module_memfree(ptr);
 }
 
-static void free_mod_mem(struct module *mod)
+static void free_mod_mem(struct module *mod, bool unload_codetags)
 {
for_each_mod_mem_type(type) {
struct module_memory *mod_mem = &mod->mem[type];
@@ -1230,20 +1234,23 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
-   module_memory_free(mod_mem->base, type);
+   module_memory_free(mod_mem->base, type,
+  unload_codetags);
}
 
/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size);
-   module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+   module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA, unload_codetags);
 }
 
 /* Free a module, remove from lists, etc. */
 static void free_module(struct module *mod)
 {
+   bool unload_codetags;
+
trace_module_free(mod);
 
-   codetag_unload_module(mod);
+   unload_codetags = codetag_unload_module(mod);
mod_sysfs_teardown(mod);
 
/*
@@ -1285,7 +1292,7 @@ static void free_module(struct module *mod)
kfree(mod->args);
percpu_modfree(mod);
 
-   free_mod_mem(mod);
+   free_mod_mem(mod, unload_codetags);
 }
 
 void *__symbol_get(const char *symbol)
@@ -2295,7 +2302,7 @@ static int move_module(struct module *mod, struct load_info *info)
return 0;
 out_enomem:
for (t--; t >= 0; t--)
-   module_memory_free(mod->mem[t].base, t);
+   module_memory_free(mod->mem[t].base, t, true);
return ret;
 }
 
@@ -2425,7 +2432,7 @@ static void module_deallocate(struct module *mod, struct load_info *info)
percpu_modfree(mod);
module_arch_freeing_init(mod);
 
-   free_mod_mem(mod);
+   free_mod_mem(mod, true);
 }
 
 int __weak module_finalize(const Elf_Ehdr *hdr,
diff --git a/lib/codetag.c b/lib/codetag.c
index 4ea57fb37346..0ad4ea66c769 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct codetag_type {
struct list_head link;
@@ -219,12 +220,13 @@ void codetag_load_module(struct module *mod)
mutex_unlock(&codetag_lock);
 }
 
-void codetag_unload_module(struct module *mod)
+bool codetag_unload_module(struct module *mod)
 {
struct codetag_type *cttype;
+   bool unload_ok = true;
 
if (!mod)
-   return;
+   return true;
 
mutex_lock(&codetag_lock);
list_for_each_entry(cttype, &codetag_types, link) {
@@ -241,7 +243,8 @@ void codetag

[PATCH v2 15/39] lib: add allocation tagging support for memory allocation profiling

2023-10-24 Thread Suren Baghdasaryan

Introduce CONFIG_MEM_ALLOC_PROFILING, which provides definitions to easily
instrument memory allocators. It registers an "alloc_tags" codetag type
with a /proc/allocinfo interface to output allocation tag information when
the feature is enabled.
CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.
Memory allocation profiling can be enabled or disabled at runtime using
/proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
profiling by default.
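
For tooling that consumes /proc/allocinfo, a line can be split into its
fields as sketched below. The layout is assumed from the example output in
the proc.rst hunk of this patch ("<size> <file>:<line> module:<mod>
func:<func>") and may differ in other revisions; the struct and function
names are hypothetical.

```c
#include <stdio.h>

/* Fields of one /proc/allocinfo line, as laid out in the example output. */
struct allocinfo_line {
    char size[16];      /* human-readable size, e.g. "153MiB" */
    char file[128];     /* source file of the allocation site */
    unsigned int line;  /* line number within that file */
    char module[64];
    char func[64];
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
int parse_allocinfo_line(const char *s, struct allocinfo_line *out)
{
    int n = sscanf(s, " %15s %127[^:]:%u module:%63s func:%63s",
                   out->size, out->file, &out->line,
                   out->module, out->func);

    return n == 5 ? 0 : -1;
}
```

The size is kept as text on purpose: the example shows unit-suffixed values,
and converting them would require assumptions about rounding that the patch
does not state.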

Co-developed-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Kent Overstreet 
---
 Documentation/admin-guide/sysctl/vm.rst |  16 +++
 Documentation/filesystems/proc.rst  |  28 +
 include/asm-generic/codetag.lds.h   |  14 +++
 include/asm-generic/vmlinux.lds.h   |   3 +
 include/linux/alloc_tag.h   | 133 
 include/linux/sched.h   |  24 
 lib/Kconfig.debug   |  25 
 lib/Makefile|   2 +
 lib/alloc_tag.c | 158 
 scripts/module.lds.S|   7 ++
 10 files changed, 410 insertions(+)
 create mode 100644 include/asm-generic/codetag.lds.h
 create mode 100644 include/linux/alloc_tag.h
 create mode 100644 lib/alloc_tag.c

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 45ba1f4dc004..0a012ac13a38 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -43,6 +43,7 @@ Currently, these files are in /proc/sys/vm:
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
+- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
 - memory_failure_early_kill
 - memory_failure_recovery
 - min_free_kbytes
@@ -425,6 +426,21 @@ e.g., up to one or two maps per allocation.
 The default value is 65530.
 
 
+mem_profiling
+=============
+
+Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
+
+1: Enable memory profiling.
+
+0: Disable memory profiling.
+
+Enabling memory profiling introduces a small performance overhead for all
+memory allocations.
+
+The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
+
+
 memory_failure_early_kill:
 ==
 
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2b59cff8be17..41a7e6d95fe8 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -688,6 +688,7 @@ files are there, and which are missing.
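  ============ ===============================================================
  File         Content
  ============ ===============================================================
+ allocinfo    Memory allocations profiling information
  apm          Advanced power management info
  buddyinfo    Kernel memory allocator information (see text)   (2.5)
  bus          Directory containing bus specific information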
@@ -947,6 +948,33 @@ also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
 
+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module and the function calling the allocation. The number of bytes
+allocated at each location is reported.
+
+Example output.
+
+::
+
+> cat /proc/allocinfo
+
+  153MiB mm/slub.c:1826 module:slub func:alloc_slab_page
+ 6.08MiB mm/slab_common.c:950 module:slab_common func:_kmalloc_order
+ 5.09MiB mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
+ 4.54MiB mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
+ 1.32MiB include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
+ 1.16MiB fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
+ 1.00MiB mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
+  734KiB fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
+  640KiB kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
+  640KiB drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf
+  ...
+
+
 meminfo
 ~~~
 
diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
new file mode 100644
index 000000000000..64f536b80380
--- /dev/null
+++ b/include/asm-generic/codetag.lds.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __ASM_GENERIC_CODETAG_LDS_H
+#define __ASM_GENERIC_CODETAG_LDS_H
+
+#define SECTION_WITH_BOUNDARIES(_name) \
+   . = ALIGN(8);   \
+   __start_##_name = .;\
+   KEEP(*(_name))  \
+   __stop_##_name = .;
+
+#define CODETAG_SECTIONS() \
+   SECTION_WITH_BOUNDARI

[PATCH v2 16/39] lib: introduce support for page allocation tagging

2023-10-24 Thread Suren Baghdasaryan

Introduce helper functions to easily instrument page allocators by
storing a pointer to the allocation tag associated with the code that
allocated the page in a page_ext field.
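
The reference lives at a fixed byte offset inside each page's extension
blob, so converting between the two pointers is plain pointer arithmetic.
The sketch below models that round trip in userspace; the struct layouts
and names are hypothetical, and the offset is a compile-time constant here
whereas the kernel takes it from page_alloc_tagging_ops.offset.

```c
#include <stddef.h>

/* Hypothetical header of the per-page extension blob. */
struct page_ext_sketch {
    unsigned long flags;
};

/* Stand-in for union codetag_ref: a single pointer to the tag. */
union codetag_ref_sketch {
    void *ct;
};

/* Hypothetical blob layout: header first, tag reference after it. */
struct page_blob_sketch {
    struct page_ext_sketch ext;
    union codetag_ref_sketch ref;
};

/* Fixed here; filled in by the page_ext core in the kernel. */
static const size_t sketch_offset = offsetof(struct page_blob_sketch, ref);

/* Step forward from the blob header to the tag reference. */
union codetag_ref_sketch *sketch_ref_from_ext(struct page_ext_sketch *ext)
{
    return (union codetag_ref_sketch *)((char *)ext + sketch_offset);
}

/* Step back from the tag reference to the blob header. */
struct page_ext_sketch *sketch_ext_from_ref(union codetag_ref_sketch *ref)
{
    return (struct page_ext_sketch *)((char *)ref - sketch_offset);
}
```

Being able to go back from the reference to the blob is what lets
put_page_tag_ref() release the page_ext without carrying a second pointer.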

Signed-off-by: Suren Baghdasaryan 
Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
---
 include/linux/page_ext.h|  1 -
 include/linux/pgalloc_tag.h | 73 +
 lib/Kconfig.debug   |  1 +
 lib/alloc_tag.c | 17 +
 mm/mm_init.c|  1 +
 mm/page_alloc.c |  4 ++
 mm/page_ext.c   |  4 ++
 7 files changed, 100 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pgalloc_tag.h

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index be98564191e6..07e0656898f9 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -4,7 +4,6 @@
 
 #include 
 #include 
-#include 
 
 struct pglist_data;
 
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
new file mode 100644
index 000000000000..a060c26eb449
--- /dev/null
+++ b/include/linux/pgalloc_tag.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * page allocation tagging
+ */
+#ifndef _LINUX_PGALLOC_TAG_H
+#define _LINUX_PGALLOC_TAG_H
+
+#include 
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+#include 
+
+extern struct page_ext_operations page_alloc_tagging_ops;
+extern struct page_ext *page_ext_get(struct page *page);
+extern void page_ext_put(struct page_ext *page_ext);
+
+static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+{
+   return (void *)page_ext + page_alloc_tagging_ops.offset;
+}
+
+static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+{
+   return (void *)ref - page_alloc_tagging_ops.offset;
+}
+
+static inline union codetag_ref *get_page_tag_ref(struct page *page)
+{
+   if (page && mem_alloc_profiling_enabled()) {
+   struct page_ext *page_ext = page_ext_get(page);
+
+   if (page_ext)
+   return codetag_ref_from_page_ext(page_ext);
+   }
+   return NULL;
+}
+
+static inline void put_page_tag_ref(union codetag_ref *ref)
+{
+   page_ext_put(page_ext_from_codetag_ref(ref));
+}
+
+static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
+  unsigned int order)
+{
+   union codetag_ref *ref = get_page_tag_ref(page);
+
+   if (ref) {
+   alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE << order);
+   put_page_tag_ref(ref);
+   }
+}
+
+static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
+{
+   union codetag_ref *ref = get_page_tag_ref(page);
+
+   if (ref) {
+   alloc_tag_sub(ref, PAGE_SIZE << order);
+   put_page_tag_ref(ref);
+   }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING */
+
+static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
+  unsigned int order) {}
+static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
+#endif /* _LINUX_PGALLOC_TAG_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 475a14e70566..e1eda1450d68 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -972,6 +972,7 @@ config MEM_ALLOC_PROFILING
depends on PROC_FS
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
+   select PAGE_EXTENSION
help
  Track allocation source code and record total allocation size
  initiated at that code location. The mechanism can be used to track
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 4fc031f9cefd..2d5226d9262d 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -124,6 +125,22 @@ static bool alloc_tag_module_unload(struct codetag_type *cttype,
return module_unused;
 }
 
+static __init bool need_page_alloc_tagging(void)
+{
+   return true;
+}
+
+static __init void init_page_alloc_tagging(void)
+{
+}
+
+struct page_ext_operations page_alloc_tagging_ops = {
+   .size = sizeof(union codetag_ref),
+   .need = need_page_alloc_tagging,
+   .init = init_page_alloc_tagging,
+};
+EXPORT_SYMBOL(page_alloc_tagging_ops);
+
 static struct ctl_table memory_allocation_profiling_sysctls[] = {
{
.procname   = "mem_profiling",
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 50f2f34745af..8e72e431dc35 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "internal.h"
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..d490d0f73e72 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 #include

[PATCH v2 17/39] change alloc_pages name in dma_map_ops to avoid name conflicts

2023-10-24 Thread Suren Baghdasaryan

After alloc_pages is redefined as a macro, the preprocessor replaces every
call-style use of that name. Rename the conflicting struct members so that
they are not replaced unintentionally.
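
The conflict is easy to reproduce outside the kernel. In the sketch below,
a function-like macro named alloc_pages stands in for the redefined
allocator; the "_sketch" names are hypothetical. A function-like macro only
expands when the name is followed by "(", so a member declaration or a
designated initializer would survive, but a call through the member would
not.

```c
#include <stddef.h>

/* Hypothetical un-instrumented allocator behind the macro. */
int alloc_pages_noprof_sketch(int order)
{
    return 1 << order;
}

/* Function-like macro standing in for the redefined alloc_pages. */
#define alloc_pages(...) alloc_pages_noprof_sketch(__VA_ARGS__)

struct ops_sketch {
    /*
     * Had this member been named alloc_pages, the call
     * "ops->alloc_pages(order)" below would be rewritten to
     * "ops->alloc_pages_noprof_sketch(order)" and fail to compile.
     */
    int (*alloc_pages_op)(int order);
};

/* A backend implementation, distinguishable from the macro's target. */
int backend_alloc_sketch(int order)
{
    return 100 + order;
}

/* Calls through the renamed member; the macro does not interfere. */
int call_through_ops(const struct ops_sketch *ops, int order)
{
    return ops->alloc_pages_op(order);
}
```

Renaming the member is the smallest change that keeps both the macro and
the ops table usable in the same translation unit.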

Signed-off-by: Suren Baghdasaryan 
---
 arch/x86/kernel/amd_gart_64.c | 2 +-
 drivers/iommu/dma-iommu.c | 2 +-
 drivers/xen/grant-dma-ops.c   | 2 +-
 drivers/xen/swiotlb-xen.c | 2 +-
 include/linux/dma-map-ops.h   | 2 +-
 kernel/dma/mapping.c  | 4 ++--
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 56a917df410d..842a0ec5eaa9 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -676,7 +676,7 @@ static const struct dma_map_ops gart_dma_ops = {
.get_sgtable= dma_common_get_sgtable,
.dma_supported  = dma_direct_supported,
.get_required_mask  = dma_direct_get_required_mask,
-   .alloc_pages= dma_direct_alloc_pages,
+   .alloc_pages_op = dma_direct_alloc_pages,
.free_pages = dma_direct_free_pages,
 };
 
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 4b1a88f514c9..28b7b2d10655 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1603,7 +1603,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.flags  = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc  = iommu_dma_alloc,
.free   = iommu_dma_free,
-   .alloc_pages= dma_common_alloc_pages,
+   .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
.alloc_noncontiguous= iommu_dma_alloc_noncontiguous,
.free_noncontiguous = iommu_dma_free_noncontiguous,
diff --git a/drivers/xen/grant-dma-ops.c b/drivers/xen/grant-dma-ops.c
index 76f6f26265a3..29257d2639db 100644
--- a/drivers/xen/grant-dma-ops.c
+++ b/drivers/xen/grant-dma-ops.c
@@ -282,7 +282,7 @@ static int xen_grant_dma_supported(struct device *dev, u64 mask)
 static const struct dma_map_ops xen_grant_dma_ops = {
.alloc = xen_grant_dma_alloc,
.free = xen_grant_dma_free,
-   .alloc_pages = xen_grant_dma_alloc_pages,
+   .alloc_pages_op = xen_grant_dma_alloc_pages,
.free_pages = xen_grant_dma_free_pages,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 946bd56f0ac5..4f1e3f1fc44e 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -403,6 +403,6 @@ const struct dma_map_ops xen_swiotlb_dma_ops = {
.dma_supported = xen_swiotlb_dma_supported,
.mmap = dma_common_mmap,
.get_sgtable = dma_common_get_sgtable,
-   .alloc_pages = dma_common_alloc_pages,
+   .alloc_pages_op = dma_common_alloc_pages,
.free_pages = dma_common_free_pages,
 };
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index f2fc203fb8a1..3a8a015fdd2e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -28,7 +28,7 @@ struct dma_map_ops {
unsigned long attrs);
void (*free)(struct device *dev, size_t size, void *vaddr,
dma_addr_t dma_handle, unsigned long attrs);
-   struct page *(*alloc_pages)(struct device *dev, size_t size,
+   struct page *(*alloc_pages_op)(struct device *dev, size_t size,
dma_addr_t *dma_handle, enum dma_data_direction dir,
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index e323ca48f7f2..58e490e2cfb4 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -570,9 +570,9 @@ static struct page *__dma_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
if (dma_alloc_direct(dev, ops))
return dma_direct_alloc_pages(dev, size, dma_handle, dir, gfp);
-   if (!ops->alloc_pages)
+   if (!ops->alloc_pages_op)
return NULL;
-   return ops->alloc_pages(dev, size, dma_handle, dir, gfp);
+   return ops->alloc_pages_op(dev, size, dma_handle, dir, gfp);
 }
 
 struct page *dma_alloc_pages(struct device *dev, size_t size,
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 18/39] change alloc_pages name in ivpu_bo_ops to avoid conflicts

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

After alloc_pages is redefined as a macro, the preprocessor replaces every
call-style use of that name. Rename the conflicting struct members so that
they are not replaced unintentionally.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 drivers/accel/ivpu/ivpu_gem.c | 8 
 drivers/accel/ivpu/ivpu_gem.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/ivpu/ivpu_gem.c b/drivers/accel/ivpu/ivpu_gem.c
index d09f13b35902..d324eaf5bbe3 100644
--- a/drivers/accel/ivpu/ivpu_gem.c
+++ b/drivers/accel/ivpu/ivpu_gem.c
@@ -61,7 +61,7 @@ static void prime_unmap_pages_locked(struct ivpu_bo *bo)
 static const struct ivpu_bo_ops prime_ops = {
.type = IVPU_BO_TYPE_PRIME,
.name = "prime",
-   .alloc_pages = prime_alloc_pages_locked,
+   .alloc_pages_op = prime_alloc_pages_locked,
.free_pages = prime_free_pages_locked,
.map_pages = prime_map_pages_locked,
.unmap_pages = prime_unmap_pages_locked,
@@ -134,7 +134,7 @@ static void ivpu_bo_unmap_pages_locked(struct ivpu_bo *bo)
 static const struct ivpu_bo_ops shmem_ops = {
.type = IVPU_BO_TYPE_SHMEM,
.name = "shmem",
-   .alloc_pages = shmem_alloc_pages_locked,
+   .alloc_pages_op = shmem_alloc_pages_locked,
.free_pages = shmem_free_pages_locked,
.map_pages = ivpu_bo_map_pages_locked,
.unmap_pages = ivpu_bo_unmap_pages_locked,
@@ -186,7 +186,7 @@ static void internal_free_pages_locked(struct ivpu_bo *bo)
 static const struct ivpu_bo_ops internal_ops = {
.type = IVPU_BO_TYPE_INTERNAL,
.name = "internal",
-   .alloc_pages = internal_alloc_pages_locked,
+   .alloc_pages_op = internal_alloc_pages_locked,
.free_pages = internal_free_pages_locked,
.map_pages = ivpu_bo_map_pages_locked,
.unmap_pages = ivpu_bo_unmap_pages_locked,
@@ -200,7 +200,7 @@ static int __must_check ivpu_bo_alloc_and_map_pages_locked(struct ivpu_bo *bo)
lockdep_assert_held(&bo->lock);
drm_WARN_ON(&vdev->drm, bo->sgt);
 
-   ret = bo->ops->alloc_pages(bo);
+   ret = bo->ops->alloc_pages_op(bo);
if (ret) {
ivpu_err(vdev, "Failed to allocate pages for BO: %d", ret);
return ret;
diff --git a/drivers/accel/ivpu/ivpu_gem.h b/drivers/accel/ivpu/ivpu_gem.h
index 6b0ceda5f253..b81cf2af0b2d 100644
--- a/drivers/accel/ivpu/ivpu_gem.h
+++ b/drivers/accel/ivpu/ivpu_gem.h
@@ -42,7 +42,7 @@ enum ivpu_bo_type {
 struct ivpu_bo_ops {
enum ivpu_bo_type type;
const char *name;
-   int (*alloc_pages)(struct ivpu_bo *bo);
+   int (*alloc_pages_op)(struct ivpu_bo *bo);
void (*free_pages)(struct ivpu_bo *bo);
int (*map_pages)(struct ivpu_bo *bo);
void (*unmap_pages)(struct ivpu_bo *bo);
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 20/39] mm: create new codetag references during page splitting

2023-10-24 Thread Suren Baghdasaryan

When a high-order page is split into smaller ones, each newly split
page should get its own codetag reference. The original codetag is reused
for these pages, but each is recorded as a 0-byte allocation because the
original codetag already accounts for the whole high-order page.
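
A small model shows why the 0-byte references keep the byte count balanced.
This is a sketch under assumptions: the tag has a plain byte counter, the
"refs" counter exists only so the example can be checked, and all names are
hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Per-site tag; refs is a hypothetical counter added for illustration. */
struct alloc_tag_sketch {
    unsigned long bytes;
    unsigned long refs;
};

/* Per-page reference to the tag that allocated it. */
struct page_sketch {
    struct alloc_tag_sketch *tag;
};

/* Point the page at the tag and charge the given number of bytes. */
void sketch_tag_add(struct page_sketch *page, struct alloc_tag_sketch *tag,
                    unsigned long bytes)
{
    page->tag = tag;
    tag->bytes += bytes;
    tag->refs++;
}

/* Uncharge the bytes and clear the page's reference. */
void sketch_tag_sub(struct page_sketch *page, unsigned long bytes)
{
    if (!page->tag)
        return;
    page->tag->bytes -= bytes;
    page->tag->refs--;
    page->tag = NULL;
}

/* Give each tail page a reference to the head's tag, charging 0 bytes. */
void sketch_tag_split(struct page_sketch *head, unsigned int nr)
{
    if (!head->tag)
        return;
    for (unsigned int i = 1; i < nr; i++)
        sketch_tag_add(&head[i], head->tag, 0);
}
```

After the split, each page can be freed on its own at order 0 and subtract
one page worth of bytes, and the total returns to zero once all of them are
gone.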

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/pgalloc_tag.h | 30 ++
 mm/huge_memory.c|  2 ++
 mm/page_alloc.c |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index a060c26eb449..0174aff5e871 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -62,11 +62,41 @@ static inline void pgalloc_tag_sub(struct page *page, unsigned int order)
}
 }
 
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
+{
+   int i;
+   struct page_ext *page_ext;
+   union codetag_ref *ref;
+   struct alloc_tag *tag;
+
+   if (!mem_alloc_profiling_enabled())
+   return;
+
+   page_ext = page_ext_get(page);
+   if (unlikely(!page_ext))
+   return;
+
+   ref = codetag_ref_from_page_ext(page_ext);
+   if (!ref->ct)
+   goto out;
+
+   tag = ct_to_alloc_tag(ref->ct);
+   page_ext = page_ext_next(page_ext);
+   for (i = 1; i < nr; i++) {
+   /* New reference with 0 bytes accounted */
+   alloc_tag_add(codetag_ref_from_page_ext(page_ext), tag, 0);
+   page_ext = page_ext_next(page_ext);
+   }
+out:
+   page_ext_put(page_ext);
+}
+
 #else /* CONFIG_MEM_ALLOC_PROFILING */
 
 static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
   unsigned int order) {}
 static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
+static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}
 
 #endif /* CONFIG_MEM_ALLOC_PROFILING */
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 064fbd90822b..392b6907d875 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2545,6 +2546,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* Caller disabled irqs, so they are still disabled here */
 
split_page_owner(head, nr);
+   pgalloc_tag_split(head, nr);
 
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63dc2f8c7901..c4f0cd127e14 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2540,6 +2540,7 @@ void split_page(struct page *page, unsigned int order)
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
split_page_owner(page, 1 << order);
+   pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
 }
 EXPORT_SYMBOL_GPL(split_page);
@@ -4669,6 +4670,7 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order,
struct page *last = page + nr;
 
split_page_owner(page, 1 << order);
+   pgalloc_tag_split(page, 1 << order);
split_page_memcg(page, 1 << order);
while (page < --last)
set_page_refcounted(last);
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 19/39] mm: enable page allocation tagging

2023-10-24 Thread Suren Baghdasaryan

Redefine page allocators to record allocation tags upon their invocation.
Instrument post_alloc_hook and free_pages_prepare to modify the current
allocation tag.
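
As a rough userspace illustration of the wrapper pattern this patch relies
on (not the kernel implementation), the sketch below shows a macro that
installs a per-call-site tag, runs a "_noprof" allocator, and restores the
previous tag. All demo_* names are hypothetical, the single global
demo_current_tag stands in for current->alloc_tag, and GNU C statement
expressions (gcc/clang) are assumed. Unlike the kernel code, the sketch
does not subtract bytes on free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-call-site allocation tag. */
struct demo_alloc_tag {
	const char *file;
	int line;
	size_t bytes;	/* bytes charged to this call site so far */
};

/* Stand-in for current->alloc_tag (single-threaded sketch). */
struct demo_alloc_tag *demo_current_tag;
/* Most recently installed site tag, kept only so the demo can inspect it. */
struct demo_alloc_tag *demo_last_site_tag;

/* "_noprof" allocator: charges whichever tag is installed, if any. */
void *demo_alloc_noprof(size_t size)
{
	void *p = malloc(size);

	if (p && demo_current_tag)
		demo_current_tag->bytes += size;
	return p;
}

/*
 * One static tag per expansion site: save the old tag, install this
 * site's tag, evaluate the allocation, then restore the old tag.
 */
#define demo_alloc_hooks(_do_alloc)					\
({									\
	static struct demo_alloc_tag _tag = { __FILE__, __LINE__, 0 };	\
	struct demo_alloc_tag *_old = demo_current_tag;			\
	__typeof__(_do_alloc) _res;					\
									\
	demo_current_tag = demo_last_site_tag = &_tag;			\
	_res = _do_alloc;						\
	demo_current_tag = _old;					\
	_res;								\
})

#define demo_alloc(...) demo_alloc_hooks(demo_alloc_noprof(__VA_ARGS__))

/* Usage: returns the bytes charged to the call site inside this function. */
size_t demo_alloc_example(void)
{
	void *p = demo_alloc(64);	/* charged to this line's tag */
	size_t charged = demo_last_site_tag->bytes;

	assert(demo_current_tag == NULL);	/* previous tag was restored */
	free(p);
	return charged;
}
```

Because the tag is a static object created where the macro expands, the
public name has to be a macro rather than a function for the accounting to
land on the caller's line.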

Co-developed-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Kent Overstreet 
---
 include/linux/alloc_tag.h |  10
 include/linux/gfp.h       | 111 +++---
 include/linux/pagemap.h   |   9 ++--
 mm/compaction.c           |   7 ++-
 mm/filemap.c              |   6 +--
 mm/mempolicy.c            |  42 +++
 mm/page_alloc.c           |  60 ++---
 7 files changed, 144 insertions(+), 101 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index cf55a149fa84..6fa8a94d8bc1 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -130,4 +130,14 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
 
 #endif
 
+#define alloc_hooks(_do_alloc) \
+({ \
+   typeof(_do_alloc) _res; \
+   DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+   \
+   _res = _do_alloc;   \
+   alloc_tag_restore(&_alloc_tag, _old);   \
+   _res;   \
+})
+
 #endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..20686fd1f417 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -6,6 +6,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 struct vm_area_struct;
 
@@ -174,42 +176,43 @@ static inline void arch_free_page(struct page *page, int order) { }
 static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
-struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
-struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
+#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
+
+struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__))
 
-unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
+#define __alloc_pages_bulk(...) alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
 
-unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
+#define alloc_pages_bulk_array_mempolicy(...) alloc_hooks(alloc_pages_bulk_array_mempolicy_noprof(__VA_ARGS__))
 
 /* Bulk allocate order-0 pages */
-static inline unsigned long
-alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
-{
-   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
-}
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list)  \
+   __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
 
-static inline unsigned long
-alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
-{
-   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
-}
+#define alloc_pages_bulk_array(_gfp, _nr_pages, _page_array)   \
+   __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
 
 static inline unsigned long
-alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
+alloc_pages_bulk_array_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
 {
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
 
-   return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
+   return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
 }
 
+#define alloc_pages_bulk_array_node(...) alloc_hooks(alloc_pages_bulk_array_node_noprof(__VA_ARGS__))
+
 static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
 {
gfp_t warn_gfp = gfp_mask & (__GFP_THISNODE|__GFP_NOWARN);
@@ -229,21 +232,23 @@ static inline void warn_if_node_offline(int this_node, gfp_t gfp_mask)
  * online. For more general interf

[PATCH v2 21/39] mm/page_ext: enable early_page_ext when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y

2023-10-24 Thread Suren Baghdasaryan

For all page allocations to be tagged, page_ext has to be initialized
before the first page allocation. Early tasks allocate their stacks
using the page allocator before alloc_node_page_ext() initializes the
page_ext area, unless early_page_ext is enabled. Therefore these
allocations will generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG
is enabled. Enable early_page_ext whenever
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to ensure page_ext initialization
prior to any page allocation. This has all the negative effects
associated with early_page_ext, such as a possibly longer boot time,
so we enable it only when debugging with
CONFIG_MEM_ALLOC_PROFILING_DEBUG enabled and not universally for
CONFIG_MEM_ALLOC_PROFILING.

Signed-off-by: Suren Baghdasaryan 
---
 mm/page_ext.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 3c58fe8a24df..e7d8f1a5589e 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -95,7 +95,16 @@ unsigned long page_ext_size;
 
 static unsigned long total_usage;
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+/*
+ * To ensure correct allocation tagging for pages, page_ext should be available
+ * before the first page allocation. Otherwise early task stacks will be
+ * allocated before page_ext initialization and missing tags will be flagged.
+ */
+bool early_page_ext __meminitdata = true;
+#else
 bool early_page_ext __meminitdata;
+#endif
 static int __init setup_early_page_ext(char *str)
 {
early_page_ext = true;
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 22/39] lib: add codetag reference into slabobj_ext

2023-10-24 Thread Suren Baghdasaryan

To store a code tag for every slab object, a codetag reference is embedded
into slabobj_ext when CONFIG_MEM_ALLOC_PROFILING=y.

Signed-off-by: Suren Baghdasaryan 
Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
---
 include/linux/memcontrol.h | 5 +
 lib/Kconfig.debug          | 1 +
 mm/slab.h                  | 4
 3 files changed, 10 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f3ede28b6fa6..853a24b5f713 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1613,7 +1613,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
  * if MEMCG_DATA_OBJEXTS is set.
  */
 struct slabobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *objcg;
+#endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+   union codetag_ref ref;
+#endif
 } __aligned(8);
 
 static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e1eda1450d68..482a6aae7664 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -973,6 +973,7 @@ config MEM_ALLOC_PROFILING
depends on !DEBUG_FORCE_WEAK_PER_CPU
select CODE_TAGGING
select PAGE_EXTENSION
+   select SLAB_OBJ_EXT
help
  Track allocation source code and record total allocation size
  initiated at that code location. The mechanism can be used to track
diff --git a/mm/slab.h b/mm/slab.h
index 60417fd262ea..293210ed10a9 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -457,6 +457,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 
 static inline bool need_slab_obj_ext(void)
 {
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+   if (mem_alloc_profiling_enabled())
+   return true;
+#endif
/*
 * CONFIG_MEMCG_KMEM creates vector of obj_cgroup objects conditionally
 * inside memcg_slab_post_alloc_hook. No other users for now.
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 23/39] mm/slab: add allocation accounting into slab allocation and free paths

2023-10-24 Thread Suren Baghdasaryan

Account slab allocations using the codetag reference embedded into slabobj_ext.
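
The accounting idea can be sketched in userspace as follows: each slab
carries an optional vector with one extension entry per object, the
object's index selects the entry, and the entry remembers which tag was
charged so the free path can uncharge it. This is a simplified sketch
under assumptions, not the kernel code: all demo_* names are hypothetical,
the index is computed with a plain division rather than
reciprocal_divide(), and there is no locking or per-CPU counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-call-site counters. */
struct demo_tag {
	size_t bytes;
	size_t calls;
};

/* Per-object extension: remembers the tag the object was charged to. */
struct demo_obj_ext {
	struct demo_tag *tag;
};

struct demo_slab {
	char *base;			/* address of object 0 */
	size_t obj_size;		/* size of every object in the slab */
	struct demo_obj_ext *exts;	/* one entry per object, may be NULL */
};

/* Map an object address to its index within the slab. */
size_t demo_obj_to_index(const struct demo_slab *slab, const void *obj)
{
	assert(slab->obj_size != 0);
	return (size_t)((const char *)obj - slab->base) / slab->obj_size;
}

/* Allocation path: record the tag in the object's entry and charge it. */
void demo_tag_add(struct demo_slab *slab, const void *obj, struct demo_tag *tag)
{
	if (!slab->exts || !tag)
		return;		/* extension vector may be absent */
	slab->exts[demo_obj_to_index(slab, obj)].tag = tag;
	tag->bytes += slab->obj_size;
	tag->calls++;
}

/* Free path: look up the recorded tag and uncharge it. */
void demo_tag_sub(struct demo_slab *slab, const void *obj)
{
	struct demo_obj_ext *ext;

	if (!slab->exts)
		return;
	ext = &slab->exts[demo_obj_to_index(slab, obj)];
	if (!ext->tag)
		return;		/* object was never charged */
	ext->tag->bytes -= slab->obj_size;
	ext->tag->calls--;
	ext->tag = NULL;
}
```

Storing the reference per object is what lets the free path find the
right counters without knowing which call site made the allocation.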

Signed-off-by: Suren Baghdasaryan 
Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
---
 include/linux/slab_def.h |  2 +-
 include/linux/slub_def.h |  4 ++--
 mm/slab.c                |  4 +++-
 mm/slab.h                | 32
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index a61e7d55d0d3..23f14dcb8d5b 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -107,7 +107,7 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla
  *   reciprocal_divide(offset, cache->reciprocal_buffer_size)
  */
 static inline unsigned int obj_to_index(const struct kmem_cache *cache,
-   const struct slab *slab, void *obj)
+   const struct slab *slab, const void *obj)
 {
u32 offset = (obj - slab->s_mem);
return reciprocal_divide(offset, cache->reciprocal_buffer_size);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index deb90cf4bffb..43fda4a5f23a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -182,14 +182,14 @@ static inline void *nearest_obj(struct kmem_cache *cache, const struct slab *sla
 
 /* Determine object index from a given position */
 static inline unsigned int __obj_to_index(const struct kmem_cache *cache,
- void *addr, void *obj)
+ void *addr, const void *obj)
 {
return reciprocal_divide(kasan_reset_tag(obj) - addr,
 cache->reciprocal_size);
 }
 
 static inline unsigned int obj_to_index(const struct kmem_cache *cache,
-   const struct slab *slab, void *obj)
+   const struct slab *slab, const void *obj)
 {
if (is_kfence_address(obj))
return 0;
diff --git a/mm/slab.c b/mm/slab.c
index cefcb7499b6c..18923f5f05b5 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3348,9 +3348,11 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
 static __always_inline void __cache_free(struct kmem_cache *cachep, void *objp,
 unsigned long caller)
 {
+   struct slab *slab = virt_to_slab(objp);
bool init;
 
-   memcg_slab_free_hook(cachep, virt_to_slab(objp), &objp, 1);
+   memcg_slab_free_hook(cachep, slab, &objp, 1);
+   alloc_tagging_slab_free_hook(cachep, slab, &objp, 1);
 
if (is_kfence_address(objp)) {
kmemleak_free_recursive(objp, cachep->flags);
diff --git a/mm/slab.h b/mm/slab.h
index 293210ed10a9..4859ce1f8808 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -533,6 +533,32 @@ prepare_slab_obj_exts_hook(struct kmem_cache *s, gfp_t flags, void *p)
 
 #endif /* CONFIG_SLAB_OBJ_EXT */
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+   void **p, int objects)
+{
+   struct slabobj_ext *obj_exts;
+   int i;
+
+   obj_exts = slab_obj_exts(slab);
+   if (!obj_exts)
+   return;
+
+   for (i = 0; i < objects; i++) {
+   unsigned int off = obj_to_index(s, slab, p[i]);
+
+   alloc_tag_sub(&obj_exts[off].ref, s->size);
+   }
+}
+
+#else
+
+static inline void alloc_tagging_slab_free_hook(struct kmem_cache *s, struct slab *slab,
+   void **p, int objects) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING */
+
 #ifdef CONFIG_MEMCG_KMEM
 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 enum node_stat_item idx, int nr);
@@ -827,6 +853,12 @@ static inline void slab_post_alloc_hook(struct kmem_cache *s,
 s->flags, flags);
kmsan_slab_alloc(s, p[i], flags);
obj_exts = prepare_slab_obj_exts_hook(s, flags, p[i]);
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+   /* obj_exts can be allocated for other reasons */
+   if (likely(obj_exts) && mem_alloc_profiling_enabled())
+   alloc_tag_add(&obj_exts->ref, current->alloc_tag, s->size);
+#endif
}
 
memcg_slab_post_alloc_hook(s, objcg, flags, size, p);
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 25/39] mm/slub: Mark slab_free_freelist_hook() __always_inline

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

It seems we need to be more forceful with the compiler on this one.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index f5e07d8802e2..222c16cef729 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1800,7 +1800,7 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s,
return kasan_slab_free(s, x, init);
 }
 
-static inline bool slab_free_freelist_hook(struct kmem_cache *s,
+static __always_inline bool slab_free_freelist_hook(struct kmem_cache *s,
   void **head, void **tail,
   int *cnt)
 {
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 24/39] mm/slab: enable slab allocation tagging for kmalloc and friends

2023-10-24 Thread Suren Baghdasaryan

Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations
and deallocations done by these functions.

Signed-off-by: Suren Baghdasaryan 
Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
---
 include/linux/fortify-string.h |   5 +-
 include/linux/slab.h           | 173 -
 include/linux/string.h         |   4 +-
 mm/slab.c                      |  18 ++--
 mm/slab_common.c               |  38
 mm/slub.c                      |  19 ++--
 mm/util.c                      |  20 ++--
 7 files changed, 137 insertions(+), 140 deletions(-)

diff --git a/include/linux/fortify-string.h b/include/linux/fortify-string.h
index da51a83b2829..11319e7634a4 100644
--- a/include/linux/fortify-string.h
+++ b/include/linux/fortify-string.h
@@ -752,9 +752,9 @@ __FORTIFY_INLINE void *memchr_inv(const void * const POS0 p, int c, size_t size)
return __real_memchr_inv(p, c, size);
 }
 
-extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kmemdup)
+extern void *__real_kmemdup(const void *src, size_t len, gfp_t gfp) __RENAME(kmemdup_noprof)
    __realloc_size(2);
-__FORTIFY_INLINE void *kmemdup(const void * const POS0 p, size_t size, gfp_t gfp)
+__FORTIFY_INLINE void *kmemdup_noprof(const void * const POS0 p, size_t size, gfp_t gfp)
 {
const size_t p_size = __struct_size(p);
 
@@ -764,6 +764,7 @@ __FORTIFY_INLINE void *kmemdup(const void * const POS0 p, size_t size, gfp_t gfp
fortify_panic(__func__);
return __real_kmemdup(p, size, gfp);
 }
+#define kmemdup(...)   alloc_hooks(kmemdup_noprof(__VA_ARGS__))
 
 /**
  * strcpy - Copy a string into another string buffer
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 11ef3d364b2b..0543e0f76c60 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -230,7 +230,9 @@ int kmem_cache_shrink(struct kmem_cache *s);
 /*
  * Common kmalloc functions provided by all allocators
  */
-void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+void * __must_check krealloc_noprof(const void *objp, size_t new_size, gfp_t flags) __realloc_size(2);
+#define krealloc(...) alloc_hooks(krealloc_noprof(__VA_ARGS__))
+
 void kfree(const void *objp);
 void kfree_sensitive(const void *objp);
 size_t __ksize(const void *objp);
@@ -491,7 +493,10 @@ static __always_inline unsigned int __kmalloc_index(size_t size,
 static_assert(PAGE_SHIFT <= 20);
 #define kmalloc_index(s) __kmalloc_index(s, true)
 
-void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
+#include 
+
+void *__kmalloc_noprof(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
+#define __kmalloc(...) alloc_hooks(__kmalloc_noprof(__VA_ARGS__))
 
 /**
  * kmem_cache_alloc - Allocate an object
@@ -503,9 +508,13 @@ void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_siz
  *
  * Return: pointer to the new object or %NULL in case of error
  */
-void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
-void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
-  gfp_t gfpflags) __assume_slab_alignment __malloc;
+void *kmem_cache_alloc_noprof(struct kmem_cache *cachep, gfp_t flags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc(...) alloc_hooks(kmem_cache_alloc_noprof(__VA_ARGS__))
+
+void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
+   gfp_t gfpflags) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_lru(...) alloc_hooks(kmem_cache_alloc_lru_noprof(__VA_ARGS__))
+
 void kmem_cache_free(struct kmem_cache *s, void *objp);
 
 /*
@@ -516,29 +525,40 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
  * Note that interrupts must be enabled when calling these functions.
  */
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
-int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+
+int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
+#define kmem_cache_alloc_bulk(...) alloc_hooks(kmem_cache_alloc_bulk_noprof(__VA_ARGS__))
 
 static __always_inline void kfree_bulk(size_t size, void **p)
 {
kmem_cache_free_bulk(NULL, size, p);
 }
 
-void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
+void *__kmalloc_node_noprof(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
 __alloc_size(1);
-void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
-   __malloc;
+#def

[PATCH v2 26/39] mempool: Hook up to memory allocation profiling

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

This adds hooks to mempools so that mempool-backed allocations are
attributed to the correct source line and show up correctly in
/sys/kernel/debug/allocations.

Various inline functions are converted to macro wrappers so that
alloc_hooks() needs to be invoked in fewer places.
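
A small userspace sketch of why the helpers become macros: a hook that
records its expansion site reports the helper's own line when it sits
inside an inline function, but reports the caller's line when the helper
is itself a macro. The demo_* names are hypothetical and the hook only
records __FILE__/__LINE__; it is an illustration of the attribution
problem, not of the mempool code.

```c
#include <assert.h>

/* Hypothetical record of the last hooked call site. */
struct demo_site {
	const char *file;
	int line;
};

struct demo_site demo_last_site;

/* Records the location where this macro is expanded. */
#define demo_hook() \
	((void)(demo_last_site = (struct demo_site){ __FILE__, __LINE__ }))

/* Inline helper: the hook expands here, so all callers share one site. */
static inline int demo_pool_create_inline(void)
{
	demo_hook();
	return demo_last_site.line;
}

/* Macro helper: the hook expands at each caller, so sites stay distinct. */
#define demo_pool_create_macro() (demo_hook(), demo_last_site.line)

/* Returns 1 when the macro helper is attributed to a different line. */
int demo_sites_differ(void)
{
	int a = demo_pool_create_inline();
	int b = demo_pool_create_inline();
	int c = demo_pool_create_macro(); int here = __LINE__;

	assert(a == b);		/* both charged to the inline helper's line */
	assert(c == here);	/* macro charged to the line of the call */
	return c != a;
}
```

With inline helpers, every pool created through the helper would be
lumped under one line in the header, which is the outcome the conversion
to macros avoids.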

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mempool.h | 73 -
 mm/mempool.c            | 34 ---
 2 files changed, 48 insertions(+), 59 deletions(-)

diff --git a/include/linux/mempool.h b/include/linux/mempool.h
index 4aae6c06c5f2..9fa126aa19b5 100644
--- a/include/linux/mempool.h
+++ b/include/linux/mempool.h
@@ -5,6 +5,8 @@
 #ifndef _LINUX_MEMPOOL_H
 #define _LINUX_MEMPOOL_H
 
+#include 
+#include 
 #include 
 #include 
 
@@ -39,18 +41,32 @@ void mempool_exit(mempool_t *pool);
 int mempool_init_node(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
  mempool_free_t *free_fn, void *pool_data,
  gfp_t gfp_mask, int node_id);
-int mempool_init(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
+
+int mempool_init_noprof(mempool_t *pool, int min_nr, mempool_alloc_t *alloc_fn,
 mempool_free_t *free_fn, void *pool_data);
+#define mempool_init(...)  \
+   alloc_hooks(mempool_init_noprof(__VA_ARGS__))
 
 extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data);
-extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
+
+extern mempool_t *mempool_create_node_noprof(int min_nr, mempool_alloc_t *alloc_fn,
mempool_free_t *free_fn, void *pool_data,
gfp_t gfp_mask, int nid);
+#define mempool_create_node(...)   \
+   alloc_hooks(mempool_create_node_noprof(__VA_ARGS__))
+
+#define mempool_create(_min_nr, _alloc_fn, _free_fn, _pool_data)   \
+   mempool_create_node(_min_nr, _alloc_fn, _free_fn, _pool_data,   \
+   GFP_KERNEL, NUMA_NO_NODE)
 
 extern int mempool_resize(mempool_t *pool, int new_min_nr);
 extern void mempool_destroy(mempool_t *pool);
-extern void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) __malloc;
+
+extern void *mempool_alloc_noprof(mempool_t *pool, gfp_t gfp_mask) __malloc;
+#define mempool_alloc(...) \
+   alloc_hooks(mempool_alloc_noprof(__VA_ARGS__))
+
 extern void mempool_free(void *element, mempool_t *pool);
 
 /*
@@ -61,19 +77,10 @@ extern void mempool_free(void *element, mempool_t *pool);
 void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
 void mempool_free_slab(void *element, void *pool_data);
 
-static inline int
-mempool_init_slab_pool(mempool_t *pool, int min_nr, struct kmem_cache *kc)
-{
-   return mempool_init(pool, min_nr, mempool_alloc_slab,
-   mempool_free_slab, (void *) kc);
-}
-
-static inline mempool_t *
-mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
-{
-   return mempool_create(min_nr, mempool_alloc_slab, mempool_free_slab,
- (void *) kc);
-}
+#define mempool_init_slab_pool(_pool, _min_nr, _kc)\
+   mempool_init(_pool, (_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
+#define mempool_create_slab_pool(_min_nr, _kc) \
+   mempool_create((_min_nr), mempool_alloc_slab, mempool_free_slab, (void *)(_kc))
 
 /*
  * a mempool_alloc_t and a mempool_free_t to kmalloc and kfree the
@@ -82,17 +89,12 @@ mempool_create_slab_pool(int min_nr, struct kmem_cache *kc)
 void *mempool_kmalloc(gfp_t gfp_mask, void *pool_data);
 void mempool_kfree(void *element, void *pool_data);
 
-static inline int mempool_init_kmalloc_pool(mempool_t *pool, int min_nr, size_t size)
-{
-   return mempool_init(pool, min_nr, mempool_kmalloc,
-   mempool_kfree, (void *) size);
-}
-
-static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
-{
-   return mempool_create(min_nr, mempool_kmalloc, mempool_kfree,
- (void *) size);
-}
+#define mempool_init_kmalloc_pool(_pool, _min_nr, _size)   \
+   mempool_init(_pool, (_min_nr), mempool_kmalloc, mempool_kfree,  \
+(void *)(unsigned long)(_size))
+#define mempool_create_kmalloc_pool(_min_nr, _size)\
+   mempool_create((_min_nr), mempool_kmalloc, mempool_kfree,   \
+  (void *)(unsigned long)(_size))
 
 /*
  * A mempool_alloc_t and mempool_free_t for a simple page allocator that
@@ -101,16 +103,11 @@ static inline mempool_t *mempool_create_kmalloc_pool(int min_nr, size_t size)
 void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
 void mempool_free_pages(void *element, void *pool_data);
 
-static

[PATCH v2 27/39] xfs: Memory allocation profiling fixups

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

This adds an alloc_hooks() wrapper around kmem_alloc(), so that we can
have allocations accounted to the proper callsite.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 fs/xfs/kmem.c |  4 ++--
 fs/xfs/kmem.h | 10 --
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index c557a030acfe..9aa57a4e2478 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -8,7 +8,7 @@
 #include "xfs_trace.h"
 
 void *
-kmem_alloc(size_t size, xfs_km_flags_t flags)
+kmem_alloc_noprof(size_t size, xfs_km_flags_t flags)
 {
int retries = 0;
gfp_t   lflags = kmem_flags_convert(flags);
@@ -17,7 +17,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
trace_kmem_alloc(size, flags, _RET_IP_);
 
do {
-   ptr = kmalloc(size, lflags);
+   ptr = kmalloc_noprof(size, lflags);
if (ptr || (flags & KM_MAYFAIL))
return ptr;
if (!(++retries % 100))
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index b987dc2c6851..c4cf1dc2a7af 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -6,6 +6,7 @@
 #ifndef __XFS_SUPPORT_KMEM_H__
 #define __XFS_SUPPORT_KMEM_H__
 
+#include 
 #include 
 #include 
 #include 
@@ -56,18 +57,15 @@ kmem_flags_convert(xfs_km_flags_t flags)
return lflags;
 }
 
-extern void *kmem_alloc(size_t, xfs_km_flags_t);
 static inline void  kmem_free(const void *ptr)
 {
kvfree(ptr);
 }
 
+extern void *kmem_alloc_noprof(size_t, xfs_km_flags_t);
+#define kmem_alloc(...) alloc_hooks(kmem_alloc_noprof(__VA_ARGS__))
 
-static inline void *
-kmem_zalloc(size_t size, xfs_km_flags_t flags)
-{
-   return kmem_alloc(size, flags | KM_ZERO);
-}
+#define kmem_zalloc(_size, _flags) kmem_alloc((_size), (_flags) | KM_ZERO)
 
 /*
  * Zone interfaces
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 29/39] mm: percpu: Introduce pcpuobj_ext

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

Upcoming alloc tagging patches require a place to stash per-allocation
metadata.

We already do this when memcg is enabled, so this patch generalizes the
obj_cgroup * vector in struct pcpu_chunk by creating a pcpuobj_ext
type, which we will be adding to in an upcoming patch, similarly to the
previous slabobj_ext patch.
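
The generalization can be pictured with the following userspace sketch:
instead of a vector of cgroup pointers, the chunk holds a vector of small
structs, one per minimum allocation unit, so that further per-allocation
fields can be added later without touching the indexing. Everything named
demo_* or DEMO_* is hypothetical, including the shift value and the second
field, which only marks where a later patch could add a member.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MIN_ALLOC_SHIFT 2	/* assumed: 4-byte minimum allocation unit */

struct demo_cgroup;		/* opaque, never dereferenced here */
struct demo_tag;		/* opaque, never dereferenced here */

/* Per-allocation metadata slot; new members can be appended later. */
struct demo_pcpuobj_ext {
	struct demo_cgroup *cgroup;
	struct demo_tag *tag;	/* illustrative future field */
};

struct demo_chunk {
	size_t nr_units;			/* number of minimum-size units */
	struct demo_pcpuobj_ext *obj_exts;	/* NULL when no user needs it */
};

/* Allocate the extension vector only when some feature asks for it. */
int demo_chunk_init(struct demo_chunk *chunk, size_t bytes, int need_ext)
{
	chunk->nr_units = bytes >> DEMO_MIN_ALLOC_SHIFT;
	chunk->obj_exts = NULL;
	if (need_ext) {
		chunk->obj_exts = calloc(chunk->nr_units,
					 sizeof(*chunk->obj_exts));
		if (!chunk->obj_exts)
			return -1;
	}
	return 0;
}

/* Look up the metadata slot for the allocation at byte offset 'off'. */
struct demo_pcpuobj_ext *demo_ext_at(struct demo_chunk *chunk, size_t off)
{
	if (!chunk->obj_exts)
		return NULL;
	assert((off >> DEMO_MIN_ALLOC_SHIFT) < chunk->nr_units);
	return &chunk->obj_exts[off >> DEMO_MIN_ALLOC_SHIFT];
}
```

Callers keep using the same offset-based index; only the element type of
the vector changes.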

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Cc: Andrew Morton 
Cc: Dennis Zhou 
Cc: Tejun Heo 
Cc: Christoph Lameter 
Cc: linux...@kvack.org
---
 mm/percpu-internal.h | 19 +--
 mm/percpu.c          | 30 +++---
 2 files changed, 32 insertions(+), 17 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index cdd0aa597a81..e62d582f4bf3 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -32,6 +32,16 @@ struct pcpu_block_md {
int nr_bits;/* total bits responsible for */
 };
 
+struct pcpuobj_ext {
+#ifdef CONFIG_MEMCG_KMEM
+   struct obj_cgroup   *cgroup;
+#endif
+};
+
+#ifdef CONFIG_MEMCG_KMEM
+#define NEED_PCPUOBJ_EXT
+#endif
+
 struct pcpu_chunk {
 #ifdef CONFIG_PERCPU_STATS
int nr_alloc;   /* # of allocations */
@@ -64,8 +74,8 @@ struct pcpu_chunk {
int end_offset; /* additional area required to
   have the region end page
   aligned */
-#ifdef CONFIG_MEMCG_KMEM
-   struct obj_cgroup   **obj_cgroups;  /* vector of object cgroups */
+#ifdef NEED_PCPUOBJ_EXT
+   struct pcpuobj_ext  *obj_exts;  /* vector of object cgroups */
 #endif
 
int nr_pages;   /* # of pages served by this chunk */
@@ -74,6 +84,11 @@ struct pcpu_chunk {
unsigned long   populated[];/* populated bitmap */
 };
 
+static inline bool need_pcpuobj_ext(void)
+{
+   return !mem_cgroup_kmem_disabled();
+}
+
 extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
diff --git a/mm/percpu.c b/mm/percpu.c
index a7665de8485f..5a6202acffa3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1392,9 +1392,9 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
panic("%s: Failed to allocate %zu bytes\n", __func__,
  alloc_size);
 
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
/* first chunk is free to use */
-   chunk->obj_cgroups = NULL;
+   chunk->obj_exts = NULL;
 #endif
pcpu_init_md_blocks(chunk);
 
@@ -1463,12 +1463,12 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
if (!chunk->md_blocks)
goto md_blocks_fail;
 
-#ifdef CONFIG_MEMCG_KMEM
-   if (!mem_cgroup_kmem_disabled()) {
-   chunk->obj_cgroups =
+#ifdef NEED_PCPUOBJ_EXT
+   if (need_pcpuobj_ext()) {
+   chunk->obj_exts =
pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
-   sizeof(struct obj_cgroup *), gfp);
-   if (!chunk->obj_cgroups)
+   sizeof(struct pcpuobj_ext), gfp);
+   if (!chunk->obj_exts)
goto objcg_fail;
}
 #endif
@@ -1480,7 +1480,7 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
 
return chunk;
 
-#ifdef CONFIG_MEMCG_KMEM
+#ifdef NEED_PCPUOBJ_EXT
 objcg_fail:
pcpu_mem_free(chunk->md_blocks);
 #endif
@@ -1498,8 +1498,8 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
 {
if (!chunk)
return;
-#ifdef CONFIG_MEMCG_KMEM
-   pcpu_mem_free(chunk->obj_cgroups);
+#ifdef NEED_PCPUOBJ_EXT
+   pcpu_mem_free(chunk->obj_exts);
 #endif
pcpu_mem_free(chunk->md_blocks);
pcpu_mem_free(chunk->bound_map);
@@ -1648,8 +1648,8 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
if (!objcg)
return;
 
-   if (likely(chunk && chunk->obj_cgroups)) {
-   chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
+   if (likely(chunk && chunk->obj_exts)) {
+   chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
 
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
@@ -1665,13 +1665,13 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 {
struct obj_cgroup *objcg;
 
-   if (unlikely(!chunk->obj_cgroups))
+   if (unlikely(!chunk->obj_exts))
return;
 
-   objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+   objcg = chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup;
if (!objcg)
return;
-   chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+   c

[PATCH v2 28/39] timekeeping: Fix a circular include dependency

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

This avoids a circular header dependency in an upcoming patch by only
making hrtimer.h depend on percpu-defs.h.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
Cc: Thomas Gleixner 
---
 include/linux/hrtimer.h        | 2 +-
 include/linux/time_namespace.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 0ee140176f10..e67349e84364 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -16,7 +16,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 03d9c5ac01d1..a9e61120d4e3 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -11,6 +11,8 @@
 struct user_namespace;
 extern struct user_namespace init_user_ns;
 
+struct vm_area_struct;
+
 struct timens_offsets {
struct timespec64 monotonic;
struct timespec64 boottime;
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 31/39] mm: percpu: enable per-cpu allocation tagging

2023-10-24 Thread Suren Baghdasaryan

Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu
to record allocations and deallocations done by these functions.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h | 15 +
 include/linux/percpu.h    | 23 +-
 mm/percpu.c               | 64 +--
 3 files changed, 38 insertions(+), 64 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 6fa8a94d8bc1..3fe51e67e231 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -140,4 +140,19 @@ static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
_res;   \
 })
 
+/*
+ * workaround for a sparse bug: it complains about res_type_to_err() when
+ * typeof(_do_alloc) is a __percpu pointer, but gcc won't let us add a separate
+ * __percpu case to res_type_to_err():
+ */
+#define alloc_hooks_pcpu(_do_alloc)\
+({ \
+   typeof(_do_alloc) _res; \
+   DEFINE_ALLOC_TAG(_alloc_tag, _old); \
+   \
+   _res = _do_alloc;   \
+   alloc_tag_restore(&_alloc_tag, _old);   \
+   _res;   \
+})
+
 #endif /* _LINUX_ALLOC_TAG_H */
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 68fac2e7cbe6..338c1ef9c93d 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_PERCPU_H
 #define __LINUX_PERCPU_H
 
+#include 
 #include 
 #include 
 #include 
@@ -9,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -121,7 +123,6 @@ extern int __init pcpu_page_first_chunk(size_t reserved_size,
pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn);
 #endif
 
-extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1);
 extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr);
 extern bool is_kernel_percpu_address(unsigned long addr);
 
@@ -129,13 +130,15 @@ extern bool is_kernel_percpu_address(unsigned long addr);
 extern void __init setup_per_cpu_areas(void);
 #endif
 
-extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
-extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
-extern void free_percpu(void __percpu *__pdata);
+extern void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
+  gfp_t gfp) __alloc_size(1);
 
-DEFINE_FREE(free_percpu, void __percpu *, free_percpu(_T))
-
-extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+#define __alloc_percpu_gfp(_size, _align, _gfp) \
+   alloc_hooks_pcpu(pcpu_alloc_noprof(_size, _align, false, _gfp))
+#define __alloc_percpu(_size, _align)  \
+   alloc_hooks_pcpu(pcpu_alloc_noprof(_size, _align, false, GFP_KERNEL))
+#define __alloc_reserved_percpu(_size, _align) \
+   alloc_hooks_pcpu(pcpu_alloc_noprof(_size, _align, true, GFP_KERNEL))
 
 #define alloc_percpu_gfp(type, gfp)\
(typeof(type) __percpu *)__alloc_percpu_gfp(sizeof(type),   \
@@ -144,6 +147,12 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
(typeof(type) __percpu *)__alloc_percpu(sizeof(type),   \
__alignof__(type))
 
+extern void free_percpu(void __percpu *__pdata);
+
+DEFINE_FREE(free_percpu, void __percpu *, free_percpu(_T))
+
+extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
+
 extern unsigned long pcpu_nr_pages(void);
 
 #endif /* __LINUX_PERCPU_H */
diff --git a/mm/percpu.c b/mm/percpu.c
index 002ee5d38fd5..328a5b3c943b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1728,7 +1728,7 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
 #endif
 
 /**
- * pcpu_alloc - the percpu allocator
+ * pcpu_alloc_noprof - the percpu allocator
  * @size: size of area to allocate in bytes
  * @align: alignment of area (max PAGE_SIZE)
  * @reserved: allocate from the reserved chunk if available
@@ -1742,7 +1742,7 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
  */
-static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
+void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 gfp_t gfp)
 {
gfp_t pcpu_gfp;
@@ -1909,6 +1909,8 @@ static void

[PATCH v2 30/39] mm: percpu: Add codetag reference into pcpuobj_ext

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

To store a codetag for every per-cpu allocation, embed a codetag
reference into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y, and add
hooks that use the newly introduced codetag.
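
The indexing scheme used by these hooks can be modelled in userspace. The
following is only a sketch under stated assumptions: struct tag, struct
obj_ext, struct chunk, MIN_ALLOC_SHIFT and CHUNK_BYTES are hypothetical
stand-ins, not the kernel's definitions. It shows one extension slot per
minimum-sized allocation unit, looked up by shifting the byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: a 4-byte minimum allocation unit. */
#define MIN_ALLOC_SHIFT 2
#define CHUNK_BYTES     64

/* Hypothetical stand-in for a code tag: one byte counter per call site. */
struct tag {
	const char *site;
	size_t bytes;
};

/* Hypothetical stand-in for pcpuobj_ext: one slot per minimum-sized unit. */
struct obj_ext {
	struct tag *tag;
};

struct chunk {
	struct obj_ext exts[CHUNK_BYTES >> MIN_ALLOC_SHIFT];
};

/* Record which tag owns the allocation at byte offset off and charge it. */
static void tag_alloc_hook(struct chunk *c, size_t off, size_t size,
			   struct tag *t)
{
	assert(off < CHUNK_BYTES);
	c->exts[off >> MIN_ALLOC_SHIFT].tag = t;
	t->bytes += size;
}

/* Uncharge the allocation at off from whichever tag was recorded for it. */
static void tag_free_hook(struct chunk *c, size_t off, size_t size)
{
	struct obj_ext *ext;

	assert(off < CHUNK_BYTES);
	ext = &c->exts[off >> MIN_ALLOC_SHIFT];
	if (ext->tag) {
		ext->tag->bytes -= size;
		ext->tag = NULL;
	}
}
```

Because the slot is found from the offset alone, the free path needs no
knowledge of the original call site.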

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 mm/percpu-internal.h | 11 +--
 mm/percpu.c  | 26 ++
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index e62d582f4bf3..7e42f0ca3b7b 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -36,9 +36,12 @@ struct pcpuobj_ext {
 #ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup   *cgroup;
 #endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+   union codetag_ref   tag;
+#endif
 };
 
-#ifdef CONFIG_MEMCG_KMEM
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEM_ALLOC_PROFILING)
 #define NEED_PCPUOBJ_EXT
 #endif
 
@@ -86,7 +89,11 @@ struct pcpu_chunk {
 
 static inline bool need_pcpuobj_ext(void)
 {
-   return !mem_cgroup_kmem_disabled();
+   if (IS_ENABLED(CONFIG_MEM_ALLOC_PROFILING))
+   return true;
+   if (!mem_cgroup_kmem_disabled())
+   return true;
+   return false;
 }
 
 extern spinlock_t pcpu_lock;
diff --git a/mm/percpu.c b/mm/percpu.c
index 5a6202acffa3..002ee5d38fd5 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1701,6 +1701,32 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+   if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts)) {
+   alloc_tag_add(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag,
+ current->alloc_tag, size);
+   }
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+   if (mem_alloc_profiling_enabled() && likely(chunk->obj_exts))
+   alloc_tag_sub_noalloc(&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].tag, size);
+}
+#else
+static void pcpu_alloc_tag_alloc_hook(struct pcpu_chunk *chunk, int off,
+ size_t size)
+{
+}
+
+static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
+{
+}
+#endif
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 32/39] arm64: Fix circular header dependency

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

Replace linux/percpu.h include with asm/percpu.h to avoid circular
dependency.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 arch/arm64/include/asm/spectre.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/spectre.h b/arch/arm64/include/asm/spectre.h
index 9cc501450486..75e837753772 100644
--- a/arch/arm64/include/asm/spectre.h
+++ b/arch/arm64/include/asm/spectre.h
@@ -13,8 +13,8 @@
 #define __BP_HARDEN_HYP_VECS_SZ((BP_HARDEN_EL2_SLOTS - 1) * SZ_2K)
 
 #ifndef __ASSEMBLY__
-
-#include 
+#include 
+#include 
 
 #include 
 #include 
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 35/39] lib: add memory allocations report in show_mem()

2023-10-24 Thread Suren Baghdasaryan

Include allocations in show_mem reports.
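
The report in the diff below keeps a small array of the largest tags, sorted
by bytes, while iterating over all code tags. A minimal userspace sketch of
that bounded insertion follows; struct entry, struct top_list and TOP_N are
hypothetical names, and the list is capped at 3 here (the patch uses 10)
only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOP_N 3	/* the patch keeps 10 entries */

struct entry {
	const char *name;	/* hypothetical label for an allocation site */
	size_t bytes;
};

struct top_list {
	struct entry e[TOP_N];
	unsigned int nr;
};

/* Insert n so that e[0..nr) stays sorted by bytes, largest first. */
static void top_list_add(struct top_list *t, struct entry n)
{
	unsigned int i;

	/* Find the first entry that n is larger than. */
	for (i = 0; i < t->nr; i++)
		if (n.bytes > t->e[i].bytes)
			break;

	if (i < TOP_N) {
		/* When the list is full, the smallest entry falls off the end. */
		if (t->nr == TOP_N)
			t->nr--;
		memmove(&t->e[i + 1], &t->e[i], sizeof(t->e[0]) * (t->nr - i));
		t->nr++;
		t->e[i] = n;
	}
}
```

Each insertion costs at most TOP_N comparisons and one memmove, so the scan
over all tags needs no allocation and no full sort.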

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h |  2 ++
 lib/alloc_tag.c   | 37 +
 mm/show_mem.c | 15 +++
 3 files changed, 54 insertions(+)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 3fe51e67e231..0a5973c4ad77 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,6 +30,8 @@ struct alloc_tag {
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+void alloc_tags_show_mem_report(struct seq_buf *s);
+
 static inline struct alloc_tag *ct_to_alloc_tag(struct codetag *ct)
 {
return container_of(ct, struct alloc_tag, ct);
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 2d5226d9262d..2f7a2e3ddf55 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -96,6 +96,43 @@ static const struct seq_operations allocinfo_seq_op = {
.show   = allocinfo_show,
 };
 
+void alloc_tags_show_mem_report(struct seq_buf *s)
+{
+   struct codetag_iterator iter;
+   struct codetag *ct;
+   struct {
+   struct codetag  *tag;
+   size_t  bytes;
+   } tags[10], n;
+   unsigned int i, nr = 0;
+
+   codetag_lock_module_list(alloc_tag_cttype, true);
+   iter = codetag_get_ct_iter(alloc_tag_cttype);
+   while ((ct = codetag_next_ct(&iter))) {
+   struct alloc_tag_counters counter = alloc_tag_read(ct_to_alloc_tag(ct));
+   n.tag   = ct;
+   n.bytes = counter.bytes;
+
+   for (i = 0; i < nr; i++)
+   if (n.bytes > tags[i].bytes)
+   break;
+
+   if (i < ARRAY_SIZE(tags)) {
+   nr -= nr == ARRAY_SIZE(tags);
+   memmove(&tags[i + 1],
+   &tags[i],
+   sizeof(tags[0]) * (nr - i));
+   nr++;
+   tags[i] = n;
+   }
+   }
+
+   for (i = 0; i < nr; i++)
+   alloc_tag_to_text(s, tags[i].tag);
+
+   codetag_lock_module_list(alloc_tag_cttype, false);
+}
+
 static void __init procfs_init(void)
 {
proc_create_seq("allocinfo", 0444, NULL, &allocinfo_seq_op);
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 4b888b18bdde..660e9a78a34d 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -426,4 +427,18 @@ void __show_mem(unsigned int filter, nodemask_t *nodemask, int max_zone_idx)
 #ifdef CONFIG_MEMORY_FAILURE
printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
 #endif
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+   {
+   struct seq_buf s;
+   char *buf = kmalloc(4096, GFP_ATOMIC);
+
+   if (buf) {
+   printk("Memory allocations:\n");
+   seq_buf_init(&s, buf, 4096);
+   alloc_tags_show_mem_report(&s);
+   printk("%s", buf);
+   kfree(buf);
+   }
+   }
+#endif
 }
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 33/39] mm: vmalloc: Enable memory allocation profiling

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

This wraps all external vmalloc allocation functions with the
alloc_hooks() wrapper, and switches internal allocations to _noprof
variants where appropriate, for the new memory allocation profiling
feature.
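
The wrapping pattern can be illustrated with a userspace model. This is a
sketch, not the kernel's alloc_hooks() definition: alloc_hooks_model,
struct site_tag, current_tag, last_charged and buf_alloc are hypothetical
names, and the macro relies on GNU C statement expressions (supported by GCC
and Clang). Every expansion of the wrapper owns a static tag, installs it
around the uninstrumented call, and restores the previous tag afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call-site record; the kernel uses struct alloc_tag. */
struct site_tag {
	const char *file;
	int line;
	unsigned long calls;
};

/* Stand-in for current->alloc_tag: the call site being charged right now. */
static struct site_tag *current_tag;
/* Remembers the most recently charged tag so callers can inspect it. */
static struct site_tag *last_charged;

/* The uninstrumented allocator: it charges whatever tag is current. */
static void *buf_alloc_noprof(size_t size)
{
	if (current_tag) {
		current_tag->calls++;
		last_charged = current_tag;
	}
	return malloc(size);
}

/*
 * Model of an alloc_hooks()-style wrapper (GNU C statement expression).
 * Each expansion gets its own static tag, makes it current for the
 * duration of the call, then restores the old tag and yields the result.
 */
#define alloc_hooks_model(_do_alloc)					\
({									\
	static struct site_tag _tag = { __FILE__, __LINE__, 0 };	\
	struct site_tag *_old = current_tag;				\
	__typeof__(_do_alloc) _res;					\
	current_tag = &_tag;						\
	_res = _do_alloc;						\
	current_tag = _old;						\
	_res;								\
})

#define buf_alloc(...)	alloc_hooks_model(buf_alloc_noprof(__VA_ARGS__))
```

Callers keep writing buf_alloc(size); only code that must not be charged to
its own call site calls the _noprof variant directly.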

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 drivers/staging/media/atomisp/pci/hmm/hmm.c |  2 +-
 include/linux/vmalloc.h | 60 ++
 kernel/kallsyms_selftest.c  |  2 +-
 mm/util.c   | 24 +++---
 mm/vmalloc.c| 88 ++---
 5 files changed, 103 insertions(+), 73 deletions(-)

diff --git a/drivers/staging/media/atomisp/pci/hmm/hmm.c b/drivers/staging/media/atomisp/pci/hmm/hmm.c
index bb12644fd033..3e2899ad8517 100644
--- a/drivers/staging/media/atomisp/pci/hmm/hmm.c
+++ b/drivers/staging/media/atomisp/pci/hmm/hmm.c
@@ -205,7 +205,7 @@ static ia_css_ptr __hmm_alloc(size_t bytes, enum hmm_bo_type type,
}
 
dev_dbg(atomisp_dev, "pages: 0x%08x (%zu bytes), type: %d, vmalloc %p\n",
-   bo->start, bytes, type, vmalloc);
+   bo->start, bytes, type, vmalloc_noprof);
 
return bo->start;
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..106d78e75606 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -2,6 +2,8 @@
 #ifndef _LINUX_VMALLOC_H
 #define _LINUX_VMALLOC_H
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -137,26 +139,54 @@ extern unsigned long vmalloc_nr_pages(void);
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 #endif
 
-extern void *vmalloc(unsigned long size) __alloc_size(1);
-extern void *vzalloc(unsigned long size) __alloc_size(1);
-extern void *vmalloc_user(unsigned long size) __alloc_size(1);
-extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
-extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
-extern void *vmalloc_32(unsigned long size) __alloc_size(1);
-extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
-extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
-extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
+extern void *vmalloc_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc(...)   alloc_hooks(vmalloc_noprof(__VA_ARGS__))
+
+extern void *vzalloc_noprof(unsigned long size) __alloc_size(1);
+#define vzalloc(...)   alloc_hooks(vzalloc_noprof(__VA_ARGS__))
+
+extern void *vmalloc_user_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc_user(...)  alloc_hooks(vmalloc_user_noprof(__VA_ARGS__))
+
+extern void *vmalloc_node_noprof(unsigned long size, int node) __alloc_size(1);
+#define vmalloc_node(...)  alloc_hooks(vmalloc_node_noprof(__VA_ARGS__))
+
+extern void *vzalloc_node_noprof(unsigned long size, int node) __alloc_size(1);
+#define vzalloc_node(...)  alloc_hooks(vzalloc_node_noprof(__VA_ARGS__))
+
+extern void *vmalloc_32_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc_32(...)    alloc_hooks(vmalloc_32_noprof(__VA_ARGS__))
+
+extern void *vmalloc_32_user_noprof(unsigned long size) __alloc_size(1);
+#define vmalloc_32_user(...)   alloc_hooks(vmalloc_32_user_noprof(__VA_ARGS__))
+
+extern void *__vmalloc_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+#define __vmalloc(...) alloc_hooks(__vmalloc_noprof(__VA_ARGS__))
+
+extern void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
unsigned long start, unsigned long end, gfp_t gfp_mask,
pgprot_t prot, unsigned long vm_flags, int node,
const void *caller) __alloc_size(1);
-void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
+#define __vmalloc_node_range(...)  alloc_hooks(__vmalloc_node_range_noprof(__VA_ARGS__))
+
+void *__vmalloc_node_noprof(unsigned long size, unsigned long align, gfp_t gfp_mask,
int node, const void *caller) __alloc_size(1);
-void *vmalloc_huge(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+#define __vmalloc_node(...)alloc_hooks(__vmalloc_node_noprof(__VA_ARGS__))
+
+void *vmalloc_huge_noprof(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
+#define vmalloc_huge(...)  alloc_hooks(vmalloc_huge_noprof(__VA_ARGS__))
+
+extern void *__vmalloc_array_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
+#define __vmalloc_array(...)   alloc_hooks(__vmalloc_array_noprof(__VA_ARGS__))
+
+extern void *vmalloc_array_noprof(size_t n, size_t size) __alloc_size(1, 2);
+#define vmalloc_array(...) alloc_hooks(vmalloc_array_noprof(__VA_ARGS__))
+
+extern void *__vcalloc_noprof(size_t n, size_t size, gfp_t flags) __alloc_size(1, 2);
+#define __vcalloc(...) alloc_hooks(__vcalloc_noprof(__VA_ARGS__))
 
-extern void *__vmalloc_array(size_t n,

[PATCH v2 34/39] rhashtable: Plumb through alloc tag

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

This gives better memory allocation profiling results; rhashtable
allocations will be accounted to the code that initialized the
rhashtable.
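
The idea of charging later allocations to the initializer can be sketched in
userspace. All names below (struct site_tag, current_tag, tag_save,
tag_restore, struct table) are hypothetical, and the model simplifies the
kernel helpers: the table remembers the tag that was current at init time and
temporarily switches back to it whenever it allocates on its own behalf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-call-site byte counter. */
struct site_tag {
	const char *name;
	size_t bytes;
};

/* Stand-in for current->alloc_tag. */
static struct site_tag *current_tag;

/* Make tag current and return the tag that was current before. */
static struct site_tag *tag_save(struct site_tag *tag)
{
	struct site_tag *old = current_tag;

	current_tag = tag;
	return old;
}

static void tag_restore(struct site_tag *old)
{
	current_tag = old;
}

/* Charge bytes to whichever tag is current. */
static void charge(size_t bytes)
{
	if (current_tag)
		current_tag->bytes += bytes;
}

struct table {
	struct site_tag *owner;	/* tag of the code that initialized us */
	size_t nbuckets;
};

static void table_init(struct table *t)
{
	t->owner = current_tag;
	t->nbuckets = 4;
	charge(t->nbuckets * sizeof(void *));
}

/* Double the table; the added buckets are charged to the owner. */
static void table_grow(struct table *t)
{
	struct site_tag *old = tag_save(t->owner);

	charge(t->nbuckets * sizeof(void *));
	t->nbuckets *= 2;
	tag_restore(old);
}
```

Without the save/restore pair, growth triggered from an unrelated caller (for
example a worker inserting an element) would be charged to that caller.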

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/rhashtable-types.h | 11 +--
 lib/rhashtable.c | 52 +---
 2 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/include/linux/rhashtable-types.h b/include/linux/rhashtable-types.h
index 57467cbf4c5b..aac2984c2ef0 100644
--- a/include/linux/rhashtable-types.h
+++ b/include/linux/rhashtable-types.h
@@ -9,6 +9,7 @@
 #ifndef _LINUX_RHASHTABLE_TYPES_H
 #define _LINUX_RHASHTABLE_TYPES_H
 
+#include 
 #include 
 #include 
 #include 
@@ -88,6 +89,9 @@ struct rhashtable {
struct mutexmutex;
spinlock_t  lock;
atomic_tnelems;
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+   struct alloc_tag*alloc_tag;
+#endif
 };
 
 /**
@@ -127,9 +131,12 @@ struct rhashtable_iter {
bool end_of_table;
 };
 
-int rhashtable_init(struct rhashtable *ht,
+int rhashtable_init_noprof(struct rhashtable *ht,
const struct rhashtable_params *params);
-int rhltable_init(struct rhltable *hlt,
+#define rhashtable_init(...)   alloc_hooks(rhashtable_init_noprof(__VA_ARGS__))
+
+int rhltable_init_noprof(struct rhltable *hlt,
  const struct rhashtable_params *params);
+#define rhltable_init(...) alloc_hooks(rhltable_init_noprof(__VA_ARGS__))
 
 #endif /* _LINUX_RHASHTABLE_TYPES_H */
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 6ae2ba8e06a2..b62116f332b8 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -63,6 +63,27 @@ EXPORT_SYMBOL_GPL(lockdep_rht_bucket_is_held);
 #define ASSERT_RHT_MUTEX(HT)
 #endif
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING
+static inline void rhashtable_alloc_tag_init(struct rhashtable *ht)
+{
+   ht->alloc_tag = current->alloc_tag;
+}
+
+static inline struct alloc_tag *rhashtable_alloc_tag_save(struct rhashtable *ht)
+{
+   return alloc_tag_save(ht->alloc_tag);
+}
+
+static inline void rhashtable_alloc_tag_restore(struct rhashtable *ht, struct alloc_tag *old)
+{
+   alloc_tag_restore(ht->alloc_tag, old);
+}
+#else
+#define rhashtable_alloc_tag_init(ht)
+static inline struct alloc_tag *rhashtable_alloc_tag_save(struct rhashtable *ht) { return NULL; }
+#define rhashtable_alloc_tag_restore(ht, old)
+#endif
+
 static inline union nested_table *nested_table_top(
const struct bucket_table *tbl)
 {
@@ -130,7 +151,7 @@ static union nested_table *nested_table_alloc(struct rhashtable *ht,
if (ntbl)
return ntbl;
 
-   ntbl = kzalloc(PAGE_SIZE, GFP_ATOMIC);
+   ntbl = kmalloc_noprof(PAGE_SIZE, GFP_ATOMIC|__GFP_ZERO);
 
if (ntbl && leaf) {
for (i = 0; i < PAGE_SIZE / sizeof(ntbl[0]); i++)
@@ -157,7 +178,7 @@ static struct bucket_table *nested_bucket_table_alloc(struct rhashtable *ht,
 
size = sizeof(*tbl) + sizeof(tbl->buckets[0]);
 
-   tbl = kzalloc(size, gfp);
+   tbl = kmalloc_noprof(size, gfp|__GFP_ZERO);
if (!tbl)
return NULL;
 
@@ -180,8 +201,10 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
size_t size;
int i;
static struct lock_class_key __key;
+   struct alloc_tag * __maybe_unused old = rhashtable_alloc_tag_save(ht);
 
-   tbl = kvzalloc(struct_size(tbl, buckets, nbuckets), gfp);
+   tbl = kvmalloc_node_noprof(struct_size(tbl, buckets, nbuckets),
+  gfp|__GFP_ZERO, NUMA_NO_NODE);
 
size = nbuckets;
 
@@ -190,6 +213,8 @@ static struct bucket_table *bucket_table_alloc(struct rhashtable *ht,
nbuckets = 0;
}
 
+   rhashtable_alloc_tag_restore(ht, old);
+
if (tbl == NULL)
return NULL;
 
@@ -975,7 +1000,7 @@ static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
 }
 
 /**
- * rhashtable_init - initialize a new hash table
+ * rhashtable_init_noprof - initialize a new hash table
  * @ht:hash table to be initialized
  * @params:configuration parameters
  *
@@ -1016,7 +1041,7 @@ static u32 rhashtable_jhash2(const void *key, u32 length, u32 seed)
  * .obj_hashfn = my_hash_fn,
  * };
  */
-int rhashtable_init(struct rhashtable *ht,
+int rhashtable_init_noprof(struct rhashtable *ht,
const struct rhashtable_params *params)
 {
struct bucket_table *tbl;
@@ -1031,6 +1056,8 @@ int rhashtable_init(struct rhashtable *ht,
spin_lock_init(&ht->lock);
memcpy(&ht->p, params, sizeof(*params));
 
+   rhashtable_alloc_tag_init(ht);
+
if (params->min_size)
ht->p.min_size = roundup_pow_of_two(params->min_size);
 
@@ -1076,26 +1103,26 @@ int rhashtable_init(struct rhashtab

[PATCH v2 36/39] codetag: debug: skip objext checking when it's for objext itself

2023-10-24 Thread Suren Baghdasaryan

objext objects are created with the __GFP_NO_OBJ_EXT flag and therefore
have no corresponding objext themselves (otherwise we would get infinite
recursion). When these objects are freed their codetag will be empty, and
when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this leads to false
warnings. Introduce a special CODETAG_EMPTY codetag value to mark
allocations which intentionally lack a codetag, to avoid these warnings.
Set objext codetags to CODETAG_EMPTY before freeing to indicate that
the codetag is expected to be empty.
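
The sentinel approach can be modelled as follows. This is a sketch under
assumptions, not the kernel code: struct tag, struct tag_ref, TAG_EMPTY,
ref_set_empty, ref_sub and the warnings counter are hypothetical names, and
the counter stands in for the debug warning. A NULL reference is treated as
suspicious, while the sentinel marks a reference that is empty on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tag {
	size_t bytes;
};

struct tag_ref {
	struct tag *ct;
};

/* Sentinel that is never a valid object address; mirrors (void *)1. */
#define TAG_EMPTY ((struct tag *)1)

/* Counts how often the debug check would have fired. */
static unsigned int warnings;

/* Mark a reference as intentionally empty. */
static void ref_set_empty(struct tag_ref *ref)
{
	if (ref)
		ref->ct = TAG_EMPTY;
}

/* Uncharge bytes; returns true only when a real tag was uncharged. */
static bool ref_sub(struct tag_ref *ref, size_t bytes)
{
	if (!ref->ct) {
		/* Unexpectedly missing tag: this is what the debug check flags. */
		warnings++;
		return false;
	}
	if (ref->ct == TAG_EMPTY) {
		/* Expected to be empty: clear the marker without warning. */
		ref->ct = NULL;
		return false;
	}
	ref->ct->bytes -= bytes;
	ref->ct = NULL;
	return true;
}
```

The sentinel costs no extra storage because it reuses the pointer field
itself.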

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h | 26 ++
 mm/slab.h | 33 +
 mm/slab_common.c  |  1 +
 3 files changed, 60 insertions(+)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 0a5973c4ad77..1f3207097b03 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -77,6 +77,27 @@ static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
return v;
 }
 
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+#define CODETAG_EMPTY  (void *)1
+
+static inline bool is_codetag_empty(union codetag_ref *ref)
+{
+   return ref->ct == CODETAG_EMPTY;
+}
+
+static inline void set_codetag_empty(union codetag_ref *ref)
+{
+   if (ref)
+   ref->ct = CODETAG_EMPTY;
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
 static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
 {
struct alloc_tag *tag;
@@ -87,6 +108,11 @@ static inline void __alloc_tag_sub(union codetag_ref *ref, size_t bytes)
if (!ref || !ref->ct)
return;
 
+   if (is_codetag_empty(ref)) {
+   ref->ct = NULL;
+   return;
+   }
+
tag = ct_to_alloc_tag(ref->ct);
 
this_cpu_sub(tag->counters->bytes, bytes);
diff --git a/mm/slab.h b/mm/slab.h
index 4859ce1f8808..45216bad34b8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -455,6 +455,31 @@ static inline struct slabobj_ext *slab_obj_exts(struct slab *slab)
 int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab);
 
+
+#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
+{
+   struct slabobj_ext *slab_exts;
+   struct slab *obj_exts_slab;
+
+   obj_exts_slab = virt_to_slab(obj_exts);
+   slab_exts = slab_obj_exts(obj_exts_slab);
+   if (slab_exts) {
+   unsigned int offs = obj_to_index(obj_exts_slab->slab_cache,
+obj_exts_slab, obj_exts);
+   /* codetag should be NULL */
+   WARN_ON(slab_exts[offs].ref.ct);
+   set_codetag_empty(&slab_exts[offs].ref);
+   }
+}
+
+#else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
+static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
+
+#endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
+
 static inline bool need_slab_obj_ext(void)
 {
 #ifdef CONFIG_MEM_ALLOC_PROFILING
@@ -476,6 +501,14 @@ static inline void free_slab_obj_exts(struct slab *slab)
if (!obj_exts)
return;
 
+   /*
+* obj_exts was created with __GFP_NO_OBJ_EXT flag, therefore its
+* corresponding extension will be NULL. alloc_tag_sub() will throw a
+* warning if slab has extensions but the extension of an object is
+* NULL, therefore replace NULL with CODETAG_EMPTY to indicate that
+* the extension for obj_exts is expected to be NULL.
+*/
+   mark_objexts_empty(obj_exts);
kfree(obj_exts);
slab->obj_exts = 0;
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8ef5e47ff6a7..db2cd7afc353 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -246,6 +246,7 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 * assign slabobj_exts in parallel. In this case the existing
 * objcg vector should be reused.
 */
+   mark_objexts_empty(vec);
kfree(vec);
return 0;
}
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 38/39] codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations

2023-10-24 Thread Suren Baghdasaryan

If the slabobj_ext vector allocation for a slab object fails and later
succeeds for another object in the same slab, the slabobj_ext for the
original object will be NULL and will be flagged when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled.
Mark failed slabobj_ext vector allocations using a new objext_flags flag
stored in the lower bits of slab->obj_exts. When a later allocation
succeeds, it marks all tag references in the same slabobj_ext vector as
empty to avoid the warnings raised by CONFIG_MEM_ALLOC_PROFILING_DEBUG checks.
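
Storing a flag in the low bits of a pointer-sized field can be sketched in
userspace as below. The names (struct slab_model, struct ext,
EXTS_ALLOC_FAIL, EXTS_FLAGS_MASK, REF_EMPTY, slab_alloc_exts) are
hypothetical, and the model deliberately ignores concurrency and the
new_slab distinction made in the patch. It assumes calloc() returns memory
aligned to at least 2 bytes, so bit 0 of the pointer is free for the flag.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EXTS_ALLOC_FAIL	((uintptr_t)1)	/* hypothetical flag bit */
#define EXTS_FLAGS_MASK	((uintptr_t)1)

/* Marks a reference that is empty on purpose, as opposed to NULL. */
#define REF_EMPTY	((void *)1)

struct ext {
	void *ref;
};

struct slab_model {
	uintptr_t obj_exts;	/* vector pointer with flags in the low bits */
	unsigned int objects;
};

/* Strip the flag bits to recover the vector pointer. */
static struct ext *slab_exts(const struct slab_model *s)
{
	return (struct ext *)(s->obj_exts & ~EXTS_FLAGS_MASK);
}

/* Try to attach a vector; simulate_failure forces the allocation to fail. */
static int slab_alloc_exts(struct slab_model *s, int simulate_failure)
{
	uintptr_t old = s->obj_exts;
	struct ext *vec;
	unsigned int i;

	vec = simulate_failure ? NULL : calloc(s->objects, sizeof(*vec));
	if (!vec) {
		/* Remember the failure, but never clobber an existing vector. */
		if (!slab_exts(s))
			s->obj_exts = EXTS_ALLOC_FAIL;
		return -1;
	}

	/* Objects that lived through the failed attempt carry no reference. */
	if (old & EXTS_ALLOC_FAIL)
		for (i = 0; i < s->objects; i++)
			vec[i].ref = REF_EMPTY;

	s->obj_exts = (uintptr_t)vec;
	return 0;
}
```

A slab that never saw a failure gets a zeroed vector, so the empty marking
only appears where untagged live objects are actually possible.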

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/memcontrol.h |  4 +++-
 mm/slab.h  | 25 +
 mm/slab_common.c   | 22 +++---
 3 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 853a24b5f713..6b680ca424e3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -363,8 +363,10 @@ enum page_memcg_data_flags {
 #endif /* CONFIG_MEMCG */
 
 enum objext_flags {
+   /* slabobj_ext vector failed to allocate */
+   OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
/* the next bit after the last actual flag */
-   __NR_OBJEXTS_FLAGS  = __FIRST_OBJEXT_FLAG,
+   __NR_OBJEXTS_FLAGS  = (__FIRST_OBJEXT_FLAG << 1),
 };
 
 #define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
diff --git a/mm/slab.h b/mm/slab.h
index 45216bad34b8..1736268892e6 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -474,9 +474,34 @@ static inline void mark_objexts_empty(struct slabobj_ext *obj_exts)
}
 }
 
+static inline void mark_failed_objexts_alloc(struct slab *slab)
+{
+   slab->obj_exts = OBJEXTS_ALLOC_FAIL;
+}
+
+static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
+   struct slabobj_ext *vec, unsigned int objects)
+{
+   /*
+* If vector previously failed to allocate then we have live
+* objects with no tag reference. Mark all references in this
+* vector as empty to avoid warnings later on.
+*/
+   if (obj_exts & OBJEXTS_ALLOC_FAIL) {
+   unsigned int i;
+
+   for (i = 0; i < objects; i++)
+   set_codetag_empty(&vec[i].ref);
+   }
+}
+
+
 #else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
 
 static inline void mark_objexts_empty(struct slabobj_ext *obj_exts) {}
+static inline void mark_failed_objexts_alloc(struct slab *slab) {}
+static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
+   struct slabobj_ext *vec, unsigned int objects) {}
 
 #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index db2cd7afc353..cea73314f919 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -218,29 +218,37 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab)
 {
unsigned int objects = objs_per_slab(s, slab);
-   unsigned long obj_exts;
-   void *vec;
+   unsigned long new_exts;
+   unsigned long old_exts;
+   struct slabobj_ext *vec;
 
gfp &= ~OBJCGS_CLEAR_MASK;
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
   slab_nid(slab));
-   if (!vec)
+   if (!vec) {
+   /* Mark vectors which failed to allocate */
+   if (new_slab)
+   mark_failed_objexts_alloc(slab);
+
return -ENOMEM;
+   }
 
-   obj_exts = (unsigned long)vec;
+   new_exts = (unsigned long)vec;
 #ifdef CONFIG_MEMCG
-   obj_exts |= MEMCG_DATA_OBJEXTS;
+   new_exts |= MEMCG_DATA_OBJEXTS;
 #endif
+   old_exts = slab->obj_exts;
+   handle_failed_objexts_alloc(old_exts, vec, objects);
if (new_slab) {
/*
 * If the slab is brand new and nobody can yet access its
 * obj_exts, no synchronization is required and obj_exts can
 * be simply assigned.
 */
-   slab->obj_exts = obj_exts;
-   } else if (cmpxchg(&slab->obj_exts, 0, obj_exts)) {
+   slab->obj_exts = new_exts;
+   } else if (cmpxchg(&slab->obj_exts, old_exts, new_exts) != old_exts) {
/*
 * If the slab is already in use, somebody can allocate and
 * assign slabobj_exts in parallel. In this case the existing
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 39/39] MAINTAINERS: Add entries for code tagging and memory allocation profiling

2023-10-24 Thread Suren Baghdasaryan

From: Kent Overstreet 

The new code & libraries added are being maintained - mark them as such.

Signed-off-by: Kent Overstreet 
Signed-off-by: Suren Baghdasaryan 
---
 MAINTAINERS | 16 
 1 file changed, 16 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2894f0777537..22e51de42131 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5118,6 +5118,13 @@ S:   Supported
 F: Documentation/process/code-of-conduct-interpretation.rst
 F: Documentation/process/code-of-conduct.rst
 
+CODE TAGGING
+M: Suren Baghdasaryan 
+M: Kent Overstreet 
+S: Maintained
+F: include/linux/codetag.h
+F: lib/codetag.c
+
 COMEDI DRIVERS
 M: Ian Abbott 
 M: H Hartley Sweeten 
@@ -13708,6 +13715,15 @@ F: mm/memblock.c
 F: mm/mm_init.c
 F: tools/testing/memblock/
 
+MEMORY ALLOCATION PROFILING
+M: Suren Baghdasaryan 
+M: Kent Overstreet 
+S: Maintained
+F: include/linux/alloc_tag.h
+F: include/linux/codetag_ctx.h
+F: lib/alloc_tag.c
+F: lib/pgalloc_tag.c
+
 MEMORY CONTROLLER DRIVERS
 M: Krzysztof Kozlowski 
 L: linux-ker...@vger.kernel.org
-- 
2.42.0.758.gaed0368e0e-goog

[PATCH v2 37/39] codetag: debug: mark codetags for reserved pages as empty

2023-10-24 Thread Suren Baghdasaryan

To avoid debug warnings while freeing reserved pages which were not
allocated with the usual allocators, mark their codetags as empty before
freeing.
Maybe we can annotate reserved pages correctly and avoid this?

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h   | 2 ++
 include/linux/mm.h  | 8 
 include/linux/pgalloc_tag.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 1f3207097b03..102caf62c2a9 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -95,6 +95,7 @@ static inline void set_codetag_empty(union codetag_ref *ref)
 #else /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
 
 static inline bool is_codetag_empty(union codetag_ref *ref) { return false; }
+static inline void set_codetag_empty(union codetag_ref *ref) {}
 
 #endif /* CONFIG_MEM_ALLOC_PROFILING_DEBUG */
 
@@ -155,6 +156,7 @@ static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes) {}
 static inline void alloc_tag_sub_noalloc(union codetag_ref *ref, size_t bytes) {}
 static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag,
 size_t bytes) {}
+static inline void set_codetag_empty(union codetag_ref *ref) {}
 
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..310129414833 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3077,6 +3078,13 @@ extern void reserve_bootmem_region(phys_addr_t start,
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void free_reserved_page(struct page *page)
 {
+   union codetag_ref *ref;
+
+   ref = get_page_tag_ref(page);
+   if (ref) {
+   set_codetag_empty(ref);
+   put_page_tag_ref(ref);
+   }
ClearPageReserved(page);
init_page_count(page);
__free_page(page);
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 0174aff5e871..ae9b0f359264 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -93,6 +93,8 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
 
 #else /* CONFIG_MEM_ALLOC_PROFILING */
 
+static inline union codetag_ref *get_page_tag_ref(struct page *page) { return NULL; }
+static inline void put_page_tag_ref(union codetag_ref *ref) {}
 static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
   unsigned int order) {}
 static inline void pgalloc_tag_sub(struct page *page, unsigned int order) {}
-- 
2.42.0.758.gaed0368e0e-goog

Re: [PATCH v2 00/39] Memory allocation profiling

2023-10-24 Thread Suren Baghdasaryan

On Tue, Oct 24, 2023 at 11:29 AM Roman Gushchin wrote:
>
> On Tue, Oct 24, 2023 at 06:45:57AM -0700, Suren Baghdasaryan wrote:
> > Updates since the last version [1]
> > - Simplified allocation tagging macros;
> > - Runtime enable/disable sysctl switch (/proc/sys/vm/mem_profiling)
> > instead of kernel command-line option;
> > - CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT to select default enable state;
> > - Changed the user-facing API from debugfs to procfs (/proc/allocinfo);
> > - Removed context capture support to make patch incremental;
> > - Renamed uninstrumented allocation functions to use _noprof suffix;
> > - Added __GFP_LAST_BIT to make the code cleaner;
> > - Removed lazy per-cpu counters; it turned out the memory savings was
> > minimal and not worth the performance impact;
>
> Hello Suren,
>
> > Performance overhead:
> > To evaluate performance we implemented an in-kernel test executing
> > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > affinity set to a specific CPU to minimize the noise. Below is performance
> > comparison between the baseline kernel, profiling when enabled, profiling
> > when disabled and (for comparison purposes) baseline with
> > CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> >
> >                         kmalloc                 pgalloc
> > (1 baseline)            12.041s                 49.190s
> > (2 default disabled)    14.970s (+24.33%)       49.684s (+1.00%)
> > (3 default enabled)     16.859s (+40.01%)       56.287s (+14.43%)
> > (4 runtime enabled)     16.983s (+41.04%)       55.760s (+13.36%)
> > (5 memcg)               33.831s (+180.96%)      51.433s (+4.56%)
>
> some recent changes [1] to the kmem accounting should have made it quite a bit
> faster. Would be great if you can provide new numbers for the comparison.
> Maybe with the next revision?
>
> And btw thank you (and Kent): your numbers inspired me to do this kmemcg
> performance work. I expect it still to be ~twice more expensive than your
> stuff because on the memcg side we handle separately charge and statistics,
> but hopefully the difference will be lower.

Yes, I saw them! Well done! I'll definitely update my numbers once the
patches land in their final form.

>
> Thank you!

Thank you for the optimizations!

>
> [1]:
>   patches from next tree, so no stable hashes:
> mm: kmem: reimplement get_obj_cgroup_from_current()
> percpu: scoped objcg protection
> mm: kmem: scoped objcg protection
> mm: kmem: make memcg keep a reference to the original objcg
> mm: kmem: add direct objcg pointer to task_struct
> mm: kmem: optimize get_obj_cgroup_from_current()

Re: [PATCH v2 06/39] mm: enumerate all gfp flags

2023-10-25 Thread Suren Baghdasaryan

On Tue, Oct 24, 2023 at 10:47 PM Petr Tesařík  wrote:
>
> On Tue, 24 Oct 2023 06:46:03 -0700
> Suren Baghdasaryan  wrote:
>
> > Introduce GFP bits enumeration to let compiler track the number of used
> > bits (which depends on the config options) instead of hardcoding them.
> > That simplifies __GFP_BITS_SHIFT calculation.
> > Suggested-by: Petr Tesařík 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/gfp_types.h | 90 +++
> >  1 file changed, 62 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> > index 6583a58670c5..3fbe624763d9 100644
> > --- a/include/linux/gfp_types.h
> > +++ b/include/linux/gfp_types.h
> > @@ -21,44 +21,78 @@ typedef unsigned int __bitwise gfp_t;
> >   * include/trace/events/mmflags.h and tools/perf/builtin-kmem.c
> >   */
> >
> > +enum {
> > + ___GFP_DMA_BIT,
> > + ___GFP_HIGHMEM_BIT,
> > + ___GFP_DMA32_BIT,
> > + ___GFP_MOVABLE_BIT,
> > + ___GFP_RECLAIMABLE_BIT,
> > + ___GFP_HIGH_BIT,
> > + ___GFP_IO_BIT,
> > + ___GFP_FS_BIT,
> > + ___GFP_ZERO_BIT,
> > + ___GFP_UNUSED_BIT,  /* 0x200u unused */
> > + ___GFP_DIRECT_RECLAIM_BIT,
> > + ___GFP_KSWAPD_RECLAIM_BIT,
> > + ___GFP_WRITE_BIT,
> > + ___GFP_NOWARN_BIT,
> > + ___GFP_RETRY_MAYFAIL_BIT,
> > + ___GFP_NOFAIL_BIT,
> > + ___GFP_NORETRY_BIT,
> > + ___GFP_MEMALLOC_BIT,
> > + ___GFP_COMP_BIT,
> > + ___GFP_NOMEMALLOC_BIT,
> > + ___GFP_HARDWALL_BIT,
> > + ___GFP_THISNODE_BIT,
> > + ___GFP_ACCOUNT_BIT,
> > + ___GFP_ZEROTAGS_BIT,
> > +#ifdef CONFIG_KASAN_HW_TAGS
> > + ___GFP_SKIP_ZERO_BIT,
> > + ___GFP_SKIP_KASAN_BIT,
> > +#endif
> > +#ifdef CONFIG_LOCKDEP
> > + ___GFP_NOLOCKDEP_BIT,
> > +#endif
> > + ___GFP_LAST_BIT
> > +};
> > +
> >  /* Plain integer GFP bitmasks. Do not use this directly. */
> > -#define ___GFP_DMA   0x01u
> > -#define ___GFP_HIGHMEM   0x02u
> > -#define ___GFP_DMA32 0x04u
> > -#define ___GFP_MOVABLE   0x08u
> > -#define ___GFP_RECLAIMABLE   0x10u
> > -#define ___GFP_HIGH  0x20u
> > -#define ___GFP_IO    0x40u
> > -#define ___GFP_FS    0x80u
> > -#define ___GFP_ZERO  0x100u
> > +#define ___GFP_DMA   BIT(___GFP_DMA_BIT)
> > +#define ___GFP_HIGHMEM   BIT(___GFP_HIGHMEM_BIT)
> > +#define ___GFP_DMA32 BIT(___GFP_DMA32_BIT)
> > +#define ___GFP_MOVABLE   BIT(___GFP_MOVABLE_BIT)
> > +#define ___GFP_RECLAIMABLE   BIT(___GFP_RECLAIMABLE_BIT)
> > +#define ___GFP_HIGH  BIT(___GFP_HIGH_BIT)
> > +#define ___GFP_IO    BIT(___GFP_IO_BIT)
> > +#define ___GFP_FS    BIT(___GFP_FS_BIT)
> > +#define ___GFP_ZERO  BIT(___GFP_ZERO_BIT)
> >  /* 0x200u unused */
>
> This comment can be also removed here, because it is already stated
> above with the definition of ___GFP_UNUSED_BIT.

Ack.

>
> Then again, I think that the GFP bits have never been compacted after
> Neil Brown removed __GFP_ATOMIC with commit 2973d8229b78 simply because
> that would mean changing definitions of all subsequent GFP flags. FWIW
> I am not aware of any code that would depend on the numeric value of
> ___GFP_* macros, so this patch seems like a good opportunity to change
> the numbering and get rid of this unused 0x200u altogether.
>
> @Neil: I have added you to the conversation in case you want to correct
> my understanding of the unused bit.

Hmm. I would prefer to do that in a separate patch even though it
would be a one-line change. Seems safer to me in case something goes
wrong and we have to bisect and revert it. If that sounds ok I'll post
that in the next version.

>
> Other than that LGTM.

Thanks for the review!
Suren.

>
> Petr T
>
> > -#define ___GFP_DIRECT_RECLAIM    0x400u
> > -#define ___GFP_KSWAPD_RECLAIM    0x800u
> > -#define ___GFP_WRITE 0x1000u
> > -#define ___GFP_NOWARN    0x2000u
> > -#define ___GFP_RETRY_MAYFAIL 0x4000u
> > -#define ___GFP_NOFAIL    0x8000u
> > -#define ___GFP_NORETRY   0x10000u
> > -#define ___GFP_MEMALLOC  0x20000u
> > -#define ___GFP_COMP  0x40000u
> > -#define ___GFP_NOMEMALLOC    0x80000u
> > -#define ___GFP_HARDWALL  0x100000u
> > -#define ___GFP_THISNODE  0x200000u
> > -#define ___GFP_ACCOUNT
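
The enum-based numbering discussed in this thread can be tried outside the kernel. Below is a minimal user-space sketch, not the patch itself: every DEMO_* name and the DEMO_CONFIG_LOCKDEP switch are hypothetical stand-ins, and DEMO_BIT() only mimics the kernel's BIT() macro. It shows how a conditionally compiled enumerator moves the final *_LAST_BIT value, so the shift and mask follow the configuration without a hardcoded count.

```c
/* Hypothetical stand-in for a Kconfig option such as CONFIG_LOCKDEP. */
#define DEMO_CONFIG_LOCKDEP 1

/* Mimics the kernel's BIT() helper for this sketch. */
#define DEMO_BIT(nr) (1u << (nr))

/* Each enumerator takes the next free bit number; the compiler counts. */
enum {
	DEMO_GFP_DMA_BIT,
	DEMO_GFP_HIGHMEM_BIT,
	DEMO_GFP_IO_BIT,
#if DEMO_CONFIG_LOCKDEP
	DEMO_GFP_NOLOCKDEP_BIT,	/* present only when the option is on */
#endif
	DEMO_GFP_LAST_BIT	/* equals the number of bits in use */
};

#define DEMO_GFP_DMA		DEMO_BIT(DEMO_GFP_DMA_BIT)
#define DEMO_GFP_HIGHMEM	DEMO_BIT(DEMO_GFP_HIGHMEM_BIT)
#define DEMO_GFP_IO		DEMO_BIT(DEMO_GFP_IO_BIT)

/* No hardcoded count: both values track the enum. */
#define DEMO_GFP_BITS_SHIFT	DEMO_GFP_LAST_BIT
#define DEMO_GFP_BITS_MASK	(DEMO_BIT(DEMO_GFP_BITS_SHIFT) - 1)

/* Returns the mask covering every bit currently in use. */
static unsigned int demo_gfp_bits_mask(void)
{
	return DEMO_GFP_BITS_MASK;
}
```

Setting DEMO_CONFIG_LOCKDEP to 0 drops one enumerator, and the shift and mask shrink with it. Removing an unused placeholder bit, as suggested in the review, would renumber every later flag in the same way.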

Re: [PATCH v2 28/39] timekeeping: Fix a circular include dependency

2023-10-26 Thread Suren Baghdasaryan

On Wed, Oct 25, 2023 at 5:33 PM Thomas Gleixner  wrote:
>
> On Tue, Oct 24 2023 at 06:46, Suren Baghdasaryan wrote:
> > From: Kent Overstreet 
> >
> > This avoids a circular header dependency in an upcoming patch by only
> > making hrtimer.h depend on percpu-defs.h
>
> What's the actual dependency problem?

Sorry for the delay.
When we instrument per-cpu allocations in [1] we need to include
sched.h in percpu.h to be able to use alloc_tag_save(). sched.h
includes hrtimer.h. So, without this change we end up with a circular
inclusion: percpu.h->sched.h->hrtimer.h->percpu.h

[1] https://lore.kernel.org/all/20231024134637.3120277-32-sur...@google.com/

>

Re: [PATCH v6 30/37] mm: vmalloc: Enable memory allocation profiling

2024-03-25 Thread Suren Baghdasaryan

On Mon, Mar 25, 2024 at 10:49 AM SeongJae Park  wrote:
>
> On Mon, 25 Mar 2024 14:56:01 + Suren Baghdasaryan  
> wrote:
>
> > On Sat, Mar 23, 2024 at 6:05 PM SeongJae Park  wrote:
> > >
> > > Hi Suren and Kent,
> > >
> > > On Thu, 21 Mar 2024 09:36:52 -0700 Suren Baghdasaryan  
> > > wrote:
> > >
> > > > From: Kent Overstreet 
> > > >
> > > > This wraps all external vmalloc allocation functions with the
> > > > alloc_hooks() wrapper, and switches internal allocations to _noprof
> > > > variants where appropriate, for the new memory allocation profiling
> > > > feature.
> > >
> > > I just noticed latest mm-unstable fails running kunit on my machine as 
> > > below.
> > > 'git-bisect' says this is the first commit of the failure.
> > >
> > > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > > [10:59:53] Configuring KUnit Kernel ...
> > > [10:59:53] Building KUnit Kernel ...
> > > Populating config with:
> > > $ make ARCH=um O=../kunit.out/ olddefconfig
> > > Building with:
> > > $ make ARCH=um O=../kunit.out/ --jobs=36
> > > ERROR:root:/usr/bin/ld: arch/um/os-Linux/main.o: in function 
> > > `__wrap_malloc':
> > > main.c:(.text+0x10b): undefined reference to `vmalloc'
> > > collect2: error: ld returned 1 exit status
> > >
> > > Haven't looked into the code yet, but reporting first.  May I ask your 
> > > idea?
> >
> > Hi SeongJae,
> > Looks like we missed adding "#include <linux/vmalloc.h>" inside
> > arch/um/os-Linux/main.c in this patch:
> > https://lore.kernel.org/all/20240321163705.3067592-2-sur...@google.com/.
> > I'll be posting fixes for all 0-day issues found over the weekend and
> > will include a fix for this. In the meantime, to work around it you
> > can add that include yourself. Please let me know if the issue still
> > persists after doing that.
>
> Thank you, Suren.  The change made the error message disappear.  However, it
> introduced another one.

Ok, let me investigate and I'll try to get a fix for it this evening.
Thanks,
Suren.

>
> $ git diff
> diff --git a/arch/um/os-Linux/main.c b/arch/um/os-Linux/main.c
> index c8a42ecbd7a2..8fe274e9f3a4 100644
> --- a/arch/um/os-Linux/main.c
> +++ b/arch/um/os-Linux/main.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/vmalloc.h>
>
>  #define PGD_BOUND (4 * 1024 * 1024)
>  #define STACKSIZE (8 * 1024 * 1024)
> $
> $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> [10:43:13] Configuring KUnit Kernel ...
> [10:43:13] Building KUnit Kernel ...
> Populating config with:
> $ make ARCH=um O=../kunit.out/ olddefconfig
> Building with:
> $ make ARCH=um O=../kunit.out/ --jobs=36
> ERROR:root:In file included from .../arch/um/kernel/asm-offsets.c:1:
> .../arch/x86/um/shared/sysdep/kernel-offsets.h:9:6: warning: no previous 
> prototype for ‘foo’ [-Wmissing-prototypes]
> 9 | void foo(void)
>   |  ^~~
> In file included from .../include/linux/alloc_tag.h:8,
>  from .../include/linux/vmalloc.h:5,
>  from .../arch/um/os-Linux/main.c:19:
> .../include/linux/bug.h:5:10: fatal error: asm/bug.h: No such file or 
> directory
> 5 | #include <asm/bug.h>
>   |  ^~~
> compilation terminated.
>
>
> Thanks,
> SJ
>
> [...]

Re: [PATCH v6 30/37] mm: vmalloc: Enable memory allocation profiling

2024-03-26 Thread Suren Baghdasaryan

On Mon, Mar 25, 2024 at 11:20 AM SeongJae Park  wrote:
>
> On Mon, 25 Mar 2024 10:59:01 -0700 Suren Baghdasaryan  
> wrote:
>
> > On Mon, Mar 25, 2024 at 10:49 AM SeongJae Park  wrote:
> > >
> > > On Mon, 25 Mar 2024 14:56:01 + Suren Baghdasaryan  
> > > wrote:
> > >
> > > > On Sat, Mar 23, 2024 at 6:05 PM SeongJae Park  wrote:
> > > > >
> > > > > Hi Suren and Kent,
> > > > >
> > > > > On Thu, 21 Mar 2024 09:36:52 -0700 Suren Baghdasaryan 
> > > > >  wrote:
> > > > >
> > > > > > From: Kent Overstreet 
> > > > > >
> > > > > > This wraps all external vmalloc allocation functions with the
> > > > > > alloc_hooks() wrapper, and switches internal allocations to _noprof
> > > > > > variants where appropriate, for the new memory allocation profiling
> > > > > > feature.
> > > > >
> > > > > I just noticed latest mm-unstable fails running kunit on my machine 
> > > > > as below.
> > > > > 'git-bisect' says this is the first commit of the failure.
> > > > >
> > > > > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > > > > [10:59:53] Configuring KUnit Kernel ...
> > > > > [10:59:53] Building KUnit Kernel ...
> > > > > Populating config with:
> > > > > $ make ARCH=um O=../kunit.out/ olddefconfig
> > > > > Building with:
> > > > > $ make ARCH=um O=../kunit.out/ --jobs=36
> > > > > ERROR:root:/usr/bin/ld: arch/um/os-Linux/main.o: in function 
> > > > > `__wrap_malloc':
> > > > > main.c:(.text+0x10b): undefined reference to `vmalloc'
> > > > > collect2: error: ld returned 1 exit status
> > > > >
> > > > > Haven't looked into the code yet, but reporting first.  May I ask 
> > > > > your idea?
> > > >
> > > > Hi SeongJae,
> > > > Looks like we missed adding "#include <linux/vmalloc.h>" inside
> > > > arch/um/os-Linux/main.c in this patch:
> > > > https://lore.kernel.org/all/20240321163705.3067592-2-sur...@google.com/.
> > > > I'll be posting fixes for all 0-day issues found over the weekend and
> > > > will include a fix for this. In the meantime, to work around it you
> > > > can add that include yourself. Please let me know if the issue still
> > > > persists after doing that.
> > >
> > > Thank you, Suren.  The change made the error message disappear.  However,
> > > it introduced another one.
> >
> > Ok, let me investigate and I'll try to get a fix for it this evening.
>
> Thank you for this kind reply.  Nonetheless, this is not blocking any real
> work on my side.  So, no rush.  Please take your time :)

I posted a fix here:
https://lore.kernel.org/all/20240326073750.726636-1-sur...@google.com/
Please let me know if this resolves the issue.
Thanks,
Suren.

>
>
> Thanks,
> SJ
>
> > Thanks,
> > Suren.
> >
> > >
> > > $ git diff
> > > diff --git a/arch/um/os-Linux/main.c b/arch/um/os-Linux/main.c
> > > index c8a42ecbd7a2..8fe274e9f3a4 100644
> > > --- a/arch/um/os-Linux/main.c
> > > +++ b/arch/um/os-Linux/main.c
> > > @@ -16,6 +16,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include <linux/vmalloc.h>
> > >
> > >  #define PGD_BOUND (4 * 1024 * 1024)
> > >  #define STACKSIZE (8 * 1024 * 1024)
> > > $
> > > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > > [10:43:13] Configuring KUnit Kernel ...
> > > [10:43:13] Building KUnit Kernel ...
> > > Populating config with:
> > > $ make ARCH=um O=../kunit.out/ olddefconfig
> > > Building with:
> > > $ make ARCH=um O=../kunit.out/ --jobs=36
> > > ERROR:root:In file included from .../arch/um/kernel/asm-offsets.c:1:
> > > .../arch/x86/um/shared/sysdep/kernel-offsets.h:9:6: warning: no 
> > > previous prototype for ‘foo’ [-Wmissing-prototypes]
> > > 9 | void foo(void)
> > >   |  ^~~
> > > In file included from .../include/linux/alloc_tag.h:8,
> > >  from .../include/linux/vmalloc.h:5,
> > >  from .../arch/um/os-Linux/main.c:19:
> > > .../include/linux/bug.h:5:10: fatal error: asm/bug.h: No such file or 
> > > directory
> > > 5 | #include <asm/bug.h>
> > >   |  ^~~
> > > compilation terminated.
> > >
> > >
> > > Thanks,
> > > SJ
> > >
> > > [...]
> >
>

[PATCH 0/5] page allocation tag compression

2024-08-19 Thread Suren Baghdasaryan

This patchset implements several improvements:
1. Gracefully handles module unloading while allocations made by that
module are still in use;
2. Provides an option to reduce memory overhead from storing page
allocation references by indexing allocation tags;
3. Provides an option to store page allocation tag references in the
page flags, removing dependency on page extensions and eliminating the
memory overhead from storing page allocation references (~0.2% of total
system memory).
4. Improves page allocation performance when CONFIG_MEM_ALLOC_PROFILING
is enabled by eliminating page extension lookup. Page allocation
performance overhead is reduced from 14% to 5.5%.
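
The ~0.2% figure in item 3 is consistent with simple arithmetic if one assumes a pointer-sized (8-byte) reference stored per 4 KiB page: 8 / 4096 is roughly 0.195%. Both of those sizes are assumptions made for this note, not numbers stated in the cover letter, and the DEMO_ names are hypothetical. A small sketch of the calculation, which also shows what narrower references would cost:

```c
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define DEMO_PAGE_SIZE 4096u

/*
 * Per-page reference overhead in basis points (1 bp = 0.01%),
 * rounded down, for a reference of ref_bytes stored per page.
 */
static unsigned int demo_ref_overhead_bp(size_t ref_bytes)
{
	return (unsigned int)(ref_bytes * 10000u / DEMO_PAGE_SIZE);
}
```

Under these assumptions an 8-byte reference costs 19 bp (about 0.2%), a 4-byte index about half of that, and a 2-byte index about a quarter. Storing the reference in page flags removes this per-page cost entirely.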

Patch #1 copies module tags into virtually contiguous memory which
serves two purposes:
- Lets us deal with the situation when a module is unloaded while there
are still live allocations from that module. Since we are using a copy
of the tags, we can safely unload the module. Used space and gaps in
this contiguous memory are managed using a maple tree.
- Enables simple indexing of the tags in the later patches.

The size of the preallocated virtually contiguous memory can be configured
using the max_module_alloc_tags kernel parameter.

Patch #2 is a code cleanup to simplify later changes.

Patch #3 abstracts page allocation tag reference to simplify later
changes.

Patch #4 lets us control page allocation tag reference sizes and
introduces tag indexing.

Patch #5 adds a config to store page allocation tag references inside
page flags if they fit.

Patchset applies to mm-unstable.

Suren Baghdasaryan (5):
  alloc_tag: load module tags into separate continuous memory
  alloc_tag: eliminate alloc_tag_ref_set
  alloc_tag: introduce pgalloc_tag_ref to abstract page tag references
  alloc_tag: make page allocation tag reference size configurable
  alloc_tag: config to store page allocation tag refs in page flags

 .../admin-guide/kernel-parameters.txt |   4 +
 include/asm-generic/codetag.lds.h |  19 ++
 include/linux/alloc_tag.h |  46 ++-
 include/linux/codetag.h   |  38 ++-
 include/linux/mmzone.h|   3 +
 include/linux/page-flags-layout.h |  10 +-
 include/linux/pgalloc_tag.h   | 257 ---
 kernel/module/main.c  |  67 ++--
 lib/Kconfig.debug |  36 ++-
 lib/alloc_tag.c   | 300 --
 lib/codetag.c | 105 +-
 mm/mm_init.c  |   1 +
 mm/page_ext.c |   2 +-
 scripts/module.lds.S  |   5 +-
 14 files changed, 759 insertions(+), 134 deletions(-)


base-commit: 651c8c1d735983040bec4f71d0e2e690f3c0fc2d
-- 
2.46.0.184.g6999bdac58-goog

[PATCH 1/5] alloc_tag: load module tags into separate continuous memory

2024-08-19 Thread Suren Baghdasaryan

When a module gets unloaded there is a possibility that some of the
allocations it made are still used and therefore the allocation tags
corresponding to these allocations are still referenced. As such, the
memory for these tags can't be freed. This is currently handled as an
abnormal situation and module's data section is not being unloaded.
To handle this situation without keeping module's data in memory,
allow codetags with longer lifespan than the module to be loaded into
their own separate memory. The in-use memory areas and gaps after
module unloading in this separate memory are tracked using maple trees.
Allocation tags arrange their separate memory so that it is virtually
contiguous and that will allow simple allocation tag indexing later on
in this patchset. The size of this virtually contiguous memory is set
to store up to 100000 allocation tags and max_module_alloc_tags kernel
parameter is introduced to change this size.
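
To make the idea of one contiguous tag area with reusable gaps concrete, here is a deliberately simplified user-space sketch. It is not the patch's implementation: the patch tracks used areas and gaps with maple trees, whereas this sketch swaps in a plain boolean array with a first-fit scan, and all demo_* names and the 16-slot capacity are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity of the contiguous tag area, in tag slots. */
#define DEMO_MAX_TAGS 16

/* true = slot holds a tag that may still be referenced. */
static bool demo_used[DEMO_MAX_TAGS];

/*
 * First-fit: reserve `count` consecutive free slots for a module.
 * Returns the index of the first slot, or -1 if no gap is large enough.
 */
static int demo_reserve(size_t count)
{
	size_t run = 0;

	if (count == 0 || count > DEMO_MAX_TAGS)
		return -1;

	for (size_t i = 0; i < DEMO_MAX_TAGS; i++) {
		run = demo_used[i] ? 0 : run + 1;
		if (run == count) {
			size_t start = i + 1 - count;

			for (size_t j = start; j <= i; j++)
				demo_used[j] = true;
			return (int)start;
		}
	}
	return -1;
}

/* Release a module's slots once none of its tags are referenced. */
static void demo_release(size_t start, size_t count)
{
	for (size_t j = start; j < start + count && j < DEMO_MAX_TAGS; j++)
		demo_used[j] = false;
}
```

Because every tag lives at a fixed offset inside one area, a tag can later be named by its slot number instead of by a full pointer, which is what the indexing patches rely on.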

Signed-off-by: Suren Baghdasaryan 
---
 .../admin-guide/kernel-parameters.txt |   4 +
 include/asm-generic/codetag.lds.h |  19 ++
 include/linux/alloc_tag.h |  13 +-
 include/linux/codetag.h   |  35 ++-
 kernel/module/main.c  |  67 +++--
 lib/alloc_tag.c   | 245 --
 lib/codetag.c | 101 +++-
 scripts/module.lds.S  |   5 +-
 8 files changed, 429 insertions(+), 60 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d0d141d50638..17f9f811a9c0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3205,6 +3205,10 @@
devices can be requested on-demand with the
/dev/loop-control interface.
 
+
+   max_module_alloc_tags=  [KNL] Max supported number of allocation tags
+   in modules.
+
mce [X86-32] Machine Check Exception
 
mce=option  [X86-64] See 
Documentation/arch/x86/x86_64/boot-options.rst
diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
index 64f536b80380..372c320c5043 100644
--- a/include/asm-generic/codetag.lds.h
+++ b/include/asm-generic/codetag.lds.h
@@ -11,4 +11,23 @@
 #define CODETAG_SECTIONS() \
SECTION_WITH_BOUNDARIES(alloc_tags)
 
+/*
+ * Module codetags which aren't used after module unload, therefore have the
+ * same lifespan as the module and can be safely unloaded with the module.
+ */
+#define MOD_CODETAG_SECTIONS()
+
+#define MOD_SEPARATE_CODETAG_SECTION(_name)    \
+   .codetag.##_name : {                        \
+   SECTION_WITH_BOUNDARIES(_name)              \
+   }
+
+/*
+ * For codetags which might be used after module unload, therefore might stay
+ * longer in memory. Each such codetag type has its own section so that we can
+ * unload them individually once unused.
+ */
+#define MOD_SEPARATE_CODETAG_SECTIONS()    \
+   MOD_SEPARATE_CODETAG_SECTION(alloc_tags)
+
 #endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 8c61ccd161ba..99cbc7f086ad 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,6 +30,13 @@ struct alloc_tag {
struct alloc_tag_counters __percpu  *counters;
 } __aligned(8);
 
+struct alloc_tag_module_section {
+   unsigned long start_addr;
+   unsigned long end_addr;
+   /* used size */
+   unsigned long size;
+};
+
 #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
 
 #define CODETAG_EMPTY  ((void *)1)
@@ -54,6 +61,8 @@ static inline void set_codetag_empty(union codetag_ref *ref) 
{}
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+#define ALLOC_TAG_SECTION_NAME "alloc_tags"
+
 struct codetag_bytes {
struct codetag *ct;
s64 bytes;
@@ -76,7 +85,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag);
 
 #define DEFINE_ALLOC_TAG(_alloc_tag)                               \
    static struct alloc_tag _alloc_tag __used __aligned(8)          \
-   __section("alloc_tags") = {                                     \
+   __section(ALLOC_TAG_SECTION_NAME) = {                           \
    .ct = CODE_TAG_INIT,                                            \
    .counters = &_shared_alloc_tag };
 
@@ -85,7 +94,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag);
 #define DEFINE_ALLOC_TAG(_alloc_tag)                                   \
    static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);  \
    static struct alloc_tag _alloc_tag __used __aligned(8)              \
-   __section("alloc_tags") = {

[PATCH 2/5] alloc_tag: eliminate alloc_tag_ref_set

2024-08-19 Thread Suren Baghdasaryan

To simplify further refactoring, open-code the only two callers
of alloc_tag_ref_set().

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h   | 25 ++---
 include/linux/pgalloc_tag.h | 12 +++-
 2 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 99cbc7f086ad..21e3098220e3 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -143,36 +143,15 @@ static inline void alloc_tag_add_check(union codetag_ref 
*ref, struct alloc_tag
 static inline void alloc_tag_sub_check(union codetag_ref *ref) {}
 #endif
 
-/* Caller should verify both ref and tag to be valid */
-static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct 
alloc_tag *tag)
-{
-   ref->ct = &tag->ct;
-   /*
-* We need in increment the call counter every time we have a new
-* allocation or when we split a large allocation into smaller ones.
-* Each new reference for every sub-allocation needs to increment call
-* counter because when we free each part the counter will be 
decremented.
-*/
-   this_cpu_inc(tag->counters->calls);
-}
-
-static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag 
*tag)
-{
-   alloc_tag_add_check(ref, tag);
-   if (!ref || !tag)
-   return;
-
-   __alloc_tag_ref_set(ref, tag);
-}
-
 static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag 
*tag, size_t bytes)
 {
alloc_tag_add_check(ref, tag);
if (!ref || !tag)
return;
 
-   __alloc_tag_ref_set(ref, tag);
+   ref->ct = &tag->ct;
this_cpu_add(tag->counters->bytes, bytes);
+   this_cpu_inc(tag->counters->calls);
 }
 
 static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 207f0c83c8e9..244a328dff62 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -103,7 +103,17 @@ static inline void pgalloc_tag_split(struct page *page, 
unsigned int nr)
page_ext = page_ext_next(page_ext);
for (i = 1; i < nr; i++) {
/* Set new reference to point to the original tag */
-   alloc_tag_ref_set(codetag_ref_from_page_ext(page_ext), tag);
+   ref = codetag_ref_from_page_ext(page_ext);
+   alloc_tag_add_check(ref, tag);
+   if (ref) {
+   ref->ct = &tag->ct;
+   /*
+    * We need to increment the call counter every time we split a
+    * large allocation into smaller ones because when we free each
+    * part the counter will be decremented.
+*/
+   this_cpu_inc(tag->counters->calls);
+   }
page_ext = page_ext_next(page_ext);
}
 out:
-- 
2.46.0.184.g6999bdac58-goog

[PATCH 3/5] alloc_tag: introduce pgalloc_tag_ref to abstract page tag references

2024-08-19 Thread Suren Baghdasaryan

To simplify later changes to page tag references, introduce new
pgalloc_tag_ref and pgtag_ref_handle types. This allows easy
replacement of page_ext as the storage for page allocation tags.
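
The get/modify/update pattern this patch introduces can be sketched in user space as follows. This is only an illustration under stated assumptions: struct demo_ref stands in for union codetag_ref, the demo_storage array stands in for whatever backs the reference (page_ext at this point in the series), and all demo_* names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for union codetag_ref. */
struct demo_ref {
	const void *ct;
};

/* Hypothetical backing store for the per-page references. */
static struct demo_ref demo_storage[4];

/* Opaque handle to wherever the reference actually lives. */
typedef struct demo_ref *demo_handle;

/* Copy the stored reference into *ref and return a handle, or NULL. */
static demo_handle demo_get_ref(size_t idx, struct demo_ref *ref)
{
	if (idx >= sizeof(demo_storage) / sizeof(demo_storage[0]))
		return NULL;

	ref->ct = demo_storage[idx].ct;
	return &demo_storage[idx];
}

/* Write a locally modified reference back through the handle. */
static void demo_update_ref(demo_handle handle, const struct demo_ref *ref)
{
	if (!handle || !ref)
		return;

	handle->ct = ref->ct;
}
```

Callers only ever touch a local copy and write it back explicitly, so the stored form is free to differ from the in-memory form. That separation is what lets later patches replace the pointer with an index or with bits in the page flags.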

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/pgalloc_tag.h | 144 +++-
 lib/alloc_tag.c |   3 +-
 2 files changed, 95 insertions(+), 52 deletions(-)

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 244a328dff62..c76b629d0206 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -9,48 +9,76 @@
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+typedef union codetag_ref  pgalloc_tag_ref;
+
+static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+   ref->ct = pgref->ct;
+}
+
+static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+   pgref->ct = ref->ct;
+}
 #include 
 
 extern struct page_ext_operations page_alloc_tagging_ops;
 
-static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+static inline pgalloc_tag_ref *pgref_from_page_ext(struct page_ext *page_ext)
 {
-   return (union codetag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops);
+   return (pgalloc_tag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops);
 }
 
-static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+static inline struct page_ext *page_ext_from_pgref(pgalloc_tag_ref *pgref)
 {
-   return (void *)ref - page_alloc_tagging_ops.offset;
+   return (void *)pgref - page_alloc_tagging_ops.offset;
 }
 
+typedef pgalloc_tag_ref    *pgtag_ref_handle;
+
 /* Should be called only if mem_alloc_profiling_enabled() */
-static inline union codetag_ref *get_page_tag_ref(struct page *page)
+static inline pgtag_ref_handle get_page_tag_ref(struct page *page, union codetag_ref *ref)
 {
if (page) {
struct page_ext *page_ext = page_ext_get(page);
 
-   if (page_ext)
-   return codetag_ref_from_page_ext(page_ext);
+   if (page_ext) {
+   pgalloc_tag_ref *pgref = pgref_from_page_ext(page_ext);
+
+   read_pgref(pgref, ref);
+   return pgref;
+   }
}
return NULL;
 }
 
-static inline void put_page_tag_ref(union codetag_ref *ref)
+static inline void put_page_tag_ref(pgtag_ref_handle pgref)
 {
-   if (WARN_ON(!ref))
+   if (WARN_ON(!pgref))
return;
 
-   page_ext_put(page_ext_from_codetag_ref(ref));
+   page_ext_put(page_ext_from_pgref(pgref));
+}
+
+static inline void update_page_tag_ref(pgtag_ref_handle pgref, union codetag_ref *ref)
+{
+   if (WARN_ON(!pgref || !ref))
+   return;
+
+   write_pgref(pgref, ref);
 }
 
 static inline void clear_page_tag_ref(struct page *page)
 {
if (mem_alloc_profiling_enabled()) {
-   union codetag_ref *ref = get_page_tag_ref(page);
-
-   if (ref) {
-   set_codetag_empty(ref);
-   put_page_tag_ref(ref);
+   pgtag_ref_handle handle;
+   union codetag_ref ref;
+
+   handle = get_page_tag_ref(page, &ref);
+   if (handle) {
+   set_codetag_empty(&ref);
+   update_page_tag_ref(handle, &ref);
+   put_page_tag_ref(handle);
}
}
 }
@@ -59,11 +87,14 @@ static inline void pgalloc_tag_add(struct page *page, 
struct task_struct *task,
   unsigned int nr)
 {
if (mem_alloc_profiling_enabled()) {
-   union codetag_ref *ref = get_page_tag_ref(page);
-
-   if (ref) {
-   alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE * nr);
-   put_page_tag_ref(ref);
+   pgtag_ref_handle handle;
+   union codetag_ref ref;
+
+   handle = get_page_tag_ref(page, &ref);
+   if (handle) {
+   alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr);
+   update_page_tag_ref(handle, &ref);
+   put_page_tag_ref(handle);
}
}
 }
@@ -71,53 +102,58 @@ static inline void pgalloc_tag_add(struct page *page, 
struct task_struct *task,
 static inline void pgalloc_tag_sub(struct page *page, unsigned int nr)
 {
if (mem_alloc_profiling_enabled()) {
-   union codetag_ref *ref = get_page_tag_ref(page);
-
-   if (ref) {
-   alloc_tag_sub(ref, PAGE_SIZE * nr);
-   put_page_tag_ref(ref);
+   pgtag_ref_handle handle;
+   union codetag_ref ref;
+
+   handle = get_page_tag_ref(page, &ref);
+   if (handle) {
+   alloc_tag_sub(&ref, PA

[PATCH 4/5] alloc_tag: make page allocation tag reference size configurable

2024-08-19 Thread Suren Baghdasaryan

Introduce CONFIG_PGALLOC_TAG_REF_BITS to control the size of the
page allocation tag references. When the size is configured to be
less than a direct pointer, the tags are searched using an index
stored as the tag reference.
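
The index encoding described above can be sketched in user space as below. This is a sketch under assumptions, not the patch's code: struct demo_tag stands in for struct alloc_tag, the two small arrays stand in for the kernel and module tag sections, and the "empty" marker is only reserved here rather than handled. The address-range check goes through uintptr_t so the sketch stays within what portable C defines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct alloc_tag. */
struct demo_tag {
	int unused;
};

/* Reserved index values, mirroring the layout in the patch. */
#define DEMO_ID_NULL	0
#define DEMO_ID_EMPTY	1	/* reserved; not handled in this sketch */
#define DEMO_ID_FIRST	2

#define DEMO_KERNEL_COUNT	3
#define DEMO_MODULE_COUNT	2

/* Stand-ins for the kernel and module tag sections. */
static struct demo_tag demo_kernel_tags[DEMO_KERNEL_COUNT];
static struct demo_tag demo_module_tags[DEMO_MODULE_COUNT];

/* True if tag lies inside the kernel tag array. */
static int demo_is_kernel_tag(const struct demo_tag *tag)
{
	uintptr_t p = (uintptr_t)tag;

	return p >= (uintptr_t)demo_kernel_tags &&
	       p < (uintptr_t)(demo_kernel_tags + DEMO_KERNEL_COUNT);
}

/* Encode a tag pointer as a small index; tag must be NULL or in one array. */
static uint16_t demo_tag_to_idx(const struct demo_tag *tag)
{
	if (!tag)
		return DEMO_ID_NULL;
	if (demo_is_kernel_tag(tag))
		return (uint16_t)(DEMO_ID_FIRST + (tag - demo_kernel_tags));
	return (uint16_t)(DEMO_ID_FIRST + DEMO_KERNEL_COUNT +
			  (tag - demo_module_tags));
}

/* Decode an index back into a tag pointer; reserved values give NULL. */
static struct demo_tag *demo_idx_to_tag(uint16_t idx)
{
	if (idx < DEMO_ID_FIRST)
		return NULL;

	idx -= DEMO_ID_FIRST;
	return idx < DEMO_KERNEL_COUNT ?
		&demo_kernel_tags[idx] :
		&demo_module_tags[idx - DEMO_KERNEL_COUNT];
}
```

Module tags are numbered directly after the kernel tags, which only works because the module tags sit in one contiguous area; that is the reason patch 1 of the series copies them there.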

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h   | 10 +-
 include/linux/codetag.h |  3 ++
 include/linux/pgalloc_tag.h | 69 +
 lib/Kconfig.debug   | 11 ++
 lib/alloc_tag.c | 50 ++-
 lib/codetag.c   |  4 +--
 mm/mm_init.c|  1 +
 7 files changed, 144 insertions(+), 4 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 21e3098220e3..b5cf24517333 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,8 +30,16 @@ struct alloc_tag {
struct alloc_tag_counters __percpu  *counters;
 } __aligned(8);
 
+struct alloc_tag_kernel_section {
+   struct alloc_tag *first_tag;
+   unsigned long count;
+};
+
 struct alloc_tag_module_section {
-   unsigned long start_addr;
+   union {
+   unsigned long start_addr;
+   struct alloc_tag *first_tag;
+   };
unsigned long end_addr;
/* used size */
unsigned long size;
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index c4a3dd60205e..dafc59838d87 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -13,6 +13,9 @@ struct codetag_module;
 struct seq_buf;
 struct module;
 
+#define CODETAG_SECTION_START_PREFIX   "__start_"
+#define CODETAG_SECTION_STOP_PREFIX    "__stop_"
+
 /*
  * An instance of this structure is created in a special ELF section at every
  * code location being tagged.  At runtime, the special section is treated as
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index c76b629d0206..80b8801cb90b 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -9,7 +9,18 @@
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+#if !defined(CONFIG_PGALLOC_TAG_REF_BITS) || CONFIG_PGALLOC_TAG_REF_BITS > 32
+#define PGALLOC_TAG_DIRECT_REF
 typedef union codetag_ref  pgalloc_tag_ref;
+#else /* !defined(CONFIG_PGALLOC_TAG_REF_BITS) || CONFIG_PGALLOC_TAG_REF_BITS > 32 */
+#if CONFIG_PGALLOC_TAG_REF_BITS > 16
+typedef u32    pgalloc_tag_ref;
+#else
+typedef u16    pgalloc_tag_ref;
+#endif
+#endif /* !defined(CONFIG_PGALLOC_TAG_REF_BITS) || CONFIG_PGALLOC_TAG_REF_BITS > 32 */
+
+#ifdef PGALLOC_TAG_DIRECT_REF
 
 static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 {
@@ -20,6 +31,63 @@ static inline void write_pgref(pgalloc_tag_ref *pgref, union 
codetag_ref *ref)
 {
pgref->ct = ref->ct;
 }
+
+static inline void alloc_tag_sec_init(void) {}
+
+#else /* PGALLOC_TAG_DIRECT_REF */
+
+extern struct alloc_tag_kernel_section kernel_tags;
+extern struct alloc_tag_module_section module_tags;
+
+#define CODETAG_ID_NULL    0
+#define CODETAG_ID_EMPTY   1
+#define CODETAG_ID_FIRST   2
+
+static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+   pgalloc_tag_ref idx = *pgref;
+
+   switch (idx) {
+   case (CODETAG_ID_NULL):
+   ref->ct = NULL;
+   break;
+   case (CODETAG_ID_EMPTY):
+   set_codetag_empty(ref);
+   break;
+   default:
+   idx -= CODETAG_ID_FIRST;
+   ref->ct = idx < kernel_tags.count ?
+   &kernel_tags.first_tag[idx].ct :
+   &module_tags.first_tag[idx - kernel_tags.count].ct;
+   }
+}
+
+static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+   struct alloc_tag *tag;
+
+   if (!ref->ct) {
+   *pgref = CODETAG_ID_NULL;
+   return;
+   }
+
+   if (is_codetag_empty(ref)) {
+   *pgref = CODETAG_ID_EMPTY;
+   return;
+   }
+
+   tag = ct_to_alloc_tag(ref->ct);
+   if (tag >= kernel_tags.first_tag && tag < kernel_tags.first_tag + kernel_tags.count) {
+   *pgref = CODETAG_ID_FIRST + (tag - kernel_tags.first_tag);
+   return;
+   }
+
+   *pgref = CODETAG_ID_FIRST + kernel_tags.count + (tag - module_tags.first_tag);
+}
+
+void __init alloc_tag_sec_init(void);
+
+#endif /* PGALLOC_TAG_DIRECT_REF */
 #include 
 
 extern struct page_ext_operations page_alloc_tagging_ops;
@@ -197,6 +265,7 @@ static inline void pgalloc_tag_sub(struct page *page, 
unsigned int nr) {}
 static inline void pgalloc_tag_split(struct page *page, unsigned int nr) {}
 static inline struct alloc_tag *pgalloc_tag_get(struct page *page) { return NULL; }
 static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {}
+static inline void alloc_tag_sec_init(void) {}
 
 #endif /* CONFIG_MEM_ALLOC_PROFILING */
 
diff --g

[PATCH 5/5] alloc_tag: config to store page allocation tag refs in page flags

2024-08-19 Thread Suren Baghdasaryan

Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
references directly in the page flags. This removes dependency on
page_ext and results in better performance for page allocations as
well as reduced page_ext memory overhead.
CONFIG_PGALLOC_TAG_REF_BITS controls the number of bits required
to be available in the page flags to store the references. If the
number of page flag bits is insufficient, the build will fail and
either CONFIG_PGALLOC_TAG_REF_BITS would have to be lowered or
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS should be disabled.
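
Packing a narrow reference into a few bits of a shared flags word generally looks like the sketch below. Treat it as an assumption-laden illustration: the quoted hunk is cut off before the update logic, so this does not claim to reproduce it. The sketch uses C11 atomics in place of kernel primitives, and the 16-bit width, the shift of 40, and all demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical field geometry inside a 64-bit flags word. */
#define DEMO_REF_WIDTH	16
#define DEMO_REF_SHIFT	40
#define DEMO_REF_MASK	((UINT64_C(1) << DEMO_REF_WIDTH) - 1)

/* Extract the reference field from the flags word. */
static uint64_t demo_get_ref(_Atomic uint64_t *flags)
{
	return (atomic_load(flags) >> DEMO_REF_SHIFT) & DEMO_REF_MASK;
}

/*
 * Replace only the reference field. The compare-and-exchange loop
 * retries if another thread changed unrelated flag bits in between,
 * so those bits are never overwritten with stale values.
 */
static void demo_set_ref(_Atomic uint64_t *flags, uint64_t ref)
{
	uint64_t old = atomic_load(flags);
	uint64_t next;

	do {
		next = old & ~(DEMO_REF_MASK << DEMO_REF_SHIFT);
		next |= (ref & DEMO_REF_MASK) << DEMO_REF_SHIFT;
	} while (!atomic_compare_exchange_weak(flags, &old, next));
}
```

The field width is what competes with the other users of the flags word, which is why the build-time check described above fails when the configured number of reference bits does not fit.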

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mmzone.h|  3 ++
 include/linux/page-flags-layout.h | 10 +--
 include/linux/pgalloc_tag.h   | 48 +++
 lib/Kconfig.debug | 27 +++--
 lib/alloc_tag.c   |  4 +++
 mm/page_ext.c |  2 +-
 6 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..0dd2b42f7cb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1085,6 +1085,7 @@ static inline bool zone_is_empty(struct zone *zone)
 #define KASAN_TAG_PGOFF        (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
 #define LRU_GEN_PGOFF          (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
 #define LRU_REFS_PGOFF         (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
+#define ALLOC_TAG_REF_PGOFF    (LRU_REFS_PGOFF - ALLOC_TAG_REF_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -1096,6 +1097,7 @@ static inline bool zone_is_empty(struct zone *zone)
 #define ZONES_PGSHIFT          (ZONES_PGOFF * (ZONES_WIDTH != 0))
 #define LAST_CPUPID_PGSHIFT    (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 #define KASAN_TAG_PGSHIFT      (KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0))
+#define ALLOC_TAG_REF_PGSHIFT  (ALLOC_TAG_REF_PGOFF * (ALLOC_TAG_REF_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -1116,6 +1118,7 @@ static inline bool zone_is_empty(struct zone *zone)
 #define LAST_CPUPID_MASK       ((1UL << LAST_CPUPID_SHIFT) - 1)
 #define KASAN_TAG_MASK         ((1UL << KASAN_TAG_WIDTH) - 1)
 #define ZONEID_MASK            ((1UL << ZONEID_SHIFT) - 1)
+#define ALLOC_TAG_REF_MASK     ((1UL << ALLOC_TAG_REF_WIDTH) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 7d79818dc065..21bba7c8c965 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -5,6 +5,12 @@
 #include 
 #include 
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+#define ALLOC_TAG_REF_WIDTH    CONFIG_PGALLOC_TAG_REF_BITS
+#else
+#define ALLOC_TAG_REF_WIDTH    0
+#endif
+
 /*
  * When a memory allocation must conform to specific limitations (such
  * as being suitable for DMA) the caller will pass in hints to the
@@ -91,7 +97,7 @@
 #endif
 
 #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
-   KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+	KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -102,7 +108,7 @@
 #endif
 
 #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
-   KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
+	KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 80b8801cb90b..da95c09bcdf1 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -88,6 +88,52 @@ static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 void __init alloc_tag_sec_init(void);
 
 #endif /* PGALLOC_TAG_DIRECT_REF */
+
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
+typedef struct page	*pgtag_ref_handle;
+
+/* Should be called only if mem_alloc_profiling_enabled() */
+static inline pgtag_ref_handle get_page_tag_ref(struct page *page,
+						union codetag_ref *ref)
+{
+	if (page) {
+		pgalloc_tag_ref pgref;
+
+		pgref = (page->flags >> ALLOC_TAG_REF_PGSHIFT) & ALLOC_TAG_REF_MASK;
+		read_pgref(&pgref, ref);
+		return page;
+	}
+
+	return NULL;
+}
+
+static inline void put_page_tag_ref(pgtag_ref_handle page)
+{
+	WARN_ON(!page);
+}
+
+static inline void update_page_tag_ref(pgtag_ref_handle page, union codetag_ref *ref)
+{
+	unsigned long old_flags, flags, val;
+	pgalloc_tag_ref pgref;
+
+	if (WARN_ON(!page || !ref))
+		return;
+
+	write_pgref(&pgref, ref);
+

Re: [PATCH 5/5] alloc_tag: config to store page allocation tag refs in page flags

2024-08-19 Thread Suren Baghdasaryan

On Mon, Aug 19, 2024 at 12:34 PM Matthew Wilcox  wrote:
>
> On Mon, Aug 19, 2024 at 08:15:11AM -0700, Suren Baghdasaryan wrote:
> > @@ -91,7 +97,7 @@
> >  #endif
> >
> >  #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
> > - KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> > + KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
> >  #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
> >  #else
> >  #define LAST_CPUPID_WIDTH 0
>
> So if ALLOC_TAG_REF_WIDTH is big enough, it's going to force last_cpupid
> into struct page.

Thanks for taking a look!
Yes, but how is this field different from say KASAN_TAG_WIDTH which
can also force last_cpupid out of page flags?

>  That will misalign struct page and disable HVO -- with no warning!

mminit_verify_pageflags_layout already has a mminit_dprintk() to
indicate this condition. Is that not enough?

>

Re: [PATCH 5/5] alloc_tag: config to store page allocation tag refs in page flags

2024-08-19 Thread Suren Baghdasaryan

On Mon, Aug 19, 2024 at 6:46 PM Andrew Morton  wrote:
>
> On Mon, 19 Aug 2024 21:40:34 +0100 Matthew Wilcox  wrote:
>
> > On Mon, Aug 19, 2024 at 01:39:16PM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Aug 19, 2024 at 12:34 PM Matthew Wilcox wrote:
> > > > So if ALLOC_TAG_REF_WIDTH is big enough, it's going to force last_cpupid
> > > > into struct page.
> > >
> > > Thanks for taking a look!
> > > Yes, but how is this field different from say KASAN_TAG_WIDTH which
> > > can also force last_cpupid out of page flags?
> >
> > Because KASAN isn't for production use?
> >
> > > >  That will misalign struct page and disable HVO -- with no warning!
> > >
> > > mminit_verify_pageflags_layout already has a mminit_dprintk() to
> > > indicate this condition. Is that not enough?
> >
> > Fair.
>
> Is a BUILD_BUG_ON() feasible here?

We could, but I didn't think we should prevent people from having such
a configuration if that's what they need...
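
For readers following the exchange above, the two preprocessor checks in page-flags-layout.h can be modelled in ordinary C. This is only a sketch: the struct, the function names and every width used with it are hypothetical, and the real values come from the kernel configuration. It shows why a wider ALLOC_TAG_REF_WIDTH can silently push last_cpupid out of page->flags (first check) before it ever trips the "Not enough bits in page flags" error (second check).

```c
#include <stdbool.h>

/* Hypothetical field widths for one configuration; real values come from Kconfig. */
struct flags_layout {
	unsigned int bits_per_long;
	unsigned int nr_pageflags;
	unsigned int zones, lru_gen, sections, nodes, kasan_tag, alloc_tag_ref;
	unsigned int last_cpupid_shift;
};

/* Sum of every field except last_cpupid. */
unsigned int fixed_width(const struct flags_layout *l)
{
	return l->zones + l->lru_gen + l->sections + l->nodes +
	       l->kasan_tag + l->alloc_tag_ref;
}

/* First #if: last_cpupid keeps its bits only if the whole sum fits. */
unsigned int last_cpupid_width(const struct flags_layout *l)
{
	unsigned int budget = l->bits_per_long - l->nr_pageflags;

	return fixed_width(l) + l->last_cpupid_shift <= budget ?
	       l->last_cpupid_shift : 0;
}

/* Second #if: true corresponds to the "Not enough bits in page flags" #error. */
bool layout_overflows(const struct flags_layout *l)
{
	unsigned int budget = l->bits_per_long - l->nr_pageflags;

	return fixed_width(l) + last_cpupid_width(l) > budget;
}
```

With a made-up budget of 40 bits, 15 bits of other fields and a 20-bit last_cpupid, a 16-bit tag reference makes last_cpupid_width() return 0 while layout_overflows() stays false, which is the "no build failure" case discussed in this thread.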

Re: [PATCH 1/5] alloc_tag: load module tags into separate continuous memory

2024-08-20 Thread Suren Baghdasaryan

On Tue, Aug 20, 2024 at 12:13 AM Mike Rapoport  wrote:
>
> On Mon, Aug 19, 2024 at 08:15:07AM -0700, Suren Baghdasaryan wrote:
> > When a module gets unloaded there is a possibility that some of the
> > allocations it made are still used and therefore the allocation tags
> > corresponding to these allocations are still referenced. As such, the
> > memory for these tags can't be freed. This is currently handled as an
> > abnormal situation and module's data section is not being unloaded.
> > To handle this situation without keeping module's data in memory,
> > allow codetags with longer lifespan than the module to be loaded into
> > their own separate memory. The in-use memory areas and gaps after
> > module unloading in this separate memory are tracked using maple trees.
> > Allocation tags arrange their separate memory so that it is virtually
> > contiguous and that will allow simple allocation tag indexing later on
> > in this patchset. The size of this virtually contiguous memory is set
> > to store up to 10 allocation tags and max_module_alloc_tags kernel
> > parameter is introduced to change this size.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  .../admin-guide/kernel-parameters.txt |   4 +
> >  include/asm-generic/codetag.lds.h |  19 ++
> >  include/linux/alloc_tag.h |  13 +-
> >  include/linux/codetag.h   |  35 ++-
> >  kernel/module/main.c  |  67 +++--
> >  lib/alloc_tag.c   | 245 --
> >  lib/codetag.c | 101 +++-
> >  scripts/module.lds.S  |   5 +-
> >  8 files changed, 429 insertions(+), 60 deletions(-)
>
> ...
>
> > diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> > index c2a579ccd455..c4a3dd60205e 100644
> > --- a/include/linux/codetag.h
> > +++ b/include/linux/codetag.h
> > @@ -35,8 +35,13 @@ struct codetag_type_desc {
> >   size_t tag_size;
> >   void (*module_load)(struct codetag_type *cttype,
> >   struct codetag_module *cmod);
> > - bool (*module_unload)(struct codetag_type *cttype,
> > + void (*module_unload)(struct codetag_type *cttype,
> > struct codetag_module *cmod);
> > + void (*module_replaced)(struct module *mod, struct module *new_mod);
> > + bool (*needs_section_mem)(struct module *mod, unsigned long size);
> > + void *(*alloc_section_mem)(struct module *mod, unsigned long size,
> > +unsigned int prepend, unsigned long align);
> > + void (*free_section_mem)(struct module *mod, bool unused);
> >  };
> >
> >  struct codetag_iterator {
> > @@ -71,11 +76,31 @@ struct codetag_type *
> >  codetag_register_type(const struct codetag_type_desc *desc);
> >
> >  #if defined(CONFIG_CODE_TAGGING) && defined(CONFIG_MODULES)
> > +
> > +bool codetag_needs_module_section(struct module *mod, const char *name,
> > +   unsigned long size);
> > +void *codetag_alloc_module_section(struct module *mod, const char *name,
> > +unsigned long size, unsigned int prepend,
> > +unsigned long align);
> > +void codetag_free_module_sections(struct module *mod);
> > +void codetag_module_replaced(struct module *mod, struct module *new_mod);
> >  void codetag_load_module(struct module *mod);
> > -bool codetag_unload_module(struct module *mod);
> > -#else
> > +void codetag_unload_module(struct module *mod);
> > +
> > +#else /* defined(CONFIG_CODE_TAGGING) && defined(CONFIG_MODULES) */
> > +
> > +static inline bool
> > +codetag_needs_module_section(struct module *mod, const char *name,
> > +  unsigned long size) { return false; }
> > +static inline void *
> > +codetag_alloc_module_section(struct module *mod, const char *name,
> > +  unsigned long size, unsigned int prepend,
> > +  unsigned long align) { return NULL; }
> > +static inline void codetag_free_module_sections(struct module *mod) {}
> > +static inline void codetag_module_replaced(struct module *mod, struct module *new_mod) {}
> >  static inline void codetag_load_module(struct module *mod) {}
> > -static inline bool codetag_unload_module(struct module *mod) { return true; }
> > -#endif
> > +static inline void codetag_unload_module(struct module *m

Re: [PATCH 1/5] alloc_tag: load module tags into separate continuous memory

2024-08-20 Thread Suren Baghdasaryan

On Tue, Aug 20, 2024 at 8:31 AM 'Liam R. Howlett' via kernel-team
 wrote:
>
> * Suren Baghdasaryan  [240819 11:15]:
> > When a module gets unloaded there is a possibility that some of the
> > allocations it made are still used and therefore the allocation tags
> > corresponding to these allocations are still referenced. As such, the
> > memory for these tags can't be freed. This is currently handled as an
> > abnormal situation and module's data section is not being unloaded.
> > To handle this situation without keeping module's data in memory,
> > allow codetags with longer lifespan than the module to be loaded into
> > their own separate memory. The in-use memory areas and gaps after
> > module unloading in this separate memory are tracked using maple trees.
> > Allocation tags arrange their separate memory so that it is virtually
> > contiguous and that will allow simple allocation tag indexing later on
> > in this patchset. The size of this virtually contiguous memory is set
> > to store up to 10 allocation tags and max_module_alloc_tags kernel
> > parameter is introduced to change this size.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  .../admin-guide/kernel-parameters.txt |   4 +
> >  include/asm-generic/codetag.lds.h |  19 ++
> >  include/linux/alloc_tag.h |  13 +-
> >  include/linux/codetag.h   |  35 ++-
> >  kernel/module/main.c  |  67 +++--
> >  lib/alloc_tag.c   | 245 --
> >  lib/codetag.c | 101 +++-
> >  scripts/module.lds.S  |   5 +-
> >  8 files changed, 429 insertions(+), 60 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index d0d141d50638..17f9f811a9c0 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -3205,6 +3205,10 @@
> >   devices can be requested on-demand with the
> >   /dev/loop-control interface.
> >
> > +
> > + max_module_alloc_tags=  [KNL] Max supported number of allocation tags
> > + in modules.
> > +
> >   mce [X86-32] Machine Check Exception
> >
> >   mce=option  [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
> > diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
> > index 64f536b80380..372c320c5043 100644
> > --- a/include/asm-generic/codetag.lds.h
> > +++ b/include/asm-generic/codetag.lds.h
> > @@ -11,4 +11,23 @@
> >  #define CODETAG_SECTIONS()   \
> >   SECTION_WITH_BOUNDARIES(alloc_tags)
> >
> > +/*
> > + * Module codetags which aren't used after module unload, therefore have the
> > + * same lifespan as the module and can be safely unloaded with the module.
> > + */
> > +#define MOD_CODETAG_SECTIONS()
> > +
> > +#define MOD_SEPARATE_CODETAG_SECTION(_name)  \
> > + .codetag.##_name : {\
> > + SECTION_WITH_BOUNDARIES(_name)  \
> > + }
> > +
> > +/*
> > + * For codetags which might be used after module unload, therefore might stay
> > + * longer in memory. Each such codetag type has its own section so that we can
> > + * unload them individually once unused.
> > + */
> > +#define MOD_SEPARATE_CODETAG_SECTIONS()  \
> > + MOD_SEPARATE_CODETAG_SECTION(alloc_tags)
> > +
> >  #endif /* __ASM_GENERIC_CODETAG_LDS_H */
> > diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> > index 8c61ccd161ba..99cbc7f086ad 100644
> > --- a/include/linux/alloc_tag.h
> > +++ b/include/linux/alloc_tag.h
> > @@ -30,6 +30,13 @@ struct alloc_tag {
> >   struct alloc_tag_counters __percpu  *counters;
> >  } __aligned(8);
> >
> > +struct alloc_tag_module_section {
> > + unsigned long start_addr;
> > + unsigned long end_addr;
> > + /* used size */
> > + unsigned long size;
> > +};
> > +
> >  #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> >
> >  #define CODETAG_EMPTY((void *)1)
> > @@ -54,6 +61,8 @@ static inline void set_codetag_empty(union codetag_ref *ref) {}
> >
> >  #ifdef CONFIG_MEM_ALLOC_PROFILING
> >

[PATCH v2 6/6] alloc_tag: config to store page allocation tag refs in page flags

2024-09-01 Thread Suren Baghdasaryan

Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
references directly in the page flags. This removes dependency on
page_ext and results in better performance for page allocations as
well as reduced page_ext memory overhead.
CONFIG_PGALLOC_TAG_REF_BITS controls the number of bits required
to be available in the page flags to store the references. If the
number of page flag bits is insufficient, the build will fail and
either CONFIG_PGALLOC_TAG_REF_BITS would have to be lowered or
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS should be disabled.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mmzone.h|  3 ++
 include/linux/page-flags-layout.h | 10 +--
 include/linux/pgalloc_tag.h   | 48 +++
 lib/Kconfig.debug | 27 +++--
 lib/alloc_tag.c   |  4 +++
 mm/page_ext.c |  2 +-
 6 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 17506e4a2835..0dd2b42f7cb6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1085,6 +1085,7 @@ static inline bool zone_is_empty(struct zone *zone)
 #define KASAN_TAG_PGOFF	(LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
 #define LRU_GEN_PGOFF	(KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
 #define LRU_REFS_PGOFF	(LRU_GEN_PGOFF - LRU_REFS_WIDTH)
+#define ALLOC_TAG_REF_PGOFF	(LRU_REFS_PGOFF - ALLOC_TAG_REF_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -1096,6 +1097,7 @@ static inline bool zone_is_empty(struct zone *zone)
 #define ZONES_PGSHIFT  (ZONES_PGOFF * (ZONES_WIDTH != 0))
 #define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 #define KASAN_TAG_PGSHIFT	(KASAN_TAG_PGOFF * (KASAN_TAG_WIDTH != 0))
+#define ALLOC_TAG_REF_PGSHIFT	(ALLOC_TAG_REF_PGOFF * (ALLOC_TAG_REF_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -1116,6 +1118,7 @@ static inline bool zone_is_empty(struct zone *zone)
 #define LAST_CPUPID_MASK   ((1UL << LAST_CPUPID_SHIFT) - 1)
 #define KASAN_TAG_MASK ((1UL << KASAN_TAG_WIDTH) - 1)
 #define ZONEID_MASK	((1UL << ZONEID_SHIFT) - 1)
+#define ALLOC_TAG_REF_MASK ((1UL << ALLOC_TAG_REF_WIDTH) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 7d79818dc065..21bba7c8c965 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -5,6 +5,12 @@
 #include 
 #include 
 
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+#define ALLOC_TAG_REF_WIDTH	CONFIG_PGALLOC_TAG_REF_BITS
+#else
+#define ALLOC_TAG_REF_WIDTH	0
+#endif
+
 /*
  * When a memory allocation must conform to specific limitations (such
  * as being suitable for DMA) the caller will pass in hints to the
@@ -91,7 +97,7 @@
 #endif
 
 #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
-   KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+	KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
 #else
 #define LAST_CPUPID_WIDTH 0
@@ -102,7 +108,7 @@
 #endif
 
 #if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
-   KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
+	KASAN_TAG_WIDTH + ALLOC_TAG_REF_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
 #error "Not enough bits in page flags"
 #endif
 
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index a7f8f00c118f..dcb6706dee15 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -118,6 +118,52 @@ static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 void __init alloc_tag_sec_init(void);
 
 #endif /* PGALLOC_TAG_DIRECT_REF */
+
+#ifdef CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
+
+typedef struct page	*pgtag_ref_handle;
+
+/* Should be called only if mem_alloc_profiling_enabled() */
+static inline pgtag_ref_handle get_page_tag_ref(struct page *page,
+						union codetag_ref *ref)
+{
+	if (page) {
+		pgalloc_tag_ref pgref;
+
+		pgref = (page->flags >> ALLOC_TAG_REF_PGSHIFT) & ALLOC_TAG_REF_MASK;
+		read_pgref(&pgref, ref);
+		return page;
+	}
+
+	return NULL;
+}
+
+static inline void put_page_tag_ref(pgtag_ref_handle page)
+{
+	WARN_ON(!page);
+}
+
+static inline void update_page_tag_ref(pgtag_ref_handle page, union codetag_ref *ref)
+{
+	unsigned long old_flags, flags, val;
+	pgalloc_tag_ref pgref;
+
+	if (WARN_ON(!page || !ref))
+		return;
+
+	write_pgref(&pgref, ref);
+
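
The diff is cut off at this point in the archive, so the body of update_page_tag_ref() is not shown. As a generic illustration only (not the patch's code), one common way to replace a bitfield inside a word that other writers may be changing at the same time is a compare-and-swap retry loop. The userspace sketch below uses C11 atomics; REF_WIDTH, REF_SHIFT and both function names are assumptions made up for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: a 16-bit reference stored at bit 20 of a 64-bit flags word. */
#define REF_WIDTH 16
#define REF_SHIFT 20
#define REF_MASK  ((UINT64_C(1) << REF_WIDTH) - 1)

/* Extract the reference field with the same shift-and-mask shape as the read above. */
uint64_t load_ref(_Atomic uint64_t *flags)
{
	return (atomic_load(flags) >> REF_SHIFT) & REF_MASK;
}

/* Replace only the reference field; retry if another writer changed the word. */
void store_ref(_Atomic uint64_t *flags, uint64_t ref)
{
	uint64_t old = atomic_load(flags);
	uint64_t updated;

	do {
		updated = (old & ~(REF_MASK << REF_SHIFT)) |
			  ((ref & REF_MASK) << REF_SHIFT);
	} while (!atomic_compare_exchange_weak(flags, &old, updated));
}
```

The retry loop matters because the other bits of the word may be modified concurrently; a plain read-modify-write could overwrite those updates.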

[PATCH v2 1/6] maple_tree: add mas_for_each_rev() helper

2024-09-01 Thread Suren Baghdasaryan

Add mas_for_each_rev() function to iterate maple tree nodes in reverse
order.

Suggested-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/maple_tree.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 8e1504a81cd2..45e633806da2 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -592,6 +592,20 @@ static __always_inline void mas_reset(struct ma_state *mas)
 #define mas_for_each(__mas, __entry, __max) \
while (((__entry) = mas_find((__mas), (__max))) != NULL)
 
+/**
+ * mas_for_each_rev() - Iterate over a range of the maple tree in reverse order.
+ * @__mas: Maple Tree operation state (maple_state)
+ * @__entry: Entry retrieved from the tree
+ * @__min: minimum index to retrieve from the tree
+ *
+ * When returned, mas->index and mas->last will hold the entire range for the
+ * entry.
+ *
+ * Note: may return the zero entry.
+ */
+#define mas_for_each_rev(__mas, __entry, __min) \
+   while (((__entry) = mas_find_rev((__mas), (__min))) != NULL)
+
 #ifdef CONFIG_DEBUG_MAPLE_TREE
 enum mt_dump_format {
mt_dump_dec,
-- 
2.46.0.469.g59c65b2a67-goog
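
The macro above wraps a "find previous" call in a while loop, mirroring mas_for_each(). The same shape can be tried outside the kernel with a sorted array standing in for the tree. Everything in this sketch (struct range, struct cursor, cursor_find_rev(), cursor_for_each_rev()) is a hypothetical analogue written for illustration, not maple tree API.

```c
#include <stddef.h>

/* One stored range [first, last] and its entry. */
struct range {
	unsigned long first, last;
	const char *entry;
};

/* Iteration state: ranges[0..pos) have not been visited yet. */
struct cursor {
	const struct range *ranges;	/* sorted by ascending index */
	size_t pos;
	unsigned long index, last;	/* range of the entry returned last */
};

/* Step backwards; return the previous entry ending at or above min, else NULL. */
const char *cursor_find_rev(struct cursor *c, unsigned long min)
{
	if (c->pos == 0)
		return NULL;

	const struct range *r = &c->ranges[--c->pos];

	if (r->last < min) {
		c->pos = 0;	/* sorted input: nothing earlier can qualify */
		return NULL;
	}
	c->index = r->first;
	c->last = r->last;
	return r->entry;
}

/* Same loop shape as mas_for_each_rev(): iterate until the finder returns NULL. */
#define cursor_for_each_rev(c_, entry_, min_) \
	while (((entry_) = cursor_find_rev((c_), (min_))) != NULL)
```

Starting with pos set to the number of ranges, the loop visits entries from the highest index downwards and stops at the first range that lies entirely below the minimum; the cursor's index and last fields then still describe the last entry returned.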

[PATCH v2 2/6] alloc_tag: load module tags into separate continuous memory

2024-09-01 Thread Suren Baghdasaryan

When a module gets unloaded there is a possibility that some of the
allocations it made are still used and therefore the allocation tags
corresponding to these allocations are still referenced. As such, the
memory for these tags can't be freed. This is currently handled as an
abnormal situation and module's data section is not being unloaded.
To handle this situation without keeping module's data in memory,
allow codetags with longer lifespan than the module to be loaded into
their own separate memory. The in-use memory areas and gaps after
module unloading in this separate memory are tracked using maple trees.
Allocation tags arrange their separate memory so that it is virtually
contiguous and that will allow simple allocation tag indexing later on
in this patchset. The size of this virtually contiguous memory is set
to store up to 10 allocation tags and max_module_alloc_tags kernel
parameter is introduced to change this size.

Signed-off-by: Suren Baghdasaryan 
---
 .../admin-guide/kernel-parameters.txt |   4 +
 include/asm-generic/codetag.lds.h |  19 ++
 include/linux/alloc_tag.h |  13 +-
 include/linux/codetag.h   |  37 ++-
 kernel/module/main.c  |  67 +++--
 lib/alloc_tag.c   | 265 --
 lib/codetag.c | 100 ++-
 scripts/module.lds.S  |   5 +-
 8 files changed, 451 insertions(+), 59 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8dd0aefea01e..991b1c9ecf0e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3205,6 +3205,10 @@
devices can be requested on-demand with the
/dev/loop-control interface.
 
+
+   max_module_alloc_tags=  [KNL] Max supported number of allocation tags
+   in modules.
+
mce [X86-32] Machine Check Exception
 
 	mce=option	[X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
diff --git a/include/asm-generic/codetag.lds.h b/include/asm-generic/codetag.lds.h
index 64f536b80380..372c320c5043 100644
--- a/include/asm-generic/codetag.lds.h
+++ b/include/asm-generic/codetag.lds.h
@@ -11,4 +11,23 @@
 #define CODETAG_SECTIONS() \
SECTION_WITH_BOUNDARIES(alloc_tags)
 
+/*
+ * Module codetags which aren't used after module unload, therefore have the
+ * same lifespan as the module and can be safely unloaded with the module.
+ */
+#define MOD_CODETAG_SECTIONS()
+
+#define MOD_SEPARATE_CODETAG_SECTION(_name)\
+   .codetag.##_name : {\
+   SECTION_WITH_BOUNDARIES(_name)  \
+   }
+
+/*
+ * For codetags which might be used after module unload, therefore might stay
+ * longer in memory. Each such codetag type has its own section so that we can
+ * unload them individually once unused.
+ */
+#define MOD_SEPARATE_CODETAG_SECTIONS()\
+   MOD_SEPARATE_CODETAG_SECTION(alloc_tags)
+
 #endif /* __ASM_GENERIC_CODETAG_LDS_H */
diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 8c61ccd161ba..99cbc7f086ad 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,6 +30,13 @@ struct alloc_tag {
struct alloc_tag_counters __percpu  *counters;
 } __aligned(8);
 
+struct alloc_tag_module_section {
+   unsigned long start_addr;
+   unsigned long end_addr;
+   /* used size */
+   unsigned long size;
+};
+
 #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
 
 #define CODETAG_EMPTY  ((void *)1)
@@ -54,6 +61,8 @@ static inline void set_codetag_empty(union codetag_ref *ref) {}
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+#define ALLOC_TAG_SECTION_NAME "alloc_tags"
+
 struct codetag_bytes {
struct codetag *ct;
s64 bytes;
@@ -76,7 +85,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag);
 
 #define DEFINE_ALLOC_TAG(_alloc_tag)						\
 	static struct alloc_tag _alloc_tag __used __aligned(8)			\
-	__section("alloc_tags") = {						\
+	__section(ALLOC_TAG_SECTION_NAME) = {					\
 		.ct = CODE_TAG_INIT,						\
 		.counters = &_shared_alloc_tag };
 
@@ -85,7 +94,7 @@ DECLARE_PER_CPU(struct alloc_tag_counters, _shared_alloc_tag);
 #define DEFINE_ALLOC_TAG(_alloc_tag)						\
 	static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);	\
 	static struct alloc_tag _alloc_tag __used __aligned(8)			\
-   __section("alloc_tags") = {

[PATCH v2 5/6] alloc_tag: make page allocation tag reference size configurable

2024-09-01 Thread Suren Baghdasaryan

Introduce CONFIG_PGALLOC_TAG_REF_BITS to control the size of the
page allocation tag references. When the size is configured to be
less than a direct pointer, the tags are searched using an index
stored as the tag reference.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h   | 10 +++-
 include/linux/codetag.h |  3 ++
 include/linux/pgalloc_tag.h | 99 +
 lib/Kconfig.debug   | 11 +
 lib/alloc_tag.c | 51 ++-
 lib/codetag.c   |  4 +-
 mm/mm_init.c|  1 +
 7 files changed, 175 insertions(+), 4 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 21e3098220e3..b5cf24517333 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -30,8 +30,16 @@ struct alloc_tag {
struct alloc_tag_counters __percpu  *counters;
 } __aligned(8);
 
+struct alloc_tag_kernel_section {
+   struct alloc_tag *first_tag;
+   unsigned long count;
+};
+
 struct alloc_tag_module_section {
-	unsigned long start_addr;
+	union {
+		unsigned long start_addr;
+		struct alloc_tag *first_tag;
+	};
 	unsigned long end_addr;
 	/* used size */
 	unsigned long size;
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index fb4e7adfa746..401fc297eeda 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -13,6 +13,9 @@ struct codetag_module;
 struct seq_buf;
 struct module;
 
+#define CODETAG_SECTION_START_PREFIX	"__start_"
+#define CODETAG_SECTION_STOP_PREFIX	"__stop_"
+
 /*
  * An instance of this structure is created in a special ELF section at every
  * code location being tagged.  At runtime, the special section is treated as
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index c76b629d0206..a7f8f00c118f 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -9,7 +9,18 @@
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+#if !defined(CONFIG_PGALLOC_TAG_REF_BITS) || CONFIG_PGALLOC_TAG_REF_BITS > 32
+#define PGALLOC_TAG_DIRECT_REF
 typedef union codetag_ref  pgalloc_tag_ref;
+#else /* !defined(CONFIG_PGALLOC_TAG_REF_BITS) || CONFIG_PGALLOC_TAG_REF_BITS > 32 */
+#if CONFIG_PGALLOC_TAG_REF_BITS > 16
+typedef u32	pgalloc_tag_ref;
+#else
+typedef u16	pgalloc_tag_ref;
+#endif
+#endif /* !defined(CONFIG_PGALLOC_TAG_REF_BITS) || CONFIG_PGALLOC_TAG_REF_BITS > 32 */
+
+#ifdef PGALLOC_TAG_DIRECT_REF
 
 static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 {
@@ -20,6 +31,93 @@ static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
 {
pgref->ct = ref->ct;
 }
+
+static inline void alloc_tag_sec_init(void) {}
+
+#else /* PGALLOC_TAG_DIRECT_REF */
+
+extern struct alloc_tag_kernel_section kernel_tags;
+
+#define CODETAG_ID_NULL		0
+#define CODETAG_ID_EMPTY	1
+#define CODETAG_ID_FIRST	2
+
+#ifdef CONFIG_MODULES
+
+extern struct alloc_tag_module_section module_tags;
+
+static inline struct codetag *get_module_ct(pgalloc_tag_ref pgref)
+{
+   return &module_tags.first_tag[pgref - kernel_tags.count].ct;
+}
+
+static inline pgalloc_tag_ref get_module_pgref(struct alloc_tag *tag)
+{
+	return CODETAG_ID_FIRST + kernel_tags.count + (tag - module_tags.first_tag);
+}
+
+#else /* CONFIG_MODULES */
+
+static inline struct codetag *get_module_ct(pgalloc_tag_ref pgref)
+{
+   pr_warn("invalid page tag reference %lu\n", (unsigned long)pgref);
+   return NULL;
+}
+
+static inline pgalloc_tag_ref get_module_pgref(struct alloc_tag *tag)
+{
+   pr_warn("invalid page tag 0x%lx\n", (unsigned long)tag);
+   return CODETAG_ID_NULL;
+}
+
+#endif /* CONFIG_MODULES */
+
+static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+	pgalloc_tag_ref pgref_val = *pgref;
+
+	switch (pgref_val) {
+	case (CODETAG_ID_NULL):
+		ref->ct = NULL;
+		break;
+	case (CODETAG_ID_EMPTY):
+		set_codetag_empty(ref);
+		break;
+	default:
+		pgref_val -= CODETAG_ID_FIRST;
+		ref->ct = pgref_val < kernel_tags.count ?
+			&kernel_tags.first_tag[pgref_val].ct :
+			get_module_ct(pgref_val);
+		break;
+	}
+}
+
+static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+	struct alloc_tag *tag;
+
+	if (!ref->ct) {
+		*pgref = CODETAG_ID_NULL;
+		return;
+	}
+
+	if (is_codetag_empty(ref)) {
+		*pgref = CODETAG_ID_EMPTY;
+		return;
+	}
+
+	tag = ct_to_alloc_tag(ref->ct);
+	if (tag >= kernel_tags.first_tag && tag < kernel_tags.first_tag + kernel_tags.count) {
+
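
The diff is truncated here in the archive. The index scheme visible in the part that is shown (0 for a NULL reference, 1 for the "empty" marker, then kernel tags followed by module tags starting at 2) can be exercised in isolation with a small userspace model. The types and names below (demo_tag, demo_table, demo_empty, tag_to_id(), id_to_tag()) are hypothetical stand-ins, and the sketch assumes any tag outside the kernel table belongs to the module table.

```c
#include <stddef.h>
#include <stdint.h>

enum { ID_NULL = 0, ID_EMPTY = 1, ID_FIRST = 2 };

struct demo_tag { const char *name; };
struct demo_table { struct demo_tag *first; size_t count; };

/* Stand-in for the "empty" marker; hypothetical. */
struct demo_tag demo_empty;

/* Address-range test done on integers to keep the comparison well defined. */
static int in_table(const struct demo_table *t, const struct demo_tag *tag)
{
	uintptr_t p = (uintptr_t)tag;

	return p >= (uintptr_t)t->first && p < (uintptr_t)(t->first + t->count);
}

/* Pointer -> compact index: kernel tags first, module tags right after them. */
uint16_t tag_to_id(const struct demo_table *kernel,
		   const struct demo_table *module, const struct demo_tag *tag)
{
	if (!tag)
		return ID_NULL;
	if (tag == &demo_empty)
		return ID_EMPTY;
	if (in_table(kernel, tag))
		return (uint16_t)(ID_FIRST + (tag - kernel->first));
	/* Assumption: anything else lives in the module table. */
	return (uint16_t)(ID_FIRST + kernel->count + (tag - module->first));
}

/* Compact index -> pointer, the inverse of tag_to_id(). */
struct demo_tag *id_to_tag(const struct demo_table *kernel,
			   const struct demo_table *module, uint16_t id)
{
	switch (id) {
	case ID_NULL:
		return NULL;
	case ID_EMPTY:
		return &demo_empty;
	default:
		id -= ID_FIRST;
		return id < kernel->count ? &kernel->first[id]
					  : &module->first[id - kernel->count];
	}
}
```

Because the index is just an offset into two contiguous tables, a 16-bit reference can name up to 65534 tags in this model, which is why the series needs the module tags to sit in one virtually contiguous area.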

[PATCH v2 3/6] alloc_tag: eliminate alloc_tag_ref_set

2024-09-01 Thread Suren Baghdasaryan

To simplify further refactoring, open-code the only two callers
of alloc_tag_ref_set().

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/alloc_tag.h   | 25 ++---
 include/linux/pgalloc_tag.h | 12 +++-
 2 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
index 99cbc7f086ad..21e3098220e3 100644
--- a/include/linux/alloc_tag.h
+++ b/include/linux/alloc_tag.h
@@ -143,36 +143,15 @@ static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag
 static inline void alloc_tag_sub_check(union codetag_ref *ref) {}
 #endif
 
-/* Caller should verify both ref and tag to be valid */
-static inline void __alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
-{
-	ref->ct = &tag->ct;
-	/*
-	 * We need in increment the call counter every time we have a new
-	 * allocation or when we split a large allocation into smaller ones.
-	 * Each new reference for every sub-allocation needs to increment call
-	 * counter because when we free each part the counter will be decremented.
-	 */
-	this_cpu_inc(tag->counters->calls);
-}
-
-static inline void alloc_tag_ref_set(union codetag_ref *ref, struct alloc_tag *tag)
-{
-	alloc_tag_add_check(ref, tag);
-	if (!ref || !tag)
-		return;
-
-	__alloc_tag_ref_set(ref, tag);
-}
-
 static inline void alloc_tag_add(union codetag_ref *ref, struct alloc_tag *tag, size_t bytes)
 {
 	alloc_tag_add_check(ref, tag);
 	if (!ref || !tag)
 		return;
 
-	__alloc_tag_ref_set(ref, tag);
+	ref->ct = &tag->ct;
 	this_cpu_add(tag->counters->bytes, bytes);
+	this_cpu_inc(tag->counters->calls);
 }
 
 static inline void alloc_tag_sub(union codetag_ref *ref, size_t bytes)
diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 207f0c83c8e9..244a328dff62 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -103,7 +103,17 @@ static inline void pgalloc_tag_split(struct page *page, unsigned int nr)
 	page_ext = page_ext_next(page_ext);
 	for (i = 1; i < nr; i++) {
 		/* Set new reference to point to the original tag */
-		alloc_tag_ref_set(codetag_ref_from_page_ext(page_ext), tag);
+		ref = codetag_ref_from_page_ext(page_ext);
+		alloc_tag_add_check(ref, tag);
+		if (ref) {
+			ref->ct = &tag->ct;
+			/*
+			 * We need to increment the call counter every time we split a
+			 * large allocation into smaller ones because when we free each
+			 * part the counter will be decremented.
+			 */
+			this_cpu_inc(tag->counters->calls);
+		}
 		page_ext = page_ext_next(page_ext);
 	}
 out:
-- 
2.46.0.469.g59c65b2a67-goog

[PATCH v2 0/6] page allocation tag compression

2024-09-01 Thread Suren Baghdasaryan

This patchset implements several improvements:
1. Gracefully handles module unloading while allocations made by that
module are still in use;
2. Provides an option to reduce memory overhead from storing page
allocation references by indexing allocation tags;
3. Provides an option to store page allocation tag references in the
page flags, removing the dependency on page extensions and eliminating
the memory overhead from storing page allocation references (~0.2% of
total system memory);
4. Improves page allocation performance when CONFIG_MEM_ALLOC_PROFILING
is enabled by eliminating the page extension lookup. Page allocation
performance overhead is reduced from 14% to 5.5%.

Patch #1 introduces the mas_for_each_rev() helper.

Patch #2 copies module tags into virtually contiguous memory, which
serves two purposes:
- Lets us handle the case where a module is unloaded while there are
still live allocations from that module. Since we are using a copy of
the tags, we can safely unload the module. Space and gaps in this
contiguous memory are managed using a maple tree.
- Enables simple indexing of the tags in the later patches.

The size of the preallocated virtually contiguous memory can be
configured using the max_module_alloc_tags kernel parameter.

Patch #3 is a code cleanup to simplify later changes.

Patch #4 abstracts page allocation tag reference to simplify later
changes.

Patch #5 lets us control page allocation tag reference sizes and
introduces tag indexing.

Patch #6 adds a config to store page allocation tag references inside
page flags if they fit.

Patchset applies to mm-unstable.

Changes since v1 [1]:
- introduced mas_for_each_rev() and use it, per Liam Howlett
- use advanced maple_tree API to minimize lookups, per Liam Howlett
- fixed CONFIG_MODULES=n configuration build, per kernel test robot

[1] https://lore.kernel.org/all/20240819151512.2363698-1-sur...@google.com/

Suren Baghdasaryan (6):
  maple_tree: add mas_for_each_rev() helper
  alloc_tag: load module tags into separate continuous memory
  alloc_tag: eliminate alloc_tag_ref_set
  alloc_tag: introduce pgalloc_tag_ref to abstract page tag references
  alloc_tag: make page allocation tag reference size configurable
  alloc_tag: config to store page allocation tag refs in page flags

 .../admin-guide/kernel-parameters.txt |   4 +
 include/asm-generic/codetag.lds.h |  19 ++
 include/linux/alloc_tag.h |  46 ++-
 include/linux/codetag.h   |  40 ++-
 include/linux/maple_tree.h|  14 +
 include/linux/mmzone.h|   3 +
 include/linux/page-flags-layout.h |  10 +-
 include/linux/pgalloc_tag.h   | 287 +---
 kernel/module/main.c  |  67 ++--
 lib/Kconfig.debug |  36 +-
 lib/alloc_tag.c   | 321 --
 lib/codetag.c | 104 +-
 mm/mm_init.c  |   1 +
 mm/page_ext.c |   2 +-
 scripts/module.lds.S  |   5 +-
 15 files changed, 826 insertions(+), 133 deletions(-)


base-commit: 18d35b7e30d5a217ff1cc976bb819e1aa2873301
-- 
2.46.0.469.g59c65b2a67-goog

[PATCH v2 4/6] alloc_tag: introduce pgalloc_tag_ref to abstract page tag references

2024-09-01 Thread Suren Baghdasaryan

To simplify later changes to page tag references, introduce new
pgalloc_tag_ref and pgtag_ref_handle types. This allows easy
replacement of page_ext as the storage for page allocation tags.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/pgalloc_tag.h | 144 +++-
 lib/alloc_tag.c |   3 +-
 2 files changed, 95 insertions(+), 52 deletions(-)

diff --git a/include/linux/pgalloc_tag.h b/include/linux/pgalloc_tag.h
index 244a328dff62..c76b629d0206 100644
--- a/include/linux/pgalloc_tag.h
+++ b/include/linux/pgalloc_tag.h
@@ -9,48 +9,76 @@
 
 #ifdef CONFIG_MEM_ALLOC_PROFILING
 
+typedef union codetag_ref  pgalloc_tag_ref;
+
+static inline void read_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+   ref->ct = pgref->ct;
+}
+
+static inline void write_pgref(pgalloc_tag_ref *pgref, union codetag_ref *ref)
+{
+   pgref->ct = ref->ct;
+}
 #include 
 
 extern struct page_ext_operations page_alloc_tagging_ops;
 
-static inline union codetag_ref *codetag_ref_from_page_ext(struct page_ext *page_ext)
+static inline pgalloc_tag_ref *pgref_from_page_ext(struct page_ext *page_ext)
 {
-   return (union codetag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops);
+   return (pgalloc_tag_ref *)page_ext_data(page_ext, &page_alloc_tagging_ops);
 }
 
-static inline struct page_ext *page_ext_from_codetag_ref(union codetag_ref *ref)
+static inline struct page_ext *page_ext_from_pgref(pgalloc_tag_ref *pgref)
 {
-   return (void *)ref - page_alloc_tagging_ops.offset;
+   return (void *)pgref - page_alloc_tagging_ops.offset;
 }
 
+typedef pgalloc_tag_ref*pgtag_ref_handle;
+
 /* Should be called only if mem_alloc_profiling_enabled() */
-static inline union codetag_ref *get_page_tag_ref(struct page *page)
+static inline pgtag_ref_handle get_page_tag_ref(struct page *page, union codetag_ref *ref)
 {
if (page) {
struct page_ext *page_ext = page_ext_get(page);
 
-   if (page_ext)
-   return codetag_ref_from_page_ext(page_ext);
+   if (page_ext) {
+   pgalloc_tag_ref *pgref = pgref_from_page_ext(page_ext);
+
+   read_pgref(pgref, ref);
+   return pgref;
+   }
}
return NULL;
 }
 
-static inline void put_page_tag_ref(union codetag_ref *ref)
+static inline void put_page_tag_ref(pgtag_ref_handle pgref)
 {
-   if (WARN_ON(!ref))
+   if (WARN_ON(!pgref))
return;
 
-   page_ext_put(page_ext_from_codetag_ref(ref));
+   page_ext_put(page_ext_from_pgref(pgref));
+}
+
+static inline void update_page_tag_ref(pgtag_ref_handle pgref, union codetag_ref *ref)
+{
+   if (WARN_ON(!pgref || !ref))
+   return;
+
+   write_pgref(pgref, ref);
 }
 
 static inline void clear_page_tag_ref(struct page *page)
 {
if (mem_alloc_profiling_enabled()) {
-   union codetag_ref *ref = get_page_tag_ref(page);
-
-   if (ref) {
-   set_codetag_empty(ref);
-   put_page_tag_ref(ref);
+   pgtag_ref_handle handle;
+   union codetag_ref ref;
+
+   handle = get_page_tag_ref(page, &ref);
+   if (handle) {
+   set_codetag_empty(&ref);
+   update_page_tag_ref(handle, &ref);
+   put_page_tag_ref(handle);
}
}
 }
@@ -59,11 +87,14 @@ static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
   unsigned int nr)
 {
if (mem_alloc_profiling_enabled()) {
-   union codetag_ref *ref = get_page_tag_ref(page);
-
-   if (ref) {
-   alloc_tag_add(ref, task->alloc_tag, PAGE_SIZE * nr);
-   put_page_tag_ref(ref);
+   pgtag_ref_handle handle;
+   union codetag_ref ref;
+
+   handle = get_page_tag_ref(page, &ref);
+   if (handle) {
+   alloc_tag_add(&ref, task->alloc_tag, PAGE_SIZE * nr);
+   update_page_tag_ref(handle, &ref);
+   put_page_tag_ref(handle);
}
}
 }
@@ -71,53 +102,58 @@ static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
 static inline void pgalloc_tag_sub(struct page *page, unsigned int nr)
 {
if (mem_alloc_profiling_enabled()) {
-   union codetag_ref *ref = get_page_tag_ref(page);
-
-   if (ref) {
-   alloc_tag_sub(ref, PAGE_SIZE * nr);
-   put_page_tag_ref(ref);
+   pgtag_ref_handle handle;
+   union codetag_ref ref;
+
+   handle = get_page_tag_ref(page, &ref);
+   if (handle) {
+   alloc_tag_sub(&ref, PA

Re: [PATCH v2 6/6] alloc_tag: config to store page allocation tag refs in page flags

2024-09-03 Thread Suren Baghdasaryan

On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton  wrote:
>
> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan  
> wrote:
>
> > Add CONFIG_PGALLOC_TAG_USE_PAGEFLAGS to store allocation tag
> > references directly in the page flags. This removes dependency on
> > page_ext and results in better performance for page allocations as
> > well as reduced page_ext memory overhead.
> > CONFIG_PGALLOC_TAG_REF_BITS controls the number of bits required
> > to be available in the page flags to store the references. If the
> > number of page flag bits is insufficient, the build will fail and
> > either CONFIG_PGALLOC_TAG_REF_BITS would have to be lowered or
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS should be disabled.
> >
> > ...
> >
> > +config PGALLOC_TAG_USE_PAGEFLAGS
> > + bool "Use pageflags to encode page allocation tag reference"
> > + default n
> > + depends on MEM_ALLOC_PROFILING
> > + help
> > +   When set, page allocation tag references are encoded inside page
> > +   flags, otherwise they are encoded in page extensions.
> > +
> > +   Setting this flag reduces memory and performance overhead of memory
> > +   allocation profiling but also limits how many allocations can be
> > +   tagged. The number of bits is set by PGALLOC_TAG_REF_BITS and
> > +   they must fit in the page flags field.
>
> Again.  Please put yourself in the position of one of the all-minus-two
> people in this world who aren't kernel-memory-profiling-developers.
> How the heck are they to decide whether or not to enable this?  OK, 59%
> of them are likely to say "yes" because reasons.  But then what?  How
> are they to determine whether it was the correct choice for them?  If
> we don't tell them, who will?

Fair point. I think one would want to enable this config unless there
aren't enough unused bits in the page flags to address all page
allocation tags. That last part of determining how many bits we need
is a bit tricky.
If we put aside loadable modules for now, there are 3 cases:

1. The number of unused page flag bits is enough to address all page
allocations.
2. The number of unused page flag bits is enough if we push
last_cpupid out of page flags. In that case we get the warning at
https://elixir.bootlin.com/linux/v6.11-rc6/source/mm/mm_init.c#L124.
3. The number of unused page flag bits is not enough even if we push
last_cpupid out of page flags. In that case we get the "Not enough
bits in page flags" build time error.

So, maybe I should make this option "default y" when
CONFIG_MEM_ALLOC_PROFILING=y and let the user disable it if they hit
case #3 or (case #2 and performance hit is unacceptable)?
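
The three cases above can be sketched as a small classification helper.
This is only an illustration under stated assumptions: the enum, the
function name and all three parameters are hypothetical and do not come
from the patchset, and unused_bits is assumed to count the free page
flag bits while last_cpupid still lives in the page flags.

```c
#include <assert.h>

/* Hypothetical names, used only to illustrate the three cases. */
enum pgflags_fit {
	PGFLAGS_FIT,           /* case 1: fits in the unused bits as-is */
	PGFLAGS_FIT_NO_CPUPID, /* case 2: fits only if last_cpupid is pushed out */
	PGFLAGS_NO_FIT,        /* case 3: does not fit, expect a build error */
};

/*
 * needed_bits:      bits required to index every page allocation tag
 * unused_bits:      free page flag bits with last_cpupid kept in the flags
 * last_cpupid_bits: bits that become free if last_cpupid is moved out
 */
static inline enum pgflags_fit classify_fit(unsigned int needed_bits,
					    unsigned int unused_bits,
					    unsigned int last_cpupid_bits)
{
	/* Guard the addition below against unsigned wrap-around. */
	assert(unused_bits <= 64 && last_cpupid_bits <= 64);

	if (needed_bits <= unused_bits)
		return PGFLAGS_FIT;
	if (needed_bits <= unused_bits + last_cpupid_bits)
		return PGFLAGS_FIT_NO_CPUPID;
	return PGFLAGS_NO_FIT;
}
```

With 13 needed bits, 8 unused bits and 8 last_cpupid bits, this sketch
reports case 2, which is where the mm_init.c warning would be expected.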

For loadable modules, if we hit the limit when loading a module at
runtime, we could issue a warning and disable allocation tagging via
the static key. Another option is to fail to load the module with a
proper warning but that IMO would be less appealing.

>
> >  config PGALLOC_TAG_REF_BITS
> >   int "Number of bits for page allocation tag reference (10-64)"
> >   range 10 64
> > - default "64"
> > + default "16" if PGALLOC_TAG_USE_PAGEFLAGS
> > + default "64" if !PGALLOC_TAG_USE_PAGEFLAGS
> >   depends on MEM_ALLOC_PROFILING
> >   help
> > Number of bits used to encode a page allocation tag reference.
> > @@ -1011,6 +1027,13 @@ config PGALLOC_TAG_REF_BITS
> > Smaller number results in less memory overhead but limits the 
> > number of
> > allocations which can be tagged (including allocations from 
> > modules).
> >
> > +   If PGALLOC_TAG_USE_PAGEFLAGS is set, the number of requested bits 
> > should
> > +   fit inside the page flags.
>
> What does "should fit" mean?  "It is your responsibility to make it
> fit"?  "We think it will fit but we aren't really sure"?

This is the case #3 I described above, the user will get a "Not enough
bits in page flags" build time error. If we stick with this config, I
can clarify that in this description.

>
> > +   If PGALLOC_TAG_USE_PAGEFLAGS is not set, the number of bits used to 
> > store
> > +   a reference is rounded up to the closest basic type. If set higher 
> > than 32,
> > +   a direct pointer to the allocation tag is stored for performance 
> > reasons.
> > +
>
> We shouldn't be offering things like this to our users.  If we cannot decide, 
> how
> can they?

Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
hardest one to set. The user does not k

Re: [PATCH v2 5/6] alloc_tag: make page allocation tag reference size configurable

2024-09-03 Thread Suren Baghdasaryan

On Sun, Sep 1, 2024 at 10:09 PM Andrew Morton  wrote:
>
> On Sun,  1 Sep 2024 21:41:27 -0700 Suren Baghdasaryan  
> wrote:
>
> > Introduce CONFIG_PGALLOC_TAG_REF_BITS to control the size of the
> > page allocation tag references. When the size is configured to be
> > less than a direct pointer, the tags are searched using an index
> > stored as the tag reference.
> >
> > ...
> >
> > +config PGALLOC_TAG_REF_BITS
> > + int "Number of bits for page allocation tag reference (10-64)"
> > + range 10 64
> > + default "64"
> > + depends on MEM_ALLOC_PROFILING
> > + help
> > +   Number of bits used to encode a page allocation tag reference.
> > +
> > +   Smaller number results in less memory overhead but limits the 
> > number of
> > +   allocations which can be tagged (including allocations from 
> > modules).
> > +
>
> In other words, "we have no idea what's best for you, you're on your
> own".
>
> I pity our poor users.
>
> Can we at least tell them what they should look at to determine whether
> whatever random number they chose was helpful or harmful?

At the end of my reply in
https://lore.kernel.org/all/cajucfpgnygx0gw4suhrzmxvh28rgrnfbvfc6wo+f8bd4hdq...@mail.gmail.com/#t
I suggested using all unused page flags. That would simplify things
for the user at the expense of potentially using more memory than we
need. In practice 13 bits should be more than enough to cover all
kernel page allocations with enough headroom for page allocations
coming from loadable modules. I guess using 13 as the default would
cover most cases. In the unlikely case a specific system needs more
tags, the user can increase this value. It can also be set to 64 to
force direct references instead of indexing for better performance.
Would that approach be acceptable?
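
As a rough sketch of how the configured bit count could map to a
storage type, following the Kconfig help text from patch 6/6 (rounded
up to the closest basic type, direct pointer above 32 bits): the enum
and function below are hypothetical names, not the patchset's code.

```c
#include <assert.h>

/* Hypothetical storage kinds for a page allocation tag reference. */
enum pgref_storage {
	PGREF_INDEX_U16, /* index stored in a 16-bit type */
	PGREF_INDEX_U32, /* index stored in a 32-bit type */
	PGREF_POINTER,   /* direct pointer to the allocation tag */
};

/* ref_bits mirrors the 10-64 range of CONFIG_PGALLOC_TAG_REF_BITS. */
static inline enum pgref_storage pick_pgref_storage(unsigned int ref_bits)
{
	assert(ref_bits >= 10 && ref_bits <= 64);

	if (ref_bits <= 16)
		return PGREF_INDEX_U16;
	if (ref_bits <= 32)
		return PGREF_INDEX_U32;
	/* Above 32 bits an index saves nothing, so use a pointer. */
	return PGREF_POINTER;
}
```

Under these assumptions the suggested default of 13 lands in a 16-bit
index, while 64 selects the direct pointer.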

>

Re: [PATCH v2 5/6] alloc_tag: make page allocation tag reference size configurable

2024-09-03 Thread Suren Baghdasaryan

On Tue, Sep 3, 2024 at 6:17 PM Kent Overstreet
 wrote:
>
> On Tue, Sep 03, 2024 at 06:07:28PM GMT, Suren Baghdasaryan wrote:
> > On Sun, Sep 1, 2024 at 10:09 PM Andrew Morton  
> > wrote:
> > >
> > > On Sun,  1 Sep 2024 21:41:27 -0700 Suren Baghdasaryan  
> > > wrote:
> > >
> > > > Introduce CONFIG_PGALLOC_TAG_REF_BITS to control the size of the
> > > > page allocation tag references. When the size is configured to be
> > > > less than a direct pointer, the tags are searched using an index
> > > > stored as the tag reference.
> > > >
> > > > ...
> > > >
> > > > +config PGALLOC_TAG_REF_BITS
> > > > + int "Number of bits for page allocation tag reference (10-64)"
> > > > + range 10 64
> > > > + default "64"
> > > > + depends on MEM_ALLOC_PROFILING
> > > > + help
> > > > +   Number of bits used to encode a page allocation tag reference.
> > > > +
> > > > +   Smaller number results in less memory overhead but limits the 
> > > > number of
> > > > +   allocations which can be tagged (including allocations from 
> > > > modules).
> > > > +
> > >
> > > In other words, "we have no idea what's best for you, you're on your
> > > own".
> > >
> > > I pity our poor users.
> > >
> > > Can we at least tell them what they should look at to determine whether
> > > whatever random number they chose was helpful or harmful?
> >
> > At the end of my reply in
> > https://lore.kernel.org/all/cajucfpgnygx0gw4suhrzmxvh28rgrnfbvfc6wo+f8bd4hdq...@mail.gmail.com/#t
> > I suggested using all unused page flags. That would simplify things
> > for the user at the expense of potentially using more memory than we
> > need.
>
> Why would that use more memory, and how much?

Say our kernel uses 5000 page allocations and there are additional 100
allocations from all the modules we are loading at runtime. They all
can be addressed using 13 bits (8192 addressable tags), so the
contiguous memory we will be preallocating to store these tags is 8192
* sizeof(alloc_tag). sizeof(alloc_tag) is 40 bytes as of today but
might increase in the future if we add more fields there for other
uses (like gfp_flags for example). So, currently this would use 320KB.
If we always use 16 bits we would be preallocating 2.5MB. So, that
would be 2.2MB of wasted memory. Using more than 16 bits (65536
addressable tags) will be impractical anytime soon (current number
IIRC is a bit over 4000).
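
The arithmetic above can be checked with a few lines of C. This is a
sketch only: the 40-byte tag size is the figure quoted in this message
and is hard-coded as an assumption, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizeof(struct alloc_tag), taken from the figure quoted above. */
#define ASSUMED_ALLOC_TAG_SIZE 40u

/* Number of tags addressable with ref_bits index bits. */
static inline uint64_t addressable_tags(unsigned int ref_bits)
{
	assert(ref_bits < 64); /* shifting by 64 or more would be undefined */
	return (uint64_t)1 << ref_bits;
}

/* Bytes preallocated for a tag array indexed with ref_bits bits. */
static inline uint64_t tag_array_bytes(unsigned int ref_bits)
{
	return addressable_tags(ref_bits) * ASSUMED_ALLOC_TAG_SIZE;
}
```

Here tag_array_bytes(13) is 327680 bytes (320KB) and tag_array_bytes(16)
is 2621440 bytes (2.5MB); the difference is 2293760 bytes, roughly
2.2MB.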


>
> > In practice 13 bits should be more than enough to cover all
> > kernel page allocations with enough headroom for page allocations
> > coming from loadable modules. I guess using 13 as the default would
> > cover most cases. In the unlikely case a specific system needs more
> > tags, the user can increase this value. It can also be set to 64 to
> > force direct references instead of indexing for better performance.
> > Would that approach be acceptable?
>
> Any knob that has to be kept track of and adjusted is a real hassle -
> e.g. lockdep has a bunch of knobs that have to be periodically tweaked,
> that's used by _developers_, and they're often wrong.

Yes, I understand, but this config would allow us not to waste these
couple of MBs, provide a way for the user to request direct addressing
of the tags and it also helps us to deal with the case I described in
the last paragraph of my posting at
https://lore.kernel.org/all/cajucfpgnygx0gw4suhrzmxvh28rgrnfbvfc6wo+f8bd4hdq...@mail.gmail.com/#t

Re: [PATCH v2 6/6] alloc_tag: config to store page allocation tag refs in page flags

2024-09-04 Thread Suren Baghdasaryan

On Tue, Sep 3, 2024 at 7:06 PM 'John Hubbard' via kernel-team
 wrote:
>
> On 9/3/24 6:25 PM, John Hubbard wrote:
> > On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
> >> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton  
> >> wrote:
> >>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan  
> >>> wrote:
> > ...
> >>> We shouldn't be offering things like this to our users.  If we cannot 
> >>> decide, how
> >>> can they?
> >>
> >> Thinking about the ease of use, the CONFIG_PGALLOC_TAG_REF_BITS is the
> >> hardest one to set. The user does not know how many page allocations
>
> I should probably clarify my previous reply, so here is the more detailed
> version:
>
> >> are there. I think I can simplify this by trying to use all unused
> >> page flag bits for addressing the tags. Then, after compilation we can
>
> Yes.
>
> >> follow the rules I mentioned before:
> >> - If the available bits are not enough to address all kernel page
> >> allocations, we issue an error. The user should disable
> >> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS.
>
> The configuration should disable itself, in this case. But if that is
> too big of a change for now, I suppose we could fall back to an error
> message to the effect of, "please disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
> because the kernel build system is still too primitive to do that for you". :)

I don't think we can detect this at build time. We need to know how
many page allocations there are, which we find out only after we build
the kernel image (from the section size that holds allocation tags).
Therefore it would have to be a post-build check. So I think the best
we can do is to generate the error like the one you suggested after we
build the image.
Dependency on CONFIG_PAGE_EXTENSION is yet another complexity because
if we auto-disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS, we would have to
also auto-enable CONFIG_PAGE_EXTENSION if it's not already enabled.

I'll dig around some more to see if there is a better way.
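
A post-build check along these lines could be sketched as follows. The
helpers are hypothetical, and the sketch ignores any index values a real
implementation might reserve (for example for empty or invalid
references).

```c
#include <assert.h>
#include <stddef.h>

/* Number of tags, derived from the size of the section that holds them. */
static inline size_t tag_count_from_section(size_t section_bytes,
					    size_t tag_size)
{
	assert(tag_size != 0);
	return section_bytes / tag_size;
}

/* Smallest number of bits able to index nr_tags distinct tags. */
static inline unsigned int bits_to_index(size_t nr_tags)
{
	unsigned int bits = 0;

	/* The width check runs first, so the shift never overflows. */
	while (bits < sizeof(size_t) * 8 && ((size_t)1 << bits) < nr_tags)
		bits++;
	return bits;
}
```

The result of bits_to_index() would then be compared against the number
of free page flag bits, and the build would fail with the suggested
message if it does not fit.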

>
>
> >> - If there are enough unused bits but we have to push last_cpupid out
> >> of page flags, we issue a warning and continue. The user can disable
> >> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
> >> flags.
>
> Let's try to decide now, what that tradeoff should be. Just pick one based
> on what some of us perceive to be the expected usefulness and frequency of
> use between last_cpupid and these tag refs.
>
> If someone really needs to change the tradeoff for that one bit, then that
> someone is also likely able to hack up a change for it.

Yeah, from all the feedback, I realize that by pursuing the maximum
flexibility I made configuring this mechanism close to impossible. I
think the first step towards simplifying this would be to identify
usable configurations. From that POV, I can see 3 useful modes:

1. Page flags are not used. In this mode we will use direct pointer
references and page extensions, like we do today. This mode is used
when we don't have enough page flags. This can be a safe default which
keeps things as they are today and should always work.
2. Page flags are used but not forced. This means we will try to use
all free page flags bits (up to a reasonable limit of 16) without
pushing out last_cpupid.
3. Page flags are forced. This means we will try to use all free page
flags bits after pushing last_cpupid out of page flags. This mode
could be used if the user cares about memory profiling more than the
performance overhead caused by last_cpupid.

I'm not 100% sure (3) is needed, so I think we can skip it until
someone asks for it. It should be easy to add that in the future.
If we detect at build time that we don't have enough page flag bits to
cover kernel allocations for modes (2) or (3), we issue an error
prompting the user to reconfigure to mode (1).

Ideally, I would like to have (2) as default mode and automatically
fall back to (1) when it's impossible but as I mentioned before, I
don't yet see a way to do that automatically.
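
To make the mode (2) check concrete, here is a minimal sketch. All
names are hypothetical, the cap of 16 bits is the "reasonable limit"
mentioned above, and free_flag_bits is assumed to be known at the time
of the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed upper limit on reference bits, per the description of mode (2). */
#define ASSUMED_MAX_REF_BITS 16u

/* Bits mode (2) would use: all free page flag bits, capped at the limit. */
static inline unsigned int usable_ref_bits(unsigned int free_flag_bits)
{
	return free_flag_bits < ASSUMED_MAX_REF_BITS ?
	       free_flag_bits : ASSUMED_MAX_REF_BITS;
}

/* True if mode (2) can index every kernel page allocation tag. */
static inline bool pageflags_mode_viable(size_t nr_kernel_tags,
					 unsigned int free_flag_bits)
{
	size_t capacity;

	/* Free bits cannot exceed the width of the page flags word. */
	assert(free_flag_bits <= 64);
	capacity = (size_t)1 << usable_ref_bits(free_flag_bits);
	return nr_kernel_tags <= capacity;
}
```

When this check fails, the user would be prompted to reconfigure to
mode (1), as described above.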

For loadable modules, I think my earlier suggestion should work fine.
If a module causes us to run out of space for tags, we disable memory
profiling at runtime and log a warning for the user stating that we
disabled memory profiling and if the user needs it they should
configure mode (1). I *think* I can even disable profiling only for
that module and not globally but I need to try that first.

I can start by supporting modes (1) and (2), which requires only
CONFIG_PGALLOC_TAG_USE_PAGEFLAGS, defaulted to N. Any user can try
enabling this config and, if that builds fine, keep it for
better performance and memory usage. Does that sound accept

Re: [PATCH v2 6/6] alloc_tag: config to store page allocation tag refs in page flags

2024-09-04 Thread Suren Baghdasaryan

On Tue, Sep 3, 2024 at 7:19 PM Matthew Wilcox  wrote:
>
> On Tue, Sep 03, 2024 at 06:25:52PM -0700, John Hubbard wrote:
> > The more I read this story, the clearer it becomes that this should be
> > entirely done by the build system: set it, or don't set it, automatically.
> >
> > And if you can make it not even a kconfig item at all, that's probably even
> > better.
> >
> > And if there is no way to set it automatically, then that probably means
> > that the feature is still too raw to unleash upon the world.
>
> I'd suggest that this implementation is just too whack.
>
> What if you use a maple tree for this?  For each allocation range, you
> can store a pointer to a tag instead of storing an index in each folio.

I'm not sure I understand your suggestion, Matthew. We allocate a
folio and need to store a reference to the tag associated with the
code that allocated that folio. We are not operating with ranges here.
Are you suggesting to use a maple tree instead of page_ext to store
this reference?

Re: [PATCH v2 5/6] alloc_tag: make page allocation tag reference size configurable

2024-09-04 Thread Suren Baghdasaryan

On Wed, Sep 4, 2024 at 9:25 AM Kent Overstreet
 wrote:
>
> On Tue, Sep 03, 2024 at 07:04:51PM GMT, Suren Baghdasaryan wrote:
> > On Tue, Sep 3, 2024 at 6:17 PM Kent Overstreet
> >  wrote:
> > >
> > > On Tue, Sep 03, 2024 at 06:07:28PM GMT, Suren Baghdasaryan wrote:
> > > > On Sun, Sep 1, 2024 at 10:09 PM Andrew Morton 
> > > >  wrote:
> > > > >
> > > > > On Sun,  1 Sep 2024 21:41:27 -0700 Suren Baghdasaryan 
> > > > >  wrote:
> > > > >
> > > > > > Introduce CONFIG_PGALLOC_TAG_REF_BITS to control the size of the
> > > > > > page allocation tag references. When the size is configured to be
> > > > > > less than a direct pointer, the tags are searched using an index
> > > > > > stored as the tag reference.
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > +config PGALLOC_TAG_REF_BITS
> > > > > > + int "Number of bits for page allocation tag reference (10-64)"
> > > > > > + range 10 64
> > > > > > + default "64"
> > > > > > + depends on MEM_ALLOC_PROFILING
> > > > > > + help
> > > > > > +   Number of bits used to encode a page allocation tag 
> > > > > > reference.
> > > > > > +
> > > > > > +   Smaller number results in less memory overhead but limits 
> > > > > > the number of
> > > > > > +   allocations which can be tagged (including allocations from 
> > > > > > modules).
> > > > > > +
> > > > >
> > > > > In other words, "we have no idea what's best for you, you're on your
> > > > > own".
> > > > >
> > > > > I pity our poor users.
> > > > >
> > > > > Can we at least tell them what they should look at to determine 
> > > > > whether
> > > > > whatever random number they chose was helpful or harmful?
> > > >
> > > > At the end of my reply in
> > > > https://lore.kernel.org/all/cajucfpgnygx0gw4suhrzmxvh28rgrnfbvfc6wo+f8bd4hdq...@mail.gmail.com/#t
> > > > I suggested using all unused page flags. That would simplify things
> > > > for the user at the expense of potentially using more memory than we
> > > > need.
> > >
> > > Why would that use more memory, and how much?
> >
> > Say our kernel uses 5000 page allocations and there are additional 100
> > allocations from all the modules we are loading at runtime. They all
> > can be addressed using 13 bits (8192 addressable tags), so the
> > contiguous memory we will be preallocating to store these tags is 8192
> > * sizeof(alloc_tag). sizeof(alloc_tag) is 40 bytes as of today but
> > might increase in the future if we add more fields there for other
> > uses (like gfp_flags for example). So, currently this would use 320KB.
> > If we always use 16 bits we would be preallocating 2.5MB. So, that
> > would be 2.2MB of wasted memory. Using more than 16 bits (65536
> > addressable tags) will be impractical anytime soon (current number
> > IIRC is a bit over 4000).
>
> I see, it's not about the page bits, it's about the contiguous array of
> alloc tags?
>
> What if we just reserved address space, and only filled it in as needed?

That might be possible. I'll have to try that. Thanks!

Re: [PATCH v2 6/6] alloc_tag: config to store page allocation tag refs in page flags

2024-09-04 Thread Suren Baghdasaryan

On Wed, Sep 4, 2024 at 9:37 AM Kent Overstreet
 wrote:
>
> On Wed, Sep 04, 2024 at 05:35:49PM GMT, Matthew Wilcox wrote:
> > On Wed, Sep 04, 2024 at 09:18:01AM -0700, Suren Baghdasaryan wrote:
> > > I'm not sure I understand your suggestion, Matthew. We allocate a
> > > folio and need to store a reference to the tag associated with the
> > > code that allocated that folio. We are not operating with ranges here.
> > > Are you suggesting to use a maple tree instead of page_ext to store
> > > this reference?
> >
> > I'm saying that a folio has a physical address.  So you can use a physical
> > address as an index into a maple tree to store additional information
> > instead of using page_ext or trying to hammer the additional information
> > into struct page somewhere.
>
> Ah, thanks, that makes more sense.
>
> But it would add a lot of overhead to the page alloc/free paths...

Yeah, inserting into a maple_tree in the fast path of page allocation
would introduce considerable performance overhead.

Re: [PATCH v2 6/6] alloc_tag: config to store page allocation tag refs in page flags

2024-09-04 Thread Suren Baghdasaryan

On Wed, Sep 4, 2024 at 11:58 AM 'John Hubbard' via kernel-team
 wrote:
>
> On 9/4/24 9:08 AM, Suren Baghdasaryan wrote:
> > On Tue, Sep 3, 2024 at 7:06 PM 'John Hubbard' via kernel-team
> >  wrote:
> >> On 9/3/24 6:25 PM, John Hubbard wrote:
> >>> On 9/3/24 11:19 AM, Suren Baghdasaryan wrote:
> >>>> On Sun, Sep 1, 2024 at 10:16 PM Andrew Morton 
> >>>>  wrote:
> >>>>> On Sun,  1 Sep 2024 21:41:28 -0700 Suren Baghdasaryan 
> >>>>>  wrote:
> ...
> >> The configuration should disable itself, in this case. But if that is
> >> too big of a change for now, I suppose we could fall back to an error
> >> message to the effect of, "please disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS
> >> because the kernel build system is still too primitive to do that for 
> >> you". :)
> >
> > I don't think we can detect this at build time. We need to know how
> > many page allocations there are, which we find out only after we build
> > the kernel image (from the section size that holds allocation tags).
> > Therefore it would have to be a post-build check. So I think the best
> > we can do is to generate the error like the one you suggested after we
> > build the image.
> > Dependency on CONFIG_PAGE_EXTENSION is yet another complexity because
> > if we auto-disable CONFIG_PGALLOC_TAG_USE_PAGEFLAGS, we would have to
> > also auto-enable CONFIG_PAGE_EXTENSION if it's not already enabled.
> >
> > I'll dig around some more to see if there is a better way.
> >>
> >>>> - If there are enough unused bits but we have to push last_cpupid out
> >>>> of page flags, we issue a warning and continue. The user can disable
> >>>> CONFIG_PGALLOC_TAG_USE_PAGEFLAGS if last_cpupid has to stay in page
> >>>> flags.
> >>
> >> Let's try to decide now, what that tradeoff should be. Just pick one based
> >> on what some of us perceive to be the expected usefulness and frequency of
> >> use between last_cpupid and these tag refs.
> >>
> >> If someone really needs to change the tradeoff for that one bit, then that
> >> someone is also likely able to hack up a change for it.
> >
> > Yeah, from all the feedback, I realize that by pursuing the maximum
> > flexibility I made configuring this mechanism close to impossible. I
> > think the first step towards simplifying this would be to identify
> > usable configurations. From that POV, I can see 3 useful modes:
> >
> > 1. Page flags are not used. In this mode we will use direct pointer
> > references and page extensions, like we do today. This mode is used
> > when we don't have enough page flags. This can be a safe default which
> > keeps things as they are today and should always work.
>
> Definitely my favorite so far.
>
> > 2. Page flags are used but not forced. This means we will try to use
> > all free page flags bits (up to a reasonable limit of 16) without
> > pushing out last_cpupid.
>
> This is a logical next step, agreed.
>
> > 3. Page flags are forced. This means we will try to use all free page
> > flags bits after pushing last_cpupid out of page flags. This mode
> > could be used if the user cares about memory profiling more than the
> > performance overhead caused by last_cpupid.
> >
> > I'm not 100% sure (3) is needed, so I think we can skip it until
> > someone asks for it. It should be easy to add that in the future.
>
> Right.
>
> > If we detect at build time that we don't have enough page flag bits to
> > cover kernel allocations for modes (2) or (3), we issue an error
> > prompting the user to reconfigure to mode (1).
> >
> > Ideally, I would like to have (2) as default mode and automatically
> > fall back to (1) when it's impossible but as I mentioned before, I
> > don't yet see a way to do that automatically.
> >
> > For loadable modules, I think my earlier suggestion should work fine.
> > If a module causes us to run out of space for tags, we disable memory
> > profiling at runtime and log a warning for the user stating that we
> > disabled memory profiling and if the user needs it they should
> > configure mode (1). I *think* I can even disable profiling only for
> > that module and not globally but I need to try that first.
> >
> > I can start with modes (1) and (2) support which requires only
> > CONFIG_PGALLOC_TAG_USE_PAGEFLAGS defaulted to N. Any user can try
> > enabling this config and if that builds fine

[PATCH 1/1] psi: do not require setsched permission from the trigger creator

2019-07-29 Thread Suren Baghdasaryan

When a process creates a new trigger by writing into /proc/pressure/*
files, permissions to write such a file should be used to determine whether
the process is allowed to do so or not. Current implementation would also
require such a process to have setsched capability. Setting of psi trigger
thread's scheduling policy is an implementation detail and should not be
exposed to the user level. Remove the permission check by using _nocheck
version of the function.

Suggested-by: Nick Kralevich 
Signed-off-by: Suren Baghdasaryan 
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7acc632c3b82..ed9a1d573cb1 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
mutex_unlock(&group->trigger_lock);
return ERR_CAST(kworker);
}
-   sched_setscheduler(kworker->task, SCHED_FIFO, &param);
+   sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
kthread_init_delayed_work(&group->poll_work,
psi_poll_work);
rcu_assign_pointer(group->poll_kworker, kworker);
-- 
2.22.0.709.g102302147b-goog

Re: [PATCH 1/1] psi: do not require setsched permission from the trigger creator

2019-07-29 Thread Suren Baghdasaryan

On Mon, Jul 29, 2019 at 12:57 PM Greg KH  wrote:
>
> On Mon, Jul 29, 2019 at 12:42:05PM -0700, Suren Baghdasaryan wrote:
> > When a process creates a new trigger by writing into /proc/pressure/*
> > files, permissions to write such a file should be used to determine whether
> > the process is allowed to do so or not. Current implementation would also
> > require such a process to have setsched capability. Setting of psi trigger
> > thread's scheduling policy is an implementation detail and should not be
> > exposed to the user level. Remove the permission check by using _nocheck
> > version of the function.
> >
> > Suggested-by: Nick Kralevich 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  kernel/sched/psi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
>
> $ ./scripts/get_maintainer.pl --file kernel/sched/psi.c
> Ingo Molnar  (maintainer:SCHEDULER)
> Peter Zijlstra  (maintainer:SCHEDULER)
> linux-ker...@vger.kernel.org (open list:SCHEDULER)
>
>
> Nowhere am I listed there, so why did you send this "To:" me?
>

Oh, sorry about that. Both Ingo and Peter are CC'ed directly. Should I
still resend?

> please fix up and resend.
>
> greg k-h

[PATCH 1/1] psi: do not require setsched permission from the trigger creator

2019-07-29 Thread Suren Baghdasaryan

When a process creates a new trigger by writing into /proc/pressure/*
files, permissions to write such a file should be used to determine whether
the process is allowed to do so or not. Current implementation would also
require such a process to have setsched capability. Setting of psi trigger
thread's scheduling policy is an implementation detail and should not be
exposed to the user level. Remove the permission check by using _nocheck
version of the function.

Suggested-by: Nick Kralevich 
Signed-off-by: Suren Baghdasaryan 
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7acc632c3b82..ed9a1d573cb1 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
mutex_unlock(&group->trigger_lock);
return ERR_CAST(kworker);
}
-   sched_setscheduler(kworker->task, SCHED_FIFO, &param);
+   sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
kthread_init_delayed_work(&group->poll_work,
psi_poll_work);
rcu_assign_pointer(group->poll_kworker, kworker);
-- 
2.22.0.709.g102302147b-goog

Re: [PATCH 1/1] psi: do not require setsched permission from the trigger creator

2019-07-30 Thread Suren Baghdasaryan

On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra  wrote:
>
> On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> > When a process creates a new trigger by writing into /proc/pressure/*
> > files, permissions to write such a file should be used to determine whether
> > the process is allowed to do so or not. Current implementation would also
> > require such a process to have setsched capability. Setting of psi trigger
> > thread's scheduling policy is an implementation detail and should not be
> > exposed to the user level. Remove the permission check by using _nocheck
> > version of the function.
> >
> > Suggested-by: Nick Kralevich 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  kernel/sched/psi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > index 7acc632c3b82..ed9a1d573cb1 100644
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> >   mutex_unlock(&group->trigger_lock);
> >   return ERR_CAST(kworker);
> >   }
> > - sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> > + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
>
> ARGGH, wtf is there a FIFO-99!! thread here at all !?

We need psi poll_kworker to be an rt-priority thread so that psi
notifications are delivered to the userspace without delay even when
the CPUs are very congested. Otherwise it's easy to delay psi
notifications by running a simple CPU hogger executing "chrt -f 50 dd
if=/dev/zero of=/dev/null". Because these notifications are
time-critical for reacting to memory shortages we can't allow for such
delays.
Notice that this kworker is created only if userspace creates a psi
trigger. So unless you are using psi triggers you will never see this
kthread created.

> >   kthread_init_delayed_work(&group->poll_work,
> >   psi_poll_work);
> >   rcu_assign_pointer(group->poll_kworker, kworker);
> > --
> > 2.22.0.709.g102302147b-goog
> >
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>

Re: [PATCH 1/1] psi: do not require setsched permission from the trigger creator

2019-08-01 Thread Suren Baghdasaryan

Hi Peter,
Thanks for sharing your thoughts. I understand your point and I tend
to agree with it. I originally designed this using the watchdog as an
example of a critical system health signal, and in the context of
mobile devices memory pressure is critical, but I agree that there are
more important things in life. I checked, and your proposal to change
it to FIFO-1 should still work for our purposes. I will test to make
sure and reply to your patch. A couple of clarifications in-line.

On Thu, Aug 1, 2019 at 2:51 AM Peter Zijlstra  wrote:
>
> On Tue, Jul 30, 2019 at 10:44:51AM -0700, Suren Baghdasaryan wrote:
> > On Tue, Jul 30, 2019 at 1:11 AM Peter Zijlstra  wrote:
> > >
> > > On Mon, Jul 29, 2019 at 06:33:10PM -0700, Suren Baghdasaryan wrote:
> > > > When a process creates a new trigger by writing into /proc/pressure/*
> > > > files, permissions to write such a file should be used to determine whether
> > > > the process is allowed to do so or not. Current implementation would also
> > > > require such a process to have setsched capability. Setting of psi trigger
> > > > thread's scheduling policy is an implementation detail and should not be
> > > > exposed to the user level. Remove the permission check by using _nocheck
> > > > version of the function.
> > > >
> > > > Suggested-by: Nick Kralevich 
> > > > Signed-off-by: Suren Baghdasaryan 
> > > > ---
> > > >  kernel/sched/psi.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > index 7acc632c3b82..ed9a1d573cb1 100644
> > > > --- a/kernel/sched/psi.c
> > > > +++ b/kernel/sched/psi.c
> > > > @@ -1061,7 +1061,7 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> > > >   mutex_unlock(&group->trigger_lock);
> > > >   return ERR_CAST(kworker);
> > > >   }
> > > > - sched_setscheduler(kworker->task, SCHED_FIFO, &param);
> > > > + sched_setscheduler_nocheck(kworker->task, SCHED_FIFO, &param);
> > >
> > > ARGGH, wtf is there a FIFO-99!! thread here at all !?
> >
> > We need psi poll_kworker to be an rt-priority thread so that psi
>
> There is a giant difference between 'needs to be higher than OTHER' and
> FIFO-99.
>
> > notifications are delivered to the userspace without delay even when
> > the CPUs are very congested. Otherwise it's easy to delay psi
> > notifications by running a simple CPU hogger executing "chrt -f 50 dd
> > if=/dev/zero of=/dev/null". Because these notifications are
>
> So what; at that point that's exactly what you're asking for. Using RT
> is for those who know what they're doing.
>
> > time-critical for reacting to memory shortages we can't allow for such
> > delays.
>
> Furthermore, actual RT programs will have pre-allocated and locked any
> memory they rely on. They don't give a crap about your pressure
> nonsense.
>

This signal is used not to protect other RT tasks but to monitor
overall system memory health for the sake of system responsiveness.

> > Notice that this kworker is created only if userspace creates a psi
> > trigger. So unless you are using psi triggers you will never see this
> > kthread created.
>
> By marking it FIFO-99 you're in effect saying that your stupid
> statistics gathering is more important than your life. It will preempt
> the task that's in control of the band-saw emergency brake, it will
> preempt the task that's adjusting the electromagnetic field containing
> this plasma flow.
>
> That's insane.

IMHO an opt-in feature stops being "stupid" as soon as the user opted
in to use it, therefore explicitly indicating interest in it. However
I assume you are using "stupid" here to indicate that it's "less
important" rather than it's "useless".

> I'm going to queue a patch to reduce this to FIFO-1, that will preempt
> regular OTHER tasks but will not perturb (much) actual RT bits.
>

Thanks for posting the fix.


Re: [PATCH 1/1] psi: do not require setsched permission from the trigger creator

2019-08-01 Thread Suren Baghdasaryan

On Thu, Aug 1, 2019 at 2:59 PM Peter Zijlstra  wrote:
>
> On Thu, Aug 01, 2019 at 11:28:30AM -0700, Suren Baghdasaryan wrote:
> > > By marking it FIFO-99 you're in effect saying that your stupid
> > > statistics gathering is more important than your life. It will preempt
> > > the task that's in control of the band-saw emergency brake, it will
> > > preempt the task that's adjusting the electromagnetic field containing
> > > this plasma flow.
> > >
> > > That's insane.
> >
> > IMHO an opt-in feature stops being "stupid" as soon as the user opted
> > in to use it, therefore explicitly indicating interest in it. However
> > I assume you are using "stupid" here to indicate that it's "less
> > important" rather than it's "useless".
>
> Quite; PSI does have its uses. RT just isn't one of them.

Sorry about messing it up in the first place.
If you don't see any issues with my patch replacing
sched_setscheduler() with sched_setscheduler_nocheck(), would you mind
taking it too? I applied it over your patch onto Linus' ToT with no
merge conflicts.
Thanks,
Suren.


[PATCH v5 1/7] psi: introduce state_mask to represent stalled psi states

2019-03-08 Thread Suren Baghdasaryan

The psi monitoring patches will need to determine the same states as
record_times().  To avoid calculating them twice, maintain a state mask
that can be consulted cheaply.  Do this in a separate patch to keep the
churn in the main feature patch at a minimum.

This adds 4-byte state_mask member into psi_group_cpu struct which results
in its first cacheline-aligned part becoming 52 bytes long.  Add explicit
values to enumeration element counters that affect psi_group_cpu struct
size.

Link: http://lkml.kernel.org/r/20190124211518.244221-4-sur...@google.com
Signed-off-by: Suren Baghdasaryan 
Acked-by: Johannes Weiner 
Cc: Dennis Zhou 
Cc: Ingo Molnar 
Cc: Jens Axboe 
Cc: Li Zefan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Stephen Rothwell 
---
 include/linux/psi_types.h |  9 ++---
 kernel/sched/psi.c        | 29 +++--
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 2cf422db5d18..762c6bb16f3c 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -11,7 +11,7 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
-   NR_PSI_TASK_COUNTS,
+   NR_PSI_TASK_COUNTS = 3,
 };
 
 /* Task state bitmasks */
@@ -24,7 +24,7 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
-   NR_PSI_RESOURCES,
+   NR_PSI_RESOURCES = 3,
 };
 
 /*
@@ -41,7 +41,7 @@ enum psi_states {
PSI_CPU_SOME,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
-   NR_PSI_STATES,
+   NR_PSI_STATES = 6,
 };
 
 struct psi_group_cpu {
@@ -53,6 +53,9 @@ struct psi_group_cpu {
/* States of the tasks belonging to this group */
unsigned int tasks[NR_PSI_TASK_COUNTS];
 
+   /* Aggregate pressure state derived from the tasks */
+   u32 state_mask;
+
/* Period time sampling buckets for each state of interest (ns) */
u32 times[NR_PSI_STATES];
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0e97ca9306ef..22c1505ad290 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -213,17 +213,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 {
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
-   unsigned int tasks[NR_PSI_TASK_COUNTS];
u64 now, state_start;
+   enum psi_states s;
unsigned int seq;
-   int s;
+   u32 state_mask;
 
/* Snapshot a coherent view of the CPU state */
do {
seq = read_seqcount_begin(&groupc->seq);
now = cpu_clock(cpu);
memcpy(times, groupc->times, sizeof(groupc->times));
-   memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
+   state_mask = groupc->state_mask;
state_start = groupc->state_start;
} while (read_seqcount_retry(&groupc->seq, seq));
 
@@ -239,7 +239,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 * (u32) and our reported pressure close to what's
 * actually happening.
 */
-   if (test_state(tasks, s))
+   if (state_mask & (1 << s))
times[s] += now - state_start;
 
delta = times[s] - groupc->times_prev[s];
@@ -407,15 +407,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
delta = now - groupc->state_start;
groupc->state_start = now;
 
-   if (test_state(groupc->tasks, PSI_IO_SOME)) {
+   if (groupc->state_mask & (1 << PSI_IO_SOME)) {
groupc->times[PSI_IO_SOME] += delta;
-   if (test_state(groupc->tasks, PSI_IO_FULL))
+   if (groupc->state_mask & (1 << PSI_IO_FULL))
groupc->times[PSI_IO_FULL] += delta;
}
 
-   if (test_state(groupc->tasks, PSI_MEM_SOME)) {
+   if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
groupc->times[PSI_MEM_SOME] += delta;
-   if (test_state(groupc->tasks, PSI_MEM_FULL))
+   if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
else if (memstall_tick) {
u32 sample;
@@ -436,10 +436,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}
 
-   if (test_state(groupc->tasks, PSI_CPU_SOME))
+   if (groupc->state_mask & (1 << PSI_CPU_SOME))
groupc->times[PSI_CPU_SOME] += delta;
 
-   if (test_state(groupc->tasks, PSI_NONIDLE))
+   if (groupc->state_mask & (1 << PSI_NONIDLE))

[PATCH v5 3/7] psi: rename psi fields in preparation for psi trigger addition

2019-03-08 Thread Suren Baghdasaryan

Renaming psi_group structure member fields used for calculating psi totals
and averages for clear distinction between them and trigger-related fields
that will be added next.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/psi_types.h | 14 ++---
 kernel/sched/psi.c        | 41 ---
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 762c6bb16f3c..4d1c1f67be18 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -69,17 +69,17 @@ struct psi_group_cpu {
 };
 
 struct psi_group {
-   /* Protects data updated during an aggregation */
-   struct mutex stat_lock;
+   /* Protects data used by the aggregator */
+   struct mutex avgs_lock;
 
/* Per-cpu task state & time tracking */
struct psi_group_cpu __percpu *pcpu;
 
-   /* Periodic aggregation state */
-   u64 total_prev[NR_PSI_STATES - 1];
-   u64 last_update;
-   u64 next_update;
-   struct delayed_work clock_work;
+   /* Running pressure averages */
+   u64 avg_total[NR_PSI_STATES - 1];
+   u64 avg_last_update;
+   u64 avg_next_update;
+   struct delayed_work avgs_work;
 
/* Total stall times and sampled pressure averages */
u64 total[NR_PSI_STATES - 1];
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 281702de9772..4fb4d9913bc8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -165,7 +165,7 @@ static struct psi_group psi_system = {
.pcpu = &system_group_pcpu,
 };
 
-static void psi_update_work(struct work_struct *work);
+static void psi_avgs_work(struct work_struct *work);
 
 static void group_init(struct psi_group *group)
 {
@@ -173,9 +173,9 @@ static void group_init(struct psi_group *group)
 
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
-   group->next_update = sched_clock() + psi_period;
-   INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
-   mutex_init(&group->stat_lock);
+   group->avg_next_update = sched_clock() + psi_period;
+   INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
+   mutex_init(&group->avgs_lock);
 }
 
 void __init psi_init(void)
@@ -278,7 +278,7 @@ static bool update_stats(struct psi_group *group)
int cpu;
int s;
 
-   mutex_lock(&group->stat_lock);
+   mutex_lock(&group->avgs_lock);
 
/*
 * Collect the per-cpu time buckets and average them into a
@@ -319,7 +319,7 @@ static bool update_stats(struct psi_group *group)
 
/* avgX= */
now = sched_clock();
-   expires = group->next_update;
+   expires = group->avg_next_update;
if (now < expires)
goto out;
if (now - expires >= psi_period)
@@ -332,14 +332,14 @@ static bool update_stats(struct psi_group *group)
 * But the deltas we sample out of the per-cpu buckets above
 * are based on the actual time elapsing between clock ticks.
 */
-   group->next_update = expires + ((1 + missed_periods) * psi_period);
-   period = now - (group->last_update + (missed_periods * psi_period));
-   group->last_update = now;
+   group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
+   period = now - (group->avg_last_update + (missed_periods * psi_period));
+   group->avg_last_update = now;
 
for (s = 0; s < NR_PSI_STATES - 1; s++) {
u32 sample;
 
-   sample = group->total[s] - group->total_prev[s];
+   sample = group->total[s] - group->avg_total[s];
/*
 * Due to the lockless sampling of the time buckets,
 * recorded time deltas can slip into the next period,
@@ -359,22 +359,22 @@ static bool update_stats(struct psi_group *group)
 */
if (sample > period)
sample = period;
-   group->total_prev[s] += sample;
+   group->avg_total[s] += sample;
calc_avgs(group->avg[s], missed_periods, sample, period);
}
 out:
-   mutex_unlock(&group->stat_lock);
+   mutex_unlock(&group->avgs_lock);
return nonidle_total;
 }
 
-static void psi_update_work(struct work_struct *work)
+static void psi_avgs_work(struct work_struct *work)
 {
struct delayed_work *dwork;
struct psi_group *group;
bool nonidle;
 
dwork = to_delayed_work(work);
-   group = container_of(dwork, struct psi_group, clock_work);
+   group = container_of(dwork, struct psi_group, avgs_work);
 
/*
 * If there is task activity, periodically fold the per-cpu
@@ -391,8 +391,9 @@ static void psi_update_work(struct work_struct *work)
u64 now;

[PATCH v5 7/7] psi: introduce psi monitor

2019-03-08 Thread Suren Baghdasaryan

Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
rises above a user-defined threshold within a user-defined time window.

Time window and threshold are both expressed in usecs. Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state. While system
is in the stall state psi signal growth is monitored at a rate of 10 times
per tracking window. Min window size is 500ms, therefore the min monitoring
interval is 50ms. Max window size is 10s with monitoring interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Johannes Weiner 
---
 Documentation/accounting/psi.txt | 107 +++
 include/linux/psi.h              |   8 +
 include/linux/psi_types.h        |  82 -
 kernel/cgroup/cgroup.c           |  71 -
 kernel/sched/psi.c               | 494 ++-
 5 files changed, 742 insertions(+), 20 deletions(-)

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index b8ca28b60215..4fb40fe94828 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -63,6 +63,110 @@ tracked and exported as well, to allow detection of latency spikes
 which wouldn't necessarily make a dent in the time averages, or to
 average trends over custom time frames.
 
+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger, a user has to open the psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used:
+
+<some|full> <stall amount in us> <time window in us>
+
+For example, writing "some 150000 1000000" into /proc/pressure/memory
+would add a 150ms threshold for partial memory stall measured within
+a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
+would add a 50ms threshold for full io stall measured within a 1sec time window.
+
+Triggers can be set on more than one psi metric and more than one trigger
+for the same psi metric can be specified. However for each trigger a separate
+file descriptor is required to be able to poll it separately from others,
+therefore for each trigger a separate open() syscall should be made even
+when opening the same psi interface file.
+
+Monitors activate only when system enters stall state for the monitored
+psi metric and deactivate upon exit from the stall state. While system is
+in the stall state psi signal growth is monitored at a rate of 10 times per
+tracking window.
+
+The kernel accepts window sizes ranging from 500ms to 10s, therefore min
+monitoring update interval is 50ms and max is 1s. Min limit is set to
+prevent overly frequent polling. Max limit is chosen as a high enough number
+after which monitors are most likely not needed and psi averages can be used
+instead.
+
+When activated, psi monitor stays active for at least the duration of one
+tracking window to avoid repeated activations/deactivations when system is
+bouncing in and out of the stall state.
+
+Notifications to the userspace are rate-limited to one per tracking window.
+
+The trigger will de-register when the file descriptor used to define the
+trigger is closed.
+
+Userspace monitor usage example
+===============================
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <poll.h>
+#include <string.h>
+#include <unistd.h>
+
+/*
+ * Monitor memory partial stall with 1s tracking window size
+ * and 150ms threshold.
+ */
+int main() {
+   const char trig[] = "some 150000 1000000";
+   struct pollfd fds;
+   int n;
+
+   fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+   if (fds.fd < 0) {
+   printf("/proc/pressure/memory open error: %s\n",
+   strerror(errno));
+   return 1;
+   }
+   fds.events = POLLPRI;
+
+   if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
+   printf("/proc/pressure/memory write error: %s\n",
+   strerror(errno));
+   return 1;
+   }
+
+   printf("waiting for events...\n");
+

[PATCH v5 2/7] psi: make psi_enable static

2019-03-08 Thread Suren Baghdasaryan

psi_enable is not used outside of psi.c, make it static.

Suggested-by: Andrew Morton 
Signed-off-by: Suren Baghdasaryan 
---
 kernel/sched/psi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 22c1505ad290..281702de9772 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -140,9 +140,9 @@ static int psi_bug __read_mostly;
 DEFINE_STATIC_KEY_FALSE(psi_disabled);
 
 #ifdef CONFIG_PSI_DEFAULT_DISABLED
-bool psi_enable;
+static bool psi_enable;
 #else
-bool psi_enable = true;
+static bool psi_enable = true;
 #endif
 static int __init setup_psi(char *str)
 {
-- 
2.21.0.360.g471c308f928-goog

[PATCH v5 5/7] psi: track changed states

2019-03-08 Thread Suren Baghdasaryan

Introduce changed_states parameter into collect_percpu_times to track
the states changed since the last update.

Signed-off-by: Suren Baghdasaryan 
---
 kernel/sched/psi.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 337a445aefa3..59e4e1f8bc02 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -210,7 +210,8 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
}
 }
 
-static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
+static void get_recent_times(struct psi_group *group, int cpu, u32 *times,
+u32 *pchanged_states)
 {
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
u64 now, state_start;
@@ -218,6 +219,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
unsigned int seq;
u32 state_mask;
 
+   *pchanged_states = 0;
+
/* Snapshot a coherent view of the CPU state */
do {
seq = read_seqcount_begin(&groupc->seq);
@@ -246,6 +249,8 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
groupc->times_prev[s] = times[s];
 
times[s] = delta;
+   if (delta)
+   *pchanged_states |= (1 << s);
}
 }
 
@@ -269,10 +274,11 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
avg[2] = calc_load(avg[2], EXP_300s, pct);
 }
 
-static bool collect_percpu_times(struct psi_group *group)
+static void collect_percpu_times(struct psi_group *group, u32 *pchanged_states)
 {
u64 deltas[NR_PSI_STATES - 1] = { 0, };
unsigned long nonidle_total = 0;
+   u32 changed_states = 0;
int cpu;
int s;
 
@@ -287,8 +293,11 @@ static bool collect_percpu_times(struct psi_group *group)
for_each_possible_cpu(cpu) {
u32 times[NR_PSI_STATES];
u32 nonidle;
+   u32 cpu_changed_states;
 
-   get_recent_times(group, cpu, times);
+   get_recent_times(group, cpu, times,
+   &cpu_changed_states);
+   changed_states |= cpu_changed_states;
 
nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]);
nonidle_total += nonidle;
@@ -313,7 +322,8 @@ static bool collect_percpu_times(struct psi_group *group)
for (s = 0; s < NR_PSI_STATES - 1; s++)
group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
 
-   return nonidle_total;
+   if (pchanged_states)
+   *pchanged_states = changed_states;
 }
 
 static u64 update_averages(struct psi_group *group, u64 now)
@@ -373,6 +383,7 @@ static void psi_avgs_work(struct work_struct *work)
 {
struct delayed_work *dwork;
struct psi_group *group;
+   u32 changed_states;
bool nonidle;
u64 now;
 
@@ -383,7 +394,8 @@ static void psi_avgs_work(struct work_struct *work)
 
now = sched_clock();
 
-   nonidle = collect_percpu_times(group);
+   collect_percpu_times(group, &changed_states);
+   nonidle = changed_states & (1 << PSI_NONIDLE);
/*
 * If there is task activity, periodically fold the per-cpu
 * times and feed samples into the running averages. If things
@@ -718,7 +730,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 
/* Update averages before reporting them */
mutex_lock(&group->avgs_lock);
-   collect_percpu_times(group);
+   collect_percpu_times(group, NULL);
update_averages(group, sched_clock());
mutex_unlock(&group->avgs_lock);
 
-- 
2.21.0.360.g471c308f928-goog

[PATCH v5 4/7] psi: split update_stats into parts

2019-03-08 Thread Suren Baghdasaryan

Split update_stats into collect_percpu_times and update_averages for
collect_percpu_times to be reused later inside psi monitor.

Signed-off-by: Suren Baghdasaryan 
---
 kernel/sched/psi.c | 55 +++---
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 4fb4d9913bc8..337a445aefa3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -269,17 +269,13 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
avg[2] = calc_load(avg[2], EXP_300s, pct);
 }
 
-static bool update_stats(struct psi_group *group)
+static bool collect_percpu_times(struct psi_group *group)
 {
u64 deltas[NR_PSI_STATES - 1] = { 0, };
-   unsigned long missed_periods = 0;
unsigned long nonidle_total = 0;
-   u64 now, expires, period;
int cpu;
int s;
 
-   mutex_lock(&group->avgs_lock);
-
/*
 * Collect the per-cpu time buckets and average them into a
 * single time sample that is normalized to wallclock time.
@@ -317,11 +313,18 @@ static bool update_stats(struct psi_group *group)
for (s = 0; s < NR_PSI_STATES - 1; s++)
group->total[s] += div_u64(deltas[s], max(nonidle_total, 1UL));
 
+   return nonidle_total;
+}
+
+static u64 update_averages(struct psi_group *group, u64 now)
+{
+   unsigned long missed_periods = 0;
+   u64 expires, period;
+   u64 avg_next_update;
+   int s;
+
/* avgX= */
-   now = sched_clock();
expires = group->avg_next_update;
-   if (now < expires)
-   goto out;
if (now - expires >= psi_period)
missed_periods = div_u64(now - expires, psi_period);
 
@@ -332,7 +335,7 @@ static bool update_stats(struct psi_group *group)
 * But the deltas we sample out of the per-cpu buckets above
 * are based on the actual time elapsing between clock ticks.
 */
-   group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
+   avg_next_update = expires + ((1 + missed_periods) * psi_period);
period = now - (group->avg_last_update + (missed_periods * psi_period));
group->avg_last_update = now;
 
@@ -362,9 +365,8 @@ static bool update_stats(struct psi_group *group)
group->avg_total[s] += sample;
calc_avgs(group->avg[s], missed_periods, sample, period);
}
-out:
-   mutex_unlock(&group->avgs_lock);
-   return nonidle_total;
+
+   return avg_next_update;
 }
 
 static void psi_avgs_work(struct work_struct *work)
@@ -372,10 +374,16 @@ static void psi_avgs_work(struct work_struct *work)
struct delayed_work *dwork;
struct psi_group *group;
bool nonidle;
+   u64 now;
 
dwork = to_delayed_work(work);
group = container_of(dwork, struct psi_group, avgs_work);
 
+   mutex_lock(&group->avgs_lock);
+
+   now = sched_clock();
+
+   nonidle = collect_percpu_times(group);
/*
 * If there is task activity, periodically fold the per-cpu
 * times and feed samples into the running averages. If things
@@ -384,18 +392,15 @@ static void psi_avgs_work(struct work_struct *work)
 * go - see calc_avgs() and missed_periods.
 */
 
-   nonidle = update_stats(group);
-
if (nonidle) {
-   unsigned long delay = 0;
-   u64 now;
-
-   now = sched_clock();
-   if (group->avg_next_update > now)
-   delay = nsecs_to_jiffies(
-   group->avg_next_update - now) + 1;
-   schedule_delayed_work(dwork, delay);
+   if (now >= group->avg_next_update)
+   group->avg_next_update = update_averages(group, now);
+
+   schedule_delayed_work(dwork, nsecs_to_jiffies(
+   group->avg_next_update - now) + 1);
}
+
+   mutex_unlock(&group->avgs_lock);
 }
 
 static void record_times(struct psi_group_cpu *groupc, int cpu,
@@ -711,7 +716,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
if (static_branch_likely(&psi_disabled))
return -EOPNOTSUPP;
 
-   update_stats(group);
+   /* Update averages before reporting them */
+   mutex_lock(&group->avgs_lock);
+   collect_percpu_times(group);
+   update_averages(group, sched_clock());
+   mutex_unlock(&group->avgs_lock);
 
for (full = 0; full < 2 - (res == PSI_CPU); full++) {
unsigned long avg[3];
-- 
2.21.0.360.g471c308f928-goog

[PATCH v5 6/7] refactor header includes to allow kthread.h inclusion in psi_types.h

2019-03-08 Thread Suren Baghdasaryan

kthread.h can't be included in psi_types.h because it creates a circular
inclusion with kthread.h eventually including psi_types.h and complaining
on kthread structures not being defined because they are defined further
in the kthread.h. Resolve this by removing psi_types.h inclusion from the
headers included from kthread.h.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/kthread.h | 3 ++-
 include/linux/sched.h   | 1 -
 kernel/kthread.c        | 1 +
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 2c89e60bc752..0f9da966934e 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -4,7 +4,6 @@
 /* Simple interface for creating and stopping kernel threads without mess. */
 #include 
 #include 
-#include 
 
 __printf(4, 5)
 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
@@ -198,6 +197,8 @@ bool kthread_cancel_delayed_work_sync(struct kthread_delayed_work *work);
 
 void kthread_destroy_worker(struct kthread_worker *worker);
 
+struct cgroup_subsys_state;
+
 #ifdef CONFIG_BLK_CGROUP
 void kthread_associate_blkcg(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *kthread_blkcg(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..20b9f03399a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -26,7 +26,6 @@
 #include 
 #include 
 #include 
-#include <linux/psi_types.h>
 #include 
 #include 
 #include 
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5942eeafb9ac..be4e8795561a 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.21.0.360.g471c308f928-goog

[PATCH v5 0/7] psi: pressure stall monitors v5

2019-03-08 Thread Suren Baghdasaryan

This is a respin of:
  https://lwn.net/ml/linux-kernel/20190206023446.177362-1-surenb%40google.com/

Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high-enough frequency right now.

This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to
be notified when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user.
For example, using memory stall monitors in the userspace low memory
killer daemon (lmkd) we can detect mounting pressure and kill less
important processes before the device becomes visibly sluggish. In our
memory stress testing, psi memory monitors produce roughly 10x fewer
false positives than vmpressure signals. The ability to specify
multiple triggers for the same psi metric allows other parts of the
Android framework to monitor the memory state of the device and act
accordingly.

The new interface is straightforward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time. E.g.:

/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 10 100";
fd = open("/proc/pressure/memory", O_RDWR);
write(fd, trigger, sizeof(trigger));
while (poll() >= 0) {
...
}
close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in
order to emit event signals in a timely fashion. Once the stalling
subsides, aggregation reverts back to normal.

The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.
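
A minimal userspace sketch of the open/write/poll sequence described above. The helper names psi_trigger_open() and psi_trigger_wait() are hypothetical; the O_RDWR | O_NONBLOCK flags and the POLLPRI event are taken from the usage example in the documentation patch of the first posting of this series, and the trigger string is passed through unchanged, so its exact format is whatever the running kernel accepts.

```c
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a pressure file and register a trigger on it.
 * Returns the file descriptor that owns the trigger, or -1 on error.
 */
int psi_trigger_open(const char *path, const char *trigger)
{
	int fd = open(path, O_RDWR | O_NONBLOCK);

	if (fd < 0)
		return -1;

	/* Write the description including its terminating NUL. */
	if (write(fd, trigger, strlen(trigger) + 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Wait until the trigger fires, the timeout expires or an error occurs.
 * Returns 1 on a trigger event, 0 on timeout, -1 on error.
 */
int psi_trigger_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n = poll(&pfd, 1, timeout_ms);

	if (n < 0 || (pfd.revents & POLLERR))
		return -1;
	return (n > 0 && (pfd.revents & POLLPRI)) ? 1 : 0;
}
```

A caller would keep the returned descriptor open for as long as it wants the trigger to exist and simply close() it to stop monitoring, as described above.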

Patches 1-6 prepare the psi code for polling support. Patch 7 implements
the adaptive polling logic, the pressure growth detection optimized for
short intervals, and hooks up write() and poll() on the pressure files.

The patches were developed in collaboration with Johannes Weiner.

The patches are based on 5.0-rc8 (Merge tag 'drm-next-2019-03-06').

Suren Baghdasaryan (7):
  psi: introduce state_mask to represent stalled psi states
  psi: make psi_enable static
  psi: rename psi fields in preparation for psi trigger addition
  psi: split update_stats into parts
  psi: track changed states
  refactor header includes to allow kthread.h inclusion in psi_types.h
  psi: introduce psi monitor

 Documentation/accounting/psi.txt | 107 ++
 include/linux/kthread.h          |   3 +-
 include/linux/psi.h              |   8 +
 include/linux/psi_types.h        | 105 +-
 include/linux/sched.h            |   1 -
 kernel/cgroup/cgroup.c           |  71 +++-
 kernel/kthread.c                 |   1 +
 kernel/sched/psi.c               | 613 ---
 8 files changed, 833 insertions(+), 76 deletions(-)

Changes in v5:
- Fixed sparse: error: incompatible types in comparison expression, as per
  Andrew
- Changed psi_enable to static, as per Andrew
- Refactored headers to be able to include kthread.h into psi_types.h
  without creating a circular inclusion, as per Johannes
- Split psi monitor from aggregator, used RT worker for psi monitoring to
  prevent it being starved by other RT threads and memory pressure events
  being delayed or lost, as per Minchan and Android Performance Team
- Fixed blockable memory allocation under rcu_read_lock inside
  psi_trigger_poll by using refcounting, as per Eva Huang and Minchan
- Misc cleanup and improvements, as per Johannes

Notes:
0001-psi-introduce-state_mask-to-represent-stalled-psi-st.patch is unchanged
from the previous version and provided for completeness.

-- 
2.21.0.360.g471c308f928-goog

[PATCH 0/6] psi: pressure stall monitors

2018-12-14 Thread Suren Baghdasaryan

Android is adopting psi to detect and remedy memory pressure that
results in stuttering and decreased responsiveness on mobile devices.

Psi gives us the stall information, but because we're dealing with
latencies in the millisecond range, periodically reading the pressure
files to detect stalls in a timely fashion is not feasible. Psi also
doesn't aggregate its averages at a high-enough frequency right now.

This patch series extends the psi interface such that users can
configure sensitive latency thresholds and use poll() and friends to
be notified when these are breached.

As high-frequency aggregation is costly, it implements an aggregation
method that is optimized for fast, short-interval averaging, and makes
the aggregation frequency adaptive, such that high-frequency updates
only happen while monitored stall events are actively occurring.

With these patches applied, Android can monitor for, and ward off,
mounting memory shortages before they cause problems for the user.
For example, using memory stall monitors in the userspace low memory
killer daemon (lmkd) we can detect mounting pressure and kill less
important processes before the device becomes visibly sluggish. In our
memory stress testing, psi memory monitors produce roughly 10x fewer
false positives than vmpressure signals. The ability to specify
multiple triggers for the same psi metric allows other parts of the
Android framework to monitor the memory state of the device and act
accordingly.

The new interface is straightforward. The user opens one of the
pressure files for writing and writes a trigger description into the
file descriptor that defines the stall state - some or full, and the
maximum stall time over a given window of time. E.g.:

/* Signal when stall time exceeds 100ms of a 1s window */
char trigger[] = "full 10 100";
fd = open("/proc/pressure/memory", O_RDWR);
write(fd, trigger, sizeof(trigger));
while (poll() >= 0) {
...
}
close(fd);

When the monitored stall state is entered, psi adapts its aggregation
frequency according to what the configured time window requires in
order to emit event signals in a timely fashion. Once the stalling
subsides, aggregation reverts back to normal.

The trigger is associated with the open file descriptor. To stop
monitoring, the user only needs to close the file descriptor and the
trigger is discarded.

Patches 1-5 prepare the psi code for polling support. Patch 6
implements the adaptive polling logic, the pressure growth detection
optimized for short intervals, and hooks up write() and poll() on the
pressure files.

The patches were developed in collaboration with Johannes Weiner.

The patches are based on 4.20-rc6.

Johannes Weiner (3):
  fs: kernfs: add poll file operation
  kernel: cgroup: add poll file operation
  psi: eliminate lazy clock mode

Suren Baghdasaryan (3):
  psi: introduce state_mask to represent stalled psi states
  psi: rename psi fields in preparation for psi trigger addition
  psi: introduce psi monitor

 Documentation/accounting/psi.txt | 105 ++
 fs/kernfs/file.c                 |  31 +-
 include/linux/cgroup-defs.h      |   4 +
 include/linux/kernfs.h           |   6 +
 include/linux/psi.h              |  10 +
 include/linux/psi_types.h        |  90 -
 kernel/cgroup/cgroup.c           | 119 ++-
 kernel/sched/psi.c               | 586 +++
 8 files changed, 865 insertions(+), 86 deletions(-)

-- 
2.20.0.405.gbc1bbc6f85-goog

[PATCH 6/6] psi: introduce psi monitor

2018-12-14 Thread Suren Baghdasaryan

Psi monitor aims to provide a low-latency short-term pressure
detection mechanism configurable by users. It allows users to
monitor psi metrics growth and trigger events whenever a metric
rises above a user-defined threshold within a user-defined time window.

Time window is expressed in usecs and threshold can be expressed in
usecs or percentages of the tracking window. Multiple psi resources
with different thresholds and window sizes can be monitored concurrently.

Psi monitors activate when the system enters a stall state for the monitored
psi metric and deactivate upon exit from the stall state. While the system
is in the stall state, psi signal growth is monitored at a rate of 10 times
per tracking window. The minimum window size is 500ms, so the minimum
monitoring interval is 50ms. The maximum window size is 10s, with a
monitoring interval of 1s.
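
The relationship between window size and monitoring interval can be sketched as follows. This is an illustrative calculation only: the function and macro names are hypothetical, and the limits are the 500ms and 10s figures quoted above.

```c
#include <stdint.h>

#define WINDOW_MIN_US		500000ULL	/* 500ms, from the text above */
#define WINDOW_MAX_US		10000000ULL	/* 10s, from the text above */
#define UPDATES_PER_WINDOW	10		/* growth is sampled 10x per window */

/*
 * Returns the monitoring interval in usecs for a tracking window, or 0 if
 * the window size is outside the accepted range.
 */
uint64_t monitor_interval_us(uint64_t window_us)
{
	if (window_us < WINDOW_MIN_US || window_us > WINDOW_MAX_US)
		return 0;
	return window_us / UPDATES_PER_WINDOW;
}
```

With these numbers a 500ms window yields a 50ms interval and a 10s window yields a 1s interval, matching the figures in the paragraph above.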

When activated, the psi monitor stays active for at least the duration of
one tracking window to avoid repeated activations/deactivations when the
psi signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.
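
A minimal sketch of per-window rate limiting, assuming a monotonic timestamp in usecs. struct notifier and notify_allowed() are hypothetical names for illustration and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

struct notifier {
	uint64_t window_us;	/* tracking window length */
	uint64_t last_event_us;	/* time of the last notification sent */
	bool sent_once;		/* false until the first notification */
};

/* Returns true if a notification may be sent at time now_us. */
bool notify_allowed(struct notifier *n, uint64_t now_us)
{
	/* Suppress anything within one window of the previous event. */
	if (n->sent_once && now_us - n->last_event_us < n->window_us)
		return false;

	n->sent_once = true;
	n->last_event_us = now_us;
	return true;
}
```

The effect is that a burst of threshold breaches inside one tracking window wakes the poller once rather than once per sample.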

Signed-off-by: Suren Baghdasaryan 
---
 Documentation/accounting/psi.txt | 105 +++
 include/linux/psi.h              |  10 +
 include/linux/psi_types.h        |  72 +
 kernel/cgroup/cgroup.c           | 107 ++-
 kernel/sched/psi.c               | 510 +--
 5 files changed, 774 insertions(+), 30 deletions(-)

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index b8ca28b60215..b006cc84ad44 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -63,6 +63,108 @@ tracked and exported as well, to allow detection of latency spikes
 which wouldn't necessarily make a dent in the time averages, or to
 average trends over custom time frames.
 
+Monitoring for pressure thresholds
+==================================
+
+Users can register triggers and use poll() to be woken up when resource
+pressure exceeds certain thresholds.
+
+A trigger describes the maximum cumulative stall time over a specific
+time window, e.g. 100ms of total stall time within any 500ms window to
+generate a wakeup event.
+
+To register a trigger, the user has to open the psi interface file under
+/proc/pressure/ representing the resource to be monitored and write the
+desired threshold and time window. The open file descriptor should be
+used to wait for trigger events using select(), poll() or epoll().
+The following format is used:
+
+  
+
+For example writing "some 15% 100" or "some 15 100" into
+/proc/pressure/memory would add 15% (150ms) threshold for partial memory
+stall measured within 1sec time window. Writing "full 5% 100" or
+"full 5 100" into /proc/pressure/io would add 5% (50ms) threshold
+for full io stall measured within 1sec time window.
+
+Triggers can be set on more than one psi metric and more than one trigger
+for the same psi metric can be specified. However for each trigger a separate
+file descriptor is required to be able to poll it separately from others,
+therefore for each trigger a separate open() syscall should be made even
+when opening the same psi interface file.
+
+Monitors activate only when the system enters a stall state for the monitored
+psi metric and deactivate upon exit from the stall state. While system is
+in the stall state psi signal growth is monitored at a rate of 10 times per
+tracking window.
+
+The kernel accepts window sizes ranging from 500ms to 10s, therefore min
+monitoring update interval is 50ms and max is 1s.
+
+When activated, psi monitor stays active for at least the duration of one
+tracking window to avoid repeated activations/deactivations when system is
+bouncing in and out of the stall state.
+
+Notifications to the userspace are rate-limited to one per tracking window.
+
+The trigger will de-register when the file descriptor used to define the
+trigger is closed.
+
+Userspace monitor usage example
+===============================
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Monitor memory partial stall with 1s tracking window size
+ * and 15% (150ms) threshold.
+ */
+int main() {
+   const char trig[] = "some 15% 100";
+   struct pollfd fds;
+   int n;
+
+   fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+   if (fds.fd < 0) {
+   printf("/proc/pressure/memory open error: %s\n",
+   strerror(errno));
+   return 1;
+   }
+   fds.events = POLLPRI;
+
+   if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
+   printf("/proc/pressure/memory write error: %s\n",
+   strerror(errno));
+   return 1;
+   }
+
+   printf("waiting for events...\n");
+   while (1) {
+   n = poll(&fds, 1, -1);
+   if (n <

[PATCH 1/6] fs: kernfs: add poll file operation

2018-12-14 Thread Suren Baghdasaryan

From: Johannes Weiner 

Kernfs has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling
for custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.
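
The override-with-fallback pattern described above can be modelled in plain C as below. This is a simplified userspace sketch, not kernfs code: struct node, struct node_ops, the MASK_* values and always_ready() are hypothetical stand-ins.

```c
#include <stddef.h>

struct node;

struct node_ops {
	/* Optional: NULL means "use the generic implementation". */
	unsigned int (*poll)(struct node *n);
};

struct node {
	const struct node_ops *ops;
	int event;	/* latest event number on the node */
	int seen;	/* event number this opener has already seen */
};

#define MASK_DEFAULT	0x1u
#define MASK_EVENT	0x2u

/* Generic behaviour: report an event when the node changed since last seen. */
unsigned int generic_poll(struct node *n)
{
	return n->event != n->seen ? (MASK_DEFAULT | MASK_EVENT) : MASK_DEFAULT;
}

/* Example of a custom callback that overrides the generic one. */
unsigned int always_ready(struct node *n)
{
	(void)n;
	return MASK_DEFAULT | MASK_EVENT;
}

/* Dispatch: prefer the node's own callback, else fall back to generic. */
unsigned int node_poll(struct node *n)
{
	if (n->ops && n->ops->poll)
		return n->ops->poll(n);
	return generic_poll(n);
}
```

Keeping the generic routine callable on its own is what lets a custom callback delegate to it for files that have no special event source.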

Signed-off-by: Johannes Weiner 
Signed-off-by: Suren Baghdasaryan 
---
 fs/kernfs/file.c   | 31 ---
 include/linux/kernfs.h |  6 ++
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index dbf5bc250bfd..2d8b91f4475d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -832,26 +832,35 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
  * to see if it supports poll (Neither 'poll' nor 'select' return
  * an appropriate error code).  When in doubt, set a suitable timeout value.
  */
+__poll_t kernfs_generic_poll(struct kernfs_open_file *of, poll_table *wait)
+{
+   struct kernfs_node *kn = kernfs_dentry_node(of->file->f_path.dentry);
+   struct kernfs_open_node *on = kn->attr.open;
+
+   poll_wait(of->file, &on->poll, wait);
+
+   if (of->event != atomic_read(&on->event))
+   return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;
+
+   return DEFAULT_POLLMASK;
+}
+
 static __poll_t kernfs_fop_poll(struct file *filp, poll_table *wait)
 {
struct kernfs_open_file *of = kernfs_of(filp);
struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry);
-   struct kernfs_open_node *on = kn->attr.open;
+   __poll_t ret;
 
if (!kernfs_get_active(kn))
-   goto trigger;
+   return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;
 
-   poll_wait(filp, &on->poll, wait);
+   if (kn->attr.ops->poll)
+   ret = kn->attr.ops->poll(of, wait);
+   else
+   ret = kernfs_generic_poll(of, wait);
 
kernfs_put_active(kn);
-
-   if (of->event != atomic_read(&on->event))
-   goto trigger;
-
-   return DEFAULT_POLLMASK;
-
- trigger:
-   return DEFAULT_POLLMASK|EPOLLERR|EPOLLPRI;
+   return ret;
 }
 
 static void kernfs_notify_workfn(struct work_struct *work)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5b36b1287a5a..0cac1207bb00 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -25,6 +25,7 @@ struct seq_file;
 struct vm_area_struct;
 struct super_block;
 struct file_system_type;
+struct poll_table_struct;
 
 struct kernfs_open_node;
 struct kernfs_iattrs;
@@ -261,6 +262,9 @@ struct kernfs_ops {
ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes,
 loff_t off);
 
+   __poll_t (*poll)(struct kernfs_open_file *of,
+struct poll_table_struct *pt);
+
int (*mmap)(struct kernfs_open_file *of, struct vm_area_struct *vma);
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -350,6 +354,8 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
 int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
 const char *new_name, const void *new_ns);
 int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
+__poll_t kernfs_generic_poll(struct kernfs_open_file *of,
+struct poll_table_struct *pt);
 void kernfs_notify(struct kernfs_node *kn);
 
 const void *kernfs_super_ns(struct super_block *sb);
-- 
2.20.0.405.gbc1bbc6f85-goog

[PATCH 4/6] psi: introduce state_mask to represent stalled psi states

2018-12-14 Thread Suren Baghdasaryan

The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.
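
The idea of caching per-state predicates in a bitmask can be sketched in userspace C as follows. The states and the predicate in state_active() are deliberately simplified, hypothetical stand-ins for the psi states and test_state(); only the mask handling mirrors the patch.

```c
#include <stdbool.h>
#include <stdint.h>

enum state { ST_SOME, ST_FULL, ST_NONIDLE, NR_STATES };	/* hypothetical */

struct counts {
	unsigned int stalled;	/* tasks stalled on the resource */
	unsigned int running;	/* tasks currently running */
};

/* Stand-in for test_state(): evaluate one state from the task counts. */
bool state_active(const struct counts *c, enum state s)
{
	switch (s) {
	case ST_SOME:
		return c->stalled > 0;
	case ST_FULL:
		return c->stalled > 0 && c->running == 0;
	case ST_NONIDLE:
		return c->stalled > 0 || c->running > 0;
	default:
		return false;
	}
}

/* Evaluate every state once and remember the result as one bit each. */
uint32_t compute_state_mask(const struct counts *c)
{
	uint32_t mask = 0;

	for (int s = 0; s < NR_STATES; s++)
		if (state_active(c, (enum state)s))
			mask |= 1u << s;
	return mask;
}
```

Readers then test `mask & (1u << s)` instead of re-deriving the state from the task counts, and a single 32-bit value is cheaper to snapshot than the whole counter array.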

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/psi_types.h |  3 +++
 kernel/sched/psi.c        | 29 +++--
 2 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 2cf422db5d18..2c6e9b67b7eb 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -53,6 +53,9 @@ struct psi_group_cpu {
/* States of the tasks belonging to this group */
unsigned int tasks[NR_PSI_TASK_COUNTS];
 
+   /* Aggregate pressure state derived from the tasks */
+   u32 state_mask;
+
/* Period time sampling buckets for each state of interest (ns) */
u32 times[NR_PSI_STATES];
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d2b9c9a1a62f..153c0624976b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -212,17 +212,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 {
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
-   unsigned int tasks[NR_PSI_TASK_COUNTS];
u64 now, state_start;
+   enum psi_states s;
unsigned int seq;
-   int s;
+   u32 state_mask;
 
/* Snapshot a coherent view of the CPU state */
do {
seq = read_seqcount_begin(&groupc->seq);
now = cpu_clock(cpu);
memcpy(times, groupc->times, sizeof(groupc->times));
-   memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
+   state_mask = groupc->state_mask;
state_start = groupc->state_start;
} while (read_seqcount_retry(&groupc->seq, seq));
 
@@ -238,7 +238,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 * (u32) and our reported pressure close to what's
 * actually happening.
 */
-   if (test_state(tasks, s))
+   if (state_mask & (1 << s))
times[s] += now - state_start;
 
delta = times[s] - groupc->times_prev[s];
@@ -390,15 +390,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
delta = now - groupc->state_start;
groupc->state_start = now;
 
-   if (test_state(groupc->tasks, PSI_IO_SOME)) {
+   if (groupc->state_mask & (1 << PSI_IO_SOME)) {
groupc->times[PSI_IO_SOME] += delta;
-   if (test_state(groupc->tasks, PSI_IO_FULL))
+   if (groupc->state_mask & (1 << PSI_IO_FULL))
groupc->times[PSI_IO_FULL] += delta;
}
 
-   if (test_state(groupc->tasks, PSI_MEM_SOME)) {
+   if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
groupc->times[PSI_MEM_SOME] += delta;
-   if (test_state(groupc->tasks, PSI_MEM_FULL))
+   if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
else if (memstall_tick) {
u32 sample;
@@ -419,10 +419,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}
 
-   if (test_state(groupc->tasks, PSI_CPU_SOME))
+   if (groupc->state_mask & (1 << PSI_CPU_SOME))
groupc->times[PSI_CPU_SOME] += delta;
 
-   if (test_state(groupc->tasks, PSI_NONIDLE))
+   if (groupc->state_mask & (1 << PSI_NONIDLE))
groupc->times[PSI_NONIDLE] += delta;
 }
 
@@ -431,6 +431,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 {
struct psi_group_cpu *groupc;
unsigned int t, m;
+   enum psi_states s;
+   u32 state_mask = 0;
 
groupc = per_cpu_ptr(group->pcpu, cpu);
 
@@ -463,6 +465,13 @@ static void psi_group_change(struct psi_group *group, int cpu,
if (set & (1 << t))
groupc->tasks[t]++;
 
+   /* Calculate state mask representing active states */
+   for (s = 0; s < NR_PSI_STATES; s++) {
+   if (test_state(groupc->tasks, s))
+   state_mask |= (1 << s);
+   }
+   groupc->state_mask = state_mask;
+
write_seqcount_end(&groupc->seq);
 }
 
-- 
2.20.0.405.gbc1bbc6f85-goog

[PATCH 2/6] kernel: cgroup: add poll file operation

2018-12-14 Thread Suren Baghdasaryan

From: Johannes Weiner 

Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling
for custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Signed-off-by: Johannes Weiner 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/cgroup-defs.h |  4 
 kernel/cgroup/cgroup.c      | 12 
 2 files changed, 16 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5e1694fe035b..6f9ea8601421 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -32,6 +32,7 @@ struct kernfs_node;
 struct kernfs_ops;
 struct kernfs_open_file;
 struct seq_file;
+struct poll_table_struct;
 
 #define MAX_CGROUP_TYPE_NAMELEN 32
 #define MAX_CGROUP_ROOT_NAMELEN 64
@@ -573,6 +574,9 @@ struct cftype {
ssize_t (*write)(struct kernfs_open_file *of,
 char *buf, size_t nbytes, loff_t off);
 
+   __poll_t (*poll)(struct kernfs_open_file *of,
+struct poll_table_struct *pt);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lock_class_key   lockdep_key;
 #endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6aaf5dd5383b..ffcd7483b8ee 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3499,6 +3499,16 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
return ret ?: nbytes;
 }
 
+static __poll_t cgroup_file_poll(struct kernfs_open_file *of, poll_table *pt)
+{
+   struct cftype *cft = of->kn->priv;
+
+   if (cft->poll)
+   return cft->poll(of, pt);
+
+   return kernfs_generic_poll(of, pt);
+}
+
 static void *cgroup_seqfile_start(struct seq_file *seq, loff_t *ppos)
 {
return seq_cft(seq)->seq_start(seq, ppos);
@@ -3537,6 +3547,7 @@ static struct kernfs_ops cgroup_kf_single_ops = {
.open   = cgroup_file_open,
.release= cgroup_file_release,
.write  = cgroup_file_write,
+   .poll   = cgroup_file_poll,
.seq_show   = cgroup_seqfile_show,
 };
 
@@ -3545,6 +3556,7 @@ static struct kernfs_ops cgroup_kf_ops = {
.open   = cgroup_file_open,
.release= cgroup_file_release,
.write  = cgroup_file_write,
+   .poll   = cgroup_file_poll,
.seq_start  = cgroup_seqfile_start,
.seq_next   = cgroup_seqfile_next,
.seq_stop   = cgroup_seqfile_stop,
-- 
2.20.0.405.gbc1bbc6f85-goog

[PATCH 5/6] psi: rename psi fields in preparation for psi trigger addition

2018-12-14 Thread Suren Baghdasaryan

Rename the psi_group structure member fields used for calculating psi
totals and averages to clearly distinguish them from the trigger-related
fields that will be added next.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/psi_types.h | 15 ---
 kernel/sched/psi.c        | 26 ++
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 2c6e9b67b7eb..11b32b3395a2 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -69,20 +69,21 @@ struct psi_group_cpu {
 };
 
 struct psi_group {
-   /* Protects data updated during an aggregation */
-   struct mutex stat_lock;
+   /* Protects data used by the aggregator */
+   struct mutex update_lock;
 
/* Per-cpu task state & time tracking */
struct psi_group_cpu __percpu *pcpu;
 
-   /* Periodic aggregation state */
-   u64 total_prev[NR_PSI_STATES - 1];
-   u64 last_update;
-   u64 next_update;
struct delayed_work clock_work;
 
-   /* Total stall times and sampled pressure averages */
+   /* Total stall times observed */
u64 total[NR_PSI_STATES - 1];
+
+   /* Running pressure averages */
+   u64 avg_total[NR_PSI_STATES - 1];
+   u64 avg_last_update;
+   u64 avg_next_update;
unsigned long avg[NR_PSI_STATES - 1][3];
 };
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 153c0624976b..694edefdd333 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -172,9 +172,9 @@ static void group_init(struct psi_group *group)
 
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
-   group->next_update = sched_clock() + psi_period;
+   group->avg_next_update = sched_clock() + psi_period;
INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
-   mutex_init(&group->stat_lock);
+   mutex_init(&group->update_lock);
 }
 
 void __init psi_init(void)
@@ -268,7 +268,7 @@ static void update_stats(struct psi_group *group)
int cpu;
int s;
 
-   mutex_lock(&group->stat_lock);
+   mutex_lock(&group->update_lock);
 
/*
 * Collect the per-cpu time buckets and average them into a
@@ -309,7 +309,7 @@ static void update_stats(struct psi_group *group)
 
/* avgX= */
now = sched_clock();
-   expires = group->next_update;
+   expires = group->avg_next_update;
if (now < expires)
goto out;
 
@@ -320,14 +320,14 @@ static void update_stats(struct psi_group *group)
 * But the deltas we sample out of the per-cpu buckets above
 * are based on the actual time elapsing between clock ticks.
 */
-   group->next_update = expires + psi_period;
-   period = now - group->last_update;
-   group->last_update = now;
+   group->avg_next_update = expires + psi_period;
+   period = now - group->avg_last_update;
+   group->avg_last_update = now;
 
for (s = 0; s < NR_PSI_STATES - 1; s++) {
u32 sample;
 
-   sample = group->total[s] - group->total_prev[s];
+   sample = group->total[s] - group->avg_total[s];
/*
 * Due to the lockless sampling of the time buckets,
 * recorded time deltas can slip into the next period,
@@ -347,11 +347,11 @@ static void update_stats(struct psi_group *group)
 */
if (sample > period)
sample = period;
-   group->total_prev[s] += sample;
+   group->avg_total[s] += sample;
calc_avgs(group->avg[s], sample, period);
}
 out:
-   mutex_unlock(&group->stat_lock);
+   mutex_unlock(&group->update_lock);
 }
 
 static void psi_update_work(struct work_struct *work)
@@ -375,8 +375,10 @@ static void psi_update_work(struct work_struct *work)
update_stats(group);
 
now = sched_clock();
-   if (group->next_update > now)
-   delay = nsecs_to_jiffies(group->next_update - now) + 1;
+   if (group->avg_next_update > now) {
+   delay = nsecs_to_jiffies(
+   group->avg_next_update - now) + 1;
+   }
schedule_delayed_work(dwork, delay);
 }
 
-- 
2.20.0.405.gbc1bbc6f85-goog

[PATCH 3/6] psi: eliminate lazy clock mode

2018-12-14 Thread Suren Baghdasaryan

From: Johannes Weiner 

psi currently stops its periodic 2s aggregation runs when there has
not been any task activity, and wakes it back up later from the
scheduler when the system returns from the idle state.

The coordination between the aggregation worker and the scheduler is
minimal: the scheduler has to nudge the worker if it's not running,
and the worker will reschedule itself periodically until it detects no
more activity.

The polling patches will complicate this, because they introduce
another aggregation mode for high-frequency polling that also
eventually times out if the worker sees no more activity of interest.
That means the scheduler portion would have to coordinate three state
transitions - idle to regular, regular to polling, idle to polling -
with the worker's timeouts and self-rescheduling. The additional
overhead from this is undesirable in the scheduler hotpath.

Eliminate the idle mode and keep the worker doing 2s update intervals
at all times. This eliminates worker coordination from the scheduler
completely. The polling patches will then add it back to switch
between regular mode and high-frequency polling mode.
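
A rough userspace model of an always-rearming periodic worker, assuming a nanosecond clock. struct clock_state and worker_pass() are hypothetical names; the bookkeeping follows the "next_update = expires + psi_period" logic visible in the diff below, with the aggregation itself left out.

```c
#include <stdint.h>

struct clock_state {
	uint64_t period_ns;	/* nominal update period (2s in the text) */
	uint64_t last_update;	/* when the last aggregation ran */
	uint64_t next_update;	/* when the next one is due */
};

/*
 * Run one worker pass at time 'now' and return how long to sleep before
 * the next pass. The worker always rearms itself; there is no idle path.
 */
uint64_t worker_pass(struct clock_state *cs, uint64_t now)
{
	if (now >= cs->next_update) {
		/*
		 * Advance from the expiry time, not from 'now', so that a
		 * late tick does not shift the whole schedule.
		 */
		cs->next_update += cs->period_ns;
		cs->last_update = now;
	}
	return cs->next_update > now ? cs->next_update - now : 0;
}
```

Because the function returns a delay unconditionally, the scheduler side never has to wake the worker, which is the coordination this patch removes.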

Signed-off-by: Johannes Weiner 
Signed-off-by: Suren Baghdasaryan 
---
 kernel/sched/psi.c | 55 +++---
 1 file changed, 22 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index fe24de3fbc93..d2b9c9a1a62f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -248,18 +248,10 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
}
 }
 
-static void calc_avgs(unsigned long avg[3], int missed_periods,
- u64 time, u64 period)
+static void calc_avgs(unsigned long avg[3], u64 time, u64 period)
 {
unsigned long pct;
 
-   /* Fill in zeroes for periods of no activity */
-   if (missed_periods) {
-   avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods);
-   avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods);
-   avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods);
-   }
-
/* Sample the most recent active period */
pct = div_u64(time * 100, period);
pct *= FIXED_1;
@@ -268,10 +260,9 @@ static void calc_avgs(unsigned long avg[3], int missed_periods,
avg[2] = calc_load(avg[2], EXP_300s, pct);
 }
 
-static bool update_stats(struct psi_group *group)
+static void update_stats(struct psi_group *group)
 {
u64 deltas[NR_PSI_STATES - 1] = { 0, };
-   unsigned long missed_periods = 0;
unsigned long nonidle_total = 0;
u64 now, expires, period;
int cpu;
@@ -321,8 +312,6 @@ static bool update_stats(struct psi_group *group)
expires = group->next_update;
if (now < expires)
goto out;
-   if (now - expires > psi_period)
-   missed_periods = div_u64(now - expires, psi_period);
 
/*
 * The periodic clock tick can get delayed for various
@@ -331,8 +320,8 @@ static bool update_stats(struct psi_group *group)
 * But the deltas we sample out of the per-cpu buckets above
 * are based on the actual time elapsing between clock ticks.
 */
-   group->next_update = expires + ((1 + missed_periods) * psi_period);
-   period = now - (group->last_update + (missed_periods * psi_period));
+   group->next_update = expires + psi_period;
+   period = now - group->last_update;
group->last_update = now;
 
for (s = 0; s < NR_PSI_STATES - 1; s++) {
@@ -359,18 +348,18 @@ static bool update_stats(struct psi_group *group)
if (sample > period)
sample = period;
group->total_prev[s] += sample;
-   calc_avgs(group->avg[s], missed_periods, sample, period);
+   calc_avgs(group->avg[s], sample, period);
}
 out:
mutex_unlock(&group->stat_lock);
-   return nonidle_total;
 }
 
 static void psi_update_work(struct work_struct *work)
 {
struct delayed_work *dwork;
struct psi_group *group;
-   bool nonidle;
+   unsigned long delay = 0;
+   u64 now;
 
dwork = to_delayed_work(work);
group = container_of(dwork, struct psi_group, clock_work);
@@ -383,17 +372,12 @@ static void psi_update_work(struct work_struct *work)
 * go - see calc_avgs() and missed_periods.
 */
 
-   nonidle = update_stats(group);
-
-   if (nonidle) {
-   unsigned long delay = 0;
-   u64 now;
+   update_stats(group);
 
-   now = sched_clock();
-   if (group->next_update > now)
-   delay = nsecs_to_jiffies(group->next_update - now) + 1;
-   schedule_delayed_work(dwork, delay);
-   }
+   now = sched_clock();
+   if (group->next_update > now)

Re: [PATCH 3/6] psi: eliminate lazy clock mode

2018-12-17 Thread Suren Baghdasaryan

On Mon, Dec 17, 2018 at 6:58 AM Peter Zijlstra  wrote:
>
> On Fri, Dec 14, 2018 at 09:15:05AM -0800, Suren Baghdasaryan wrote:
> > Eliminate the idle mode and keep the worker doing 2s update intervals
> > at all times.
>
> That sounds like a bad deal.. esp. so for battery powered devices like
> say Android.
>
> In general the push has been to always idle everything, see NOHZ and
> NOHZ_FULL and all the work that's being put into getting rid of any and
> all periodic work.

Thanks for the feedback Peter! The removal of idle mode is unfortunate
but so far we could not find an elegant solution to handle 3 states
(IDLE / REGULAR / POLLING) without additional synchronization inside
the hotpath. The issue, as I remember it, was that while scheduling a
regular update inside psi_group_change() (IDLE to REGULAR transition)
we might override an earlier update being scheduled inside
psi_update_work(). I think we can solve that by using
mod_delayed_work_on() inside psi_update_work() but I might be missing
some other race. I'll discuss this again with Johannes and see if we
can synchronize all states using only atomic operations on clock_mode.
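
To make the last idea concrete, here is a sketch of a lock-free mode switch using a C11 compare-and-swap. It only illustrates the primitive: the enum values and mode_transition() are hypothetical, and whether such transitions are sufficient to avoid every race in psi is exactly the open question above.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum clock_mode { MODE_IDLE, MODE_REGULAR, MODE_POLLING };

/*
 * Move from 'from' to 'to' only if nobody changed the mode in between.
 * Returns true if this caller performed the transition.
 */
bool mode_transition(_Atomic int *mode, int from, int to)
{
	return atomic_compare_exchange_strong(mode, &from, to);
}
```

Only the caller that wins the exchange would go on to schedule the work, so two racing paths cannot both believe they performed the IDLE to REGULAR switch.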


Re: [PATCH 4/6] psi: introduce state_mask to represent stalled psi states

2018-12-17 Thread Suren Baghdasaryan

On Mon, Dec 17, 2018 at 7:55 AM Peter Zijlstra  wrote:
>
> On Fri, Dec 14, 2018 at 09:15:06AM -0800, Suren Baghdasaryan wrote:
> > The psi monitoring patches will need to determine the same states as
> > record_times(). To avoid calculating them twice, maintain a state mask
> > that can be consulted cheaply. Do this in a separate patch to keep the
> > churn in the main feature patch at a minimum.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/psi_types.h |  3 +++
> >  kernel/sched/psi.c| 29 +++--
> >  2 files changed, 22 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> > index 2cf422db5d18..2c6e9b67b7eb 100644
> > --- a/include/linux/psi_types.h
> > +++ b/include/linux/psi_types.h
> > @@ -53,6 +53,9 @@ struct psi_group_cpu {
> >   /* States of the tasks belonging to this group */
> >   unsigned int tasks[NR_PSI_TASK_COUNTS];
> >
> > + /* Aggregate pressure state derived from the tasks */
> > + u32 state_mask;
> > +
> >   /* Period time sampling buckets for each state of interest (ns) */
> >   u32 times[NR_PSI_STATES];
> >
>
> Since we spend so much time counting space in that line, maybe add a
> note to the Changelog about how this fits.

Will do.

> Also, since I just had to re-count, you might want to add explicit
> numbers to the psi_res and psi_states enums.

Sounds reasonable.

> > + if (state_mask & (1 << s))
>
> We have the BIT() macro, but I'm honestly not sure that will improve
> things.

I was mimicking the rest of the code in psi.c that uses this kind of
bit masking. Can change if you think that would be better.
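
For comparison, a small user-space sketch of the two spellings. The BIT() definition below is a stand-in for the kernel macro, and the enum is an illustrative copy of the state ordering with made-up names; for state indices this small both tests give the same answer.

```c
#include <stdint.h>

/* User-space stand-in for the kernel's BIT() macro. */
#define BIT(nr) (1UL << (nr))

/* Illustrative state indices; the names are assumptions for this sketch. */
enum psi_states_sketch {
	IO_SOME, IO_FULL, MEM_SOME, MEM_FULL, CPU_SOME, NONIDLE
};

/* Open-coded test, in the style used by the patch. */
static int state_set_shift(uint32_t state_mask, int s)
{
	return (state_mask & (1 << s)) != 0;
}

/* The same test written with BIT(). */
static int state_set_bit(uint32_t state_mask, int s)
{
	return (state_mask & BIT(s)) != 0;
}
```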

Re: [PATCH 6/6] psi: introduce psi monitor

2018-12-17 Thread Suren Baghdasaryan

On Mon, Dec 17, 2018 at 8:22 AM Peter Zijlstra wrote:
>
> On Fri, Dec 14, 2018 at 09:15:08AM -0800, Suren Baghdasaryan wrote:
> > +ssize_t psi_trigger_parse(char *buf, size_t nbytes, enum psi_res res,
> > + enum psi_states *state, u32 *threshold_us, u32 *win_sz_us)
> > +{
> > + bool some;
> > + bool threshold_pct;
> > + u32 threshold;
> > + u32 win_sz;
> > + char *p;
> > +
> > + p = strsep(&buf, " ");
> > + if (p == NULL)
> > + return -EINVAL;
> > +
> > + /* parse type */
> > + if (!strcmp(p, "some"))
> > + some = true;
> > + else if (!strcmp(p, "full"))
> > + some = false;
> > + else
> > + return -EINVAL;
> > +
> > + switch (res) {
> > + case (PSI_IO):
> > + *state = some ? PSI_IO_SOME : PSI_IO_FULL;
> > + break;
> > + case (PSI_MEM):
> > + *state = some ? PSI_MEM_SOME : PSI_MEM_FULL;
> > + break;
> > + case (PSI_CPU):
> > + if (!some)
> > + return -EINVAL;
> > + *state = PSI_CPU_SOME;
> > + break;
> > + default:
> > + return -EINVAL;
> > + }
> > +
> > + while (isspace(*buf))
> > + buf++;
> > +
> > + p = strsep(&buf, "%");
> > + if (p == NULL)
> > + return -EINVAL;
> > +
> > + if (buf == NULL) {
> > + /* % sign was not found, threshold is specified in us */
> > + buf = p;
> > + p = strsep(&buf, " ");
> > + if (p == NULL)
> > + return -EINVAL;
> > +
> > + threshold_pct = false;
> > + } else
> > + threshold_pct = true;
> > +
> > + /* parse threshold */
> > + if (kstrtouint(p, 0, &threshold))
> > + return -EINVAL;
> > +
> > + while (isspace(*buf))
> > + buf++;
> > +
> > + p = strsep(&buf, " ");
> > + if (p == NULL)
> > + return -EINVAL;
> > +
> > + /* Parse window size */
> > + if (kstrtouint(p, 0, &win_sz))
> > + return -EINVAL;
> > +
> > + /* Check window size */
> > + if (win_sz < PSI_TRIG_MIN_WIN_US || win_sz > PSI_TRIG_MAX_WIN_US)
> > + return -EINVAL;
> > +
> > + if (threshold_pct)
> > + threshold = (threshold * win_sz) / 100;
> > +
> > + /* Check threshold */
> > + if (threshold == 0 || threshold > win_sz)
> > + return -EINVAL;
> > +
> > + *threshold_us = threshold;
> > + *win_sz_us = win_sz;
> > +
> > + return 0;
> > +}
>
> How well has this thing been fuzzed? Custom string parser, yay!

Honestly, not much. Normal cases and some obvious corner cases. Will
check if I can use some fuzzer to get more coverage or will write a
script.
I'm not thrilled about writing a custom parser, so if there is a
better way to handle this please advise.


Re: [PATCH 6/6] psi: introduce psi monitor

2018-12-17 Thread Suren Baghdasaryan

On Mon, Dec 17, 2018 at 8:37 AM Peter Zijlstra wrote:
>
> On Fri, Dec 14, 2018 at 09:15:08AM -0800, Suren Baghdasaryan wrote:
> > @@ -358,28 +526,23 @@ static void psi_update_work(struct work_struct *work)
> >  {
> >   struct delayed_work *dwork;
> >   struct psi_group *group;
> > + u64 next_update;
> >
> >   dwork = to_delayed_work(work);
> >   group = container_of(dwork, struct psi_group, clock_work);
> >
> >   /*
> > +  * Periodically fold the per-cpu times and feed samples
> > +  * into the running averages.
> >*/
> >
> > + psi_update(group);
> >
> > + /* Calculate closest update time */
> > + next_update = min(group->polling_next_update,
> > + group->avg_next_update);
> > + schedule_delayed_work(dwork, min(PSI_FREQ,
> > + nsecs_to_jiffies(next_update - sched_clock()) + 1));
>
> See, so I don't at _all_ like how there is no idle option..

Copy that. Will see what we can do to bring it back.
Thanks!

> >  }
>
>

Re: [PATCH 6/6] psi: introduce psi monitor

2018-12-18 Thread Suren Baghdasaryan

The current design supports only whole percentages; if userspace needs
more granularity, it has to use usecs.
I agree that usecs cover the % use case and that "threshold * win / 100"
is simple enough for userspace to calculate. I'm fine with changing to
usecs only.
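
For reference, the userspace side of that conversion could be as small as the sketch below. The helper name is hypothetical; the only deliberate choice is doing the multiplication in 64 bits so a large window cannot overflow a 32-bit intermediate.

```c
#include <stdint.h>

/*
 * Convert a whole-percent threshold into microseconds for a given
 * window, i.e. "threshold * win / 100". The result is truncated
 * toward zero.
 */
static uint64_t pct_to_threshold_us(uint32_t pct, uint32_t win_us)
{
	return (uint64_t)pct * win_us / 100;
}
```

For example, 15% of a 1000000 us window comes out as 150000 us.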

On Tue, Dec 18, 2018 at 9:30 AM Johannes Weiner wrote:
>
> On Tue, Dec 18, 2018 at 11:46:22AM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 17, 2018 at 05:21:05PM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Dec 17, 2018 at 8:22 AM Peter Zijlstra  
> > > wrote:
> >
> > > > How well has this thing been fuzzed? Custom string parser, yay!
> > >
> > > Honestly, not much. Normal cases and some obvious corner cases. Will
> > > check if I can use some fuzzer to get more coverage or will write a
> > > script.
> > > I'm not thrilled about writing a custom parser, so if there is a
> > > better way to handle this please advise.
> >
> > The grammar seems fairly simple, something like:
> >
> >   some-full = "some" | "full" ;
> >   threshold-abs = integer ;
> >   threshold-pct = integer, { "%" } ;
> >   threshold = threshold-abs | threshold-pct ;
> >   window = integer ;
> >   trigger = some-full, space, threshold, space, window ;
> >
> > And that could even be expressed as two scanf formats:
> >
> >  "%4s %u%% %u" , "%4s %u %u"
> >
> > which then gets you something like:
> >
> >   char type[5];
> >
> >   if (sscanf(input, "%4s %u%% %u", &type, &pct, &window) == 3) {
> >   // do pct thing
> >   } else if (sscanf(input, "%4s %u %u", &type, &thres, &window) == 3) {
> >   // do abs thing
> >   } else return -EFAIL;
> >
> >   if (!strcmp(type, "some")) {
> >   // some
> >   } else if (!strcmp(type, "full")) {
> >   // full
> >   } else return -EFAIL;
> >
> >   // do more
>
> We might want to drop the percentage notation.
>
> While it's somewhat convenient, it's also not unreasonable to ask
> userspace to do a simple "threshold * win / 100" themselves, and it
> would simplify the interface spec and the parser.
>
> Sure, psi outputs percentages, but only for fixed window sizes, so
> that actually saves us something, whereas this parser here needs to
> take a fractional anyway. The output is also in decimal notation,
> which is necessary for granularity. And I really don't think we want
> to add float parsing on top of this interface spec.
>
> So neither the convenience nor the symmetry argument are very
> compelling IMO. It might be better to just not go there.
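
With the percentage notation dropped, the scanf-based approach reduces to a single format. The following is a user-space sketch only: struct trigger_spec and parse_trigger() are hypothetical names, the trailing "%c" is an added guard against leftover input, and the window min/max limits from the patch are left out because their values are not shown here.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical result type for a parsed trigger description. */
struct trigger_spec {
	int full;                  /* 0 for "some", 1 for "full" */
	unsigned int threshold_us;
	unsigned int window_us;
};

/* Parse "<some|full> <threshold_us> <window_us>"; returns 0 or -1. */
static int parse_trigger(const char *input, struct trigger_spec *out)
{
	char type[5];
	unsigned int threshold, window;
	char trailing;

	/*
	 * Exactly three conversions must succeed. A fourth one means
	 * there was trailing garbage; fewer means a malformed field,
	 * for example a stray '%' after the threshold.
	 */
	if (sscanf(input, "%4s %u %u %c",
		   type, &threshold, &window, &trailing) != 3)
		return -1;

	if (!strcmp(type, "some"))
		out->full = 0;
	else if (!strcmp(type, "full"))
		out->full = 1;
	else
		return -1;

	/* Same sanity check as in the patch: 0 < threshold <= window. */
	if (threshold == 0 || threshold > window)
		return -1;

	out->threshold_us = threshold;
	out->window_us = window;
	return 0;
}
```

A parser this small is also easy to fuzz from userspace, which addresses the coverage concern raised earlier in the thread.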

Re: [PATCH 6/6] psi: introduce psi monitor

2018-12-18 Thread Suren Baghdasaryan

On Tue, Dec 18, 2018 at 11:18 AM Joel Fernandes wrote:
>
> On Tue, Dec 18, 2018 at 9:58 AM 'Suren Baghdasaryan' via kernel-team
>  wrote:
> >
> > The current design supports only whole percentages; if userspace needs
> > more granularity, it has to use usecs.
> > I agree that usecs cover the % use case and that "threshold * win / 100"
> > is simple enough for userspace to calculate. I'm fine with changing to
> > usecs only.
>
> Suren, please avoid top-posting to LKML.

Sorry, did that by accident.

> Also I was going to say the same thing, just usecs only is better.

Thanks for the input.

> thanks,
>
>  - Joel


> > On Tue, Dec 18, 2018 at 9:30 AM Johannes Weiner  wrote:
> > >
> > > On Tue, Dec 18, 2018 at 11:46:22AM +0100, Peter Zijlstra wrote:
> > > > On Mon, Dec 17, 2018 at 05:21:05PM -0800, Suren Baghdasaryan wrote:
> > > > > On Mon, Dec 17, 2018 at 8:22 AM Peter Zijlstra  
> > > > > wrote:
> > > >
> > > > > > How well has this thing been fuzzed? Custom string parser, yay!
> > > > >
> > > > > Honestly, not much. Normal cases and some obvious corner cases. Will
> > > > > check if I can use some fuzzer to get more coverage or will write a
> > > > > script.
> > > > > I'm not thrilled about writing a custom parser, so if there is a
> > > > > better way to handle this please advise.
> > > >
> > > > The grammar seems fairly simple, something like:
> > > >
> > > >   some-full = "some" | "full" ;
> > > >   threshold-abs = integer ;
> > > >   threshold-pct = integer, { "%" } ;
> > > >   threshold = threshold-abs | threshold-pct ;
> > > >   window = integer ;
> > > >   trigger = some-full, space, threshold, space, window ;
> > > >
> > > > And that could even be expressed as two scanf formats:
> > > >
> > > >  "%4s %u%% %u" , "%4s %u %u"
> > > >
> > > > which then gets you something like:
> > > >
> > > >   char type[5];
> > > >
> > > >   if (sscanf(input, "%4s %u%% %u", &type, &pct, &window) == 3) {
> > > >   // do pct thing
> > > >   } else if (sscanf(input, "%4s %u %u", &type, &thres, &window) == 3) {
> > > >   // do abs thing
> > > >   } else return -EFAIL;
> > > >
> > > >   if (!strcmp(type, "some")) {
> > > >   // some
> > > >   } else if (!strcmp(type, "full")) {
> > > >   // full
> > > >   } else return -EFAIL;
> > > >
> > > >   // do more
> > >
> > > We might want to drop the percentage notation.
> > >
> > > While it's somewhat convenient, it's also not unreasonable to ask
> > > userspace to do a simple "threshold * win / 100" themselves, and it
> > > would simplify the interface spec and the parser.
> > >
> > > Sure, psi outputs percentages, but only for fixed window sizes, so
> > > that actually saves us something, whereas this parser here needs to
> > > take a fractional anyway. The output is also in decimal notation,
> > > which is necessary for granularity. And I really don't think we want
> > > to add float parsing on top of this interface spec.
> > >
> > > So neither the convenience nor the symmetry argument are very
> > > compelling IMO. It might be better to just not go there.
> >

[PATCH v2 3/5] psi: introduce state_mask to represent stalled psi states

2019-01-10 Thread Suren Baghdasaryan

The psi monitoring patches will need to determine the same states as
record_times(). To avoid calculating them twice, maintain a state mask
that can be consulted cheaply. Do this in a separate patch to keep the
churn in the main feature patch at a minimum.
This adds a 4-byte state_mask member to struct psi_group_cpu, which
makes its first cacheline-aligned part 52 bytes long. Add explicit
values to the enumeration element counters that affect the size of
struct psi_group_cpu.
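
The 52-byte figure can be checked with a small user-space sketch. The layout below is an assumption made for illustration: a 4-byte stand-in for the sequence counter, followed by tasks, state_mask, times and a 64-bit state_start. The function sums the member sizes only; the compiler may insert padding before the 8-byte member, so sizeof() of the sketch struct can be larger than 52.

```c
#include <stddef.h>
#include <stdint.h>

enum { NR_PSI_TASK_COUNTS = 3, NR_PSI_STATES = 6 };

/*
 * Assumed layout of the scheduler-updated part of psi_group_cpu.
 * The 4-byte seq stand-in and the member order are assumptions.
 */
struct psi_group_cpu_sketch {
	uint32_t seq;
	unsigned int tasks[NR_PSI_TASK_COUNTS];
	uint32_t state_mask;
	uint32_t times[NR_PSI_STATES];
	uint64_t state_start;
};

#define MEMBER_SIZE(type, member) sizeof(((type *)0)->member)

/* Sum of the member sizes, ignoring any padding. */
static size_t first_part_payload_bytes(void)
{
	return MEMBER_SIZE(struct psi_group_cpu_sketch, seq) +
	       MEMBER_SIZE(struct psi_group_cpu_sketch, tasks) +
	       MEMBER_SIZE(struct psi_group_cpu_sketch, state_mask) +
	       MEMBER_SIZE(struct psi_group_cpu_sketch, times) +
	       MEMBER_SIZE(struct psi_group_cpu_sketch, state_start);
}

/* Even with padding, the sketch still fits in a 64-byte cacheline. */
_Static_assert(sizeof(struct psi_group_cpu_sketch) <= 64,
	       "first part must fit in one cacheline");
```

With 4-byte unsigned int this is 4 + 12 + 4 + 24 + 8 = 52, and the explicit enum values make the two array sizes visible without counting enumerators.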

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/psi_types.h |  9 ++---
 kernel/sched/psi.c| 29 +++--
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 2cf422db5d18..762c6bb16f3c 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -11,7 +11,7 @@ enum psi_task_count {
NR_IOWAIT,
NR_MEMSTALL,
NR_RUNNING,
-   NR_PSI_TASK_COUNTS,
+   NR_PSI_TASK_COUNTS = 3,
 };
 
 /* Task state bitmasks */
@@ -24,7 +24,7 @@ enum psi_res {
PSI_IO,
PSI_MEM,
PSI_CPU,
-   NR_PSI_RESOURCES,
+   NR_PSI_RESOURCES = 3,
 };
 
 /*
@@ -41,7 +41,7 @@ enum psi_states {
PSI_CPU_SOME,
/* Only per-CPU, to weigh the CPU in the global average: */
PSI_NONIDLE,
-   NR_PSI_STATES,
+   NR_PSI_STATES = 6,
 };
 
 struct psi_group_cpu {
@@ -53,6 +53,9 @@ struct psi_group_cpu {
/* States of the tasks belonging to this group */
unsigned int tasks[NR_PSI_TASK_COUNTS];
 
+   /* Aggregate pressure state derived from the tasks */
+   u32 state_mask;
+
/* Period time sampling buckets for each state of interest (ns) */
u32 times[NR_PSI_STATES];
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index fe24de3fbc93..2262d920295f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -212,17 +212,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 {
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
-   unsigned int tasks[NR_PSI_TASK_COUNTS];
u64 now, state_start;
+   enum psi_states s;
unsigned int seq;
-   int s;
+   u32 state_mask;
 
/* Snapshot a coherent view of the CPU state */
do {
seq = read_seqcount_begin(&groupc->seq);
now = cpu_clock(cpu);
memcpy(times, groupc->times, sizeof(groupc->times));
-   memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
+   state_mask = groupc->state_mask;
state_start = groupc->state_start;
} while (read_seqcount_retry(&groupc->seq, seq));
 
@@ -238,7 +238,7 @@ static void get_recent_times(struct psi_group *group, int cpu, u32 *times)
 * (u32) and our reported pressure close to what's
 * actually happening.
 */
-   if (test_state(tasks, s))
+   if (state_mask & (1 << s))
times[s] += now - state_start;
 
delta = times[s] - groupc->times_prev[s];
@@ -406,15 +406,15 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
delta = now - groupc->state_start;
groupc->state_start = now;
 
-   if (test_state(groupc->tasks, PSI_IO_SOME)) {
+   if (groupc->state_mask & (1 << PSI_IO_SOME)) {
groupc->times[PSI_IO_SOME] += delta;
-   if (test_state(groupc->tasks, PSI_IO_FULL))
+   if (groupc->state_mask & (1 << PSI_IO_FULL))
groupc->times[PSI_IO_FULL] += delta;
}
 
-   if (test_state(groupc->tasks, PSI_MEM_SOME)) {
+   if (groupc->state_mask & (1 << PSI_MEM_SOME)) {
groupc->times[PSI_MEM_SOME] += delta;
-   if (test_state(groupc->tasks, PSI_MEM_FULL))
+   if (groupc->state_mask & (1 << PSI_MEM_FULL))
groupc->times[PSI_MEM_FULL] += delta;
else if (memstall_tick) {
u32 sample;
@@ -435,10 +435,10 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
}
}
 
-   if (test_state(groupc->tasks, PSI_CPU_SOME))
+   if (groupc->state_mask & (1 << PSI_CPU_SOME))
groupc->times[PSI_CPU_SOME] += delta;
 
-   if (test_state(groupc->tasks, PSI_NONIDLE))
+   if (groupc->state_mask & (1 << PSI_NONIDLE))
groupc->times[PSI_NONIDLE] += delta;
 }
 
@@ -447,6 +447,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 {
struct psi_group_cpu *groupc;
unsigned int t, m;
+   enum psi_states s;
+   u32 state_mask =

[PATCH v2 4/5] psi: rename psi fields in preparation for psi trigger addition

2019-01-10 Thread Suren Baghdasaryan

Rename the psi_group structure member fields used for calculating psi
totals and averages to clearly distinguish them from the trigger-related
fields that will be added next.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/psi_types.h | 15 ---
 kernel/sched/psi.c| 26 ++
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 762c6bb16f3c..47757668bdcb 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -69,20 +69,21 @@ struct psi_group_cpu {
 };
 
 struct psi_group {
-   /* Protects data updated during an aggregation */
-   struct mutex stat_lock;
+   /* Protects data used by the aggregator */
+   struct mutex update_lock;
 
/* Per-cpu task state & time tracking */
struct psi_group_cpu __percpu *pcpu;
 
-   /* Periodic aggregation state */
-   u64 total_prev[NR_PSI_STATES - 1];
-   u64 last_update;
-   u64 next_update;
struct delayed_work clock_work;
 
-   /* Total stall times and sampled pressure averages */
+   /* Total stall times observed */
u64 total[NR_PSI_STATES - 1];
+
+   /* Running pressure averages */
+   u64 avg_total[NR_PSI_STATES - 1];
+   u64 avg_last_update;
+   u64 avg_next_update;
unsigned long avg[NR_PSI_STATES - 1][3];
 };
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2262d920295f..c366503ba135 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -172,9 +172,9 @@ static void group_init(struct psi_group *group)
 
for_each_possible_cpu(cpu)
seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
-   group->next_update = sched_clock() + psi_period;
+   group->avg_next_update = sched_clock() + psi_period;
INIT_DELAYED_WORK(&group->clock_work, psi_update_work);
-   mutex_init(&group->stat_lock);
+   mutex_init(&group->update_lock);
 }
 
 void __init psi_init(void)
@@ -277,7 +277,7 @@ static bool update_stats(struct psi_group *group)
int cpu;
int s;
 
-   mutex_lock(&group->stat_lock);
+   mutex_lock(&group->update_lock);
 
/*
 * Collect the per-cpu time buckets and average them into a
@@ -318,7 +318,7 @@ static bool update_stats(struct psi_group *group)
 
/* avgX= */
now = sched_clock();
-   expires = group->next_update;
+   expires = group->avg_next_update;
if (now < expires)
goto out;
if (now - expires > psi_period)
@@ -331,14 +331,14 @@ static bool update_stats(struct psi_group *group)
 * But the deltas we sample out of the per-cpu buckets above
 * are based on the actual time elapsing between clock ticks.
 */
-   group->next_update = expires + ((1 + missed_periods) * psi_period);
-   period = now - (group->last_update + (missed_periods * psi_period));
-   group->last_update = now;
+   group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
+   period = now - (group->avg_last_update + (missed_periods * psi_period));
+   group->avg_last_update = now;
 
for (s = 0; s < NR_PSI_STATES - 1; s++) {
u32 sample;
 
-   sample = group->total[s] - group->total_prev[s];
+   sample = group->total[s] - group->avg_total[s];
/*
 * Due to the lockless sampling of the time buckets,
 * recorded time deltas can slip into the next period,
@@ -358,11 +358,11 @@ static bool update_stats(struct psi_group *group)
 */
if (sample > period)
sample = period;
-   group->total_prev[s] += sample;
+   group->avg_total[s] += sample;
calc_avgs(group->avg[s], missed_periods, sample, period);
}
 out:
-   mutex_unlock(&group->stat_lock);
+   mutex_unlock(&group->update_lock);
return nonidle_total;
 }
 
@@ -390,8 +390,10 @@ static void psi_update_work(struct work_struct *work)
u64 now;
 
now = sched_clock();
-   if (group->next_update > now)
-   delay = nsecs_to_jiffies(group->next_update - now) + 1;
+   if (group->avg_next_update > now) {
+   delay = nsecs_to_jiffies(
+   group->avg_next_update - now) + 1;
+   }
schedule_delayed_work(dwork, delay);
}
 }
-- 
2.20.1.97.g81188d93c3-goog
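
The avg_next_update arithmetic touched by this rename can be followed with a small user-space sketch. Everything with a _sketch suffix is a stand-in: the tick rate is an assumed value, the jiffies conversion is a plain truncating division, and the missed-period division is an assumption, since that line falls outside the hunks shown above.

```c
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL
#define HZ_SKETCH 250ULL /* assumed tick rate, for illustration only */

/* Truncating stand-in for the kernel's nsecs_to_jiffies(). */
static uint64_t nsecs_to_jiffies_sketch(uint64_t ns)
{
	return ns / (NSEC_PER_SEC_SKETCH / HZ_SKETCH);
}

/*
 * Next averaging deadline: unchanged while "now" is before the current
 * one, otherwise advanced by one period plus any whole periods missed.
 */
static uint64_t next_update_after(uint64_t expires, uint64_t now,
				  uint64_t period)
{
	uint64_t missed_periods = 0;

	if (now < expires)
		return expires;
	if (now - expires > period)
		missed_periods = (now - expires) / period; /* assumed */
	return expires + (1 + missed_periods) * period;
}

/* Delay for the delayed work, as computed in psi_update_work(). */
static uint64_t work_delay(uint64_t avg_next_update, uint64_t now)
{
	uint64_t delay = 0;

	if (avg_next_update > now)
		delay = nsecs_to_jiffies_sketch(avg_next_update - now) + 1;
	return delay;
}
```

The "+ 1" rounds the delay up by one tick so the work does not run just before the deadline it is meant to service.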

[PATCH v2 2/5] kernel: cgroup: add poll file operation

2019-01-10 Thread Suren Baghdasaryan

From: Johannes Weiner 

Cgroup has a standardized poll/notification mechanism for waking all
pollers on all fds when a filesystem node changes. To allow polling
for custom events, add a .poll callback that can override the default.

This is in preparation for pollable cgroup pressure files which have
per-fd trigger configurations.

Signed-off-by: Johannes Weiner 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/cgroup-defs.h |  4 
 kernel/cgroup/cgroup.c  | 12 
 2 files changed, 16 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5e1694fe035b..6f9ea8601421 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -32,6 +32,7 @@ struct kernfs_node;
 struct kernfs_ops;
 struct kernfs_open_file;
 struct seq_file;
+struct poll_table_struct;
 
 #define MAX_CGROUP_TYPE_NAMELEN 32
 #define MAX_CGROUP_ROOT_NAMELEN 64
@@ -573,6 +574,9 @@ struct cftype {
ssize_t (*write)(struct kernfs_open_file *of,
 char *buf, size_t nbytes, loff_t off);
 
+   __poll_t (*poll)(struct kernfs_open_file *of,
+struct poll_table_struct *pt);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lock_class_key   lockdep_key;
 #endif
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7a8429f8e280..3f533f95acdc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3499,6 +3499,16 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
return ret ?: nbytes;
 }
 
+static __poll_t cgroup_file_poll(struct kernfs_open_file *of, poll_table *pt)
+{
+   struct cftype *cft = of->kn->priv;
+
+   if (cft->poll)
+   return cft->poll(of, pt);
+
+   return kernfs_generic_poll(of, pt);
+}
+
 static void *cgroup_seqfile_start(struct seq_file *seq, loff_t *ppos)
 {
return seq_cft(seq)->seq_start(seq, ppos);
@@ -3537,6 +3547,7 @@ static struct kernfs_ops cgroup_kf_single_ops = {
.open   = cgroup_file_open,
.release= cgroup_file_release,
.write  = cgroup_file_write,
+   .poll   = cgroup_file_poll,
.seq_show   = cgroup_seqfile_show,
 };
 
@@ -3545,6 +3556,7 @@ static struct kernfs_ops cgroup_kf_ops = {
.open   = cgroup_file_open,
.release= cgroup_file_release,
.write  = cgroup_file_write,
+   .poll   = cgroup_file_poll,
.seq_start  = cgroup_seqfile_start,
.seq_next   = cgroup_seqfile_next,
.seq_stop   = cgroup_seqfile_stop,
-- 
2.20.1.97.g81188d93c3-goog
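
The dispatch pattern added by cgroup_file_poll() is an optional callback with a generic fallback. A stripped-down user-space sketch, in which every type, constant and function name is a stand-in rather than a kernel interface:

```c
#include <stddef.h>

typedef unsigned int poll_mask_t; /* stand-in for __poll_t */

#define SKETCH_DEFAULT_EVENTS 0x1u
#define SKETCH_CUSTOM_EVENTS  0x2u

/* Stand-in for a file type descriptor with an optional poll hook. */
struct file_ops_sketch {
	poll_mask_t (*poll)(void *ctx);
};

/* Default behaviour, used when a file type has no hook of its own. */
static poll_mask_t generic_poll(void *ctx)
{
	(void)ctx;
	return SKETCH_DEFAULT_EVENTS;
}

/* Example of a file type that reports its own events. */
static poll_mask_t custom_poll(void *ctx)
{
	(void)ctx;
	return SKETCH_CUSTOM_EVENTS;
}

/* Call the hook if one is set, otherwise fall back to the default. */
static poll_mask_t file_poll(const struct file_ops_sketch *ops, void *ctx)
{
	if (ops->poll)
		return ops->poll(ctx);
	return generic_poll(ctx);
}
```

Existing file types leave the hook unset and keep the default behaviour, which is why the patch only has to add the member and the wrapper.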
