Re: [PATCH v6 07/15] digest_cache: Allow registration of digest list parsers

2024-11-26 Thread Luis Chamberlain
On Tue, Nov 26, 2024 at 11:25:07AM +0100, Roberto Sassu wrote:
> On Mon, 2024-11-25 at 15:53 -0800, Luis Chamberlain wrote:
> 
> Firmware, eBPF programs and so on are supposed

Keyword: "supposed". 

> As far as the LSM infrastructure is concerned, I'm not adding new LSM
> hooks, nor extending/modifying the existing ones. The operations the
> Integrity Digest Cache is doing match the usage expectation by LSM (not
> denying access, as discussed with Paul Moore).

If modules are the only proven exception to your security model, you are
not making the case for it clearly.

> The Integrity Digest Cache is supposed to be used as a supporting tool
> for other LSMs to do regular access control based on file data and
> metadata integrity. In doing that, it still needs the LSM
> infrastructure to notify about filesystem changes, and to store
> additional information in the inode and file descriptor security blobs.
> 
> The kernel_post_read_file LSM hook should be implemented by another LSM
> to verify the integrity of a digest list, when the Integrity Digest
> Cache calls kernel_read_file() to read that digest list.

If LSM folks *do* agree that this work is *supplementing* LSMs then sure,
it was not clear from the commit logs. But then you need to ensure the
parsers are special snowflakes which won't ever incur additional
kernel_read_file() calls.

> Supporting kernel modules opened the road for new deadlocks, since one
> can ask a digest list to verify a kernel module, but that digest list
> requires the same kernel module. That is why the in-kernel mechanism is
> 100% reliable,

Are users of this infrastructure really in need of modules for these
parsers?

  Luis



Re: [PATCH] docs: media: document media multi-committers rules and process

2024-11-26 Thread Laurent Pinchart
Hi Mauro and Hans,

On Mon, Nov 25, 2024 at 02:28:58PM +0100, Mauro Carvalho Chehab wrote:
> As the media subsystem will experiment with a multi-committers model,
> update the Maintainer's entry profile to the new rules, and add a file
> documenting the process to become a committer and to maintain such
> rights.
> 
> Signed-off-by: Mauro Carvalho Chehab 
> Signed-off-by: Hans Verkuil 
> ---
>  Documentation/driver-api/media/index.rst  |   1 +
>  .../media/maintainer-entry-profile.rst| 193 ++
>  .../driver-api/media/media-committer.rst  | 252 ++
>  .../process/maintainer-pgp-guide.rst  |   2 +
>  4 files changed, 398 insertions(+), 50 deletions(-)
>  create mode 100644 Documentation/driver-api/media/media-committer.rst
> 
> diff --git a/Documentation/driver-api/media/index.rst 
> b/Documentation/driver-api/media/index.rst
> index d5593182a3f9..d0c725fcbc67 100644
> --- a/Documentation/driver-api/media/index.rst
> +++ b/Documentation/driver-api/media/index.rst
> @@ -26,6 +26,7 @@ Documentation/userspace-api/media/index.rst
>  :numbered:
>  
>  maintainer-entry-profile
> +media-committer
>  
>  v4l2-core
>  dtv-core
> diff --git a/Documentation/driver-api/media/maintainer-entry-profile.rst 
> b/Documentation/driver-api/media/maintainer-entry-profile.rst
> index ffc712a5f632..90c6c0d9cf17 100644
> --- a/Documentation/driver-api/media/maintainer-entry-profile.rst
> +++ b/Documentation/driver-api/media/maintainer-entry-profile.rst
> @@ -27,19 +27,128 @@ It covers, mainly, the contents of those directories:
>  Both media userspace and Kernel APIs are documented and the documentation
>  must be kept in sync with the API changes. It means that all patches that
>  add new features to the subsystem must also bring changes to the
> -corresponding API files.
> +corresponding API documentation files.

I would have split this kind of small change into a separate patch to
make reviews easier, but that's not a big deal.

>  
> -Due to the size and wide scope of the media subsystem, media's
> -maintainership model is to have sub-maintainers that have a broad
> -knowledge of a specific aspect of the subsystem. It is the sub-maintainers'
> -task to review the patches, providing feedback to users if the patches are
> +Due to the size and wide scope of the media subsystem, the media's
> +maintainership model is to have committers that have a broad knowledge of
> +a specific aspect of the subsystem. It is the committers' task to
> +review the patches, providing feedback to users if the patches are
>  following the subsystem rules and are properly using the media kernel and
>  userspace APIs.

This sounds really like a maintainer definition. I won't bikeshed too
much on the wording though, we will always be able to adjust it later to
reflect the reality of the situation as it evolves. I do like the
removal of the "sub-maintainer" term though, as I've always found it
demeaning.

>  
> -Patches for the media subsystem must be sent to the media mailing list
> -at linux-me...@vger.kernel.org as plain text only e-mail. Emails with
> -HTML will be automatically rejected by the mail server. It could be wise
> -to also copy the sub-maintainer(s).
> +Media committers
> +================
> +
> +In the media subsystem, there are experienced developers that can commit

s/that/who/
s/commit/push/ to standardize the vocabulary (below you use "upload" to
mean the same thing)

> +patches directly on a development tree. These developers are called

s/on a/to the/

> +Media committers and are divided into the following categories:
> +
> +- Committers: responsible for one or more drivers within the media subsystem.
> +  They can upload changes to the tree that do not affect the core or ABI.

s/upload/push/

> +
> +- Core committers: responsible for part of the media core. They are typically
> +  responsible for one or more drivers within the media subsystem, but, 
> besides
> +  that, they can also merge patches that change the code common to multiple
> +  drivers, including the kernel internal API/ABI.

I would write "API" only here. Neither the kernel internal API nor its
internal ABI are stable, and given that lack of stability, the ABI
concept doesn't really apply within the kernel.

> +
> +- Subsystem maintainers: responsible for the subsystem as a whole, with
> +  access to the entire subsystem.
> +
> +  Only subsystem maintainers can change the userspace API/ABI.

This can give the impression that only subsystem maintainers are allowed
to work on the API. I would write

  Only subsystem maintainers can push changes that affect the userspace
  API/ABI.

> +
> +Media committers shall explicitly agree with the Kernel development process

Do we have to capitalize "Kernel" everywhere? There are way more
occurrences of "kernel" than "Kernel" in Documentation/ (even excluding
the lower case occurrences in e-mail addresses, file paths, ...).

> +as described at Documentation

Re: [PATCH v4 0/9] mm: workingset reporting

2024-11-26 Thread Johannes Weiner
On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, whose coldness can be tracked by accessed bits and fd
> references. However, the concept of workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or the userspace.
> Another interesting idea might be hugepage workingset, so that we can
> measure the proportion of hugepages backing cold memory. However, with
> architectures like arm, there may be too many hugepage sizes leading to
> a combinatorial explosion when exporting stats to the userspace.
> Nonetheless, the kernel should provide a set of workingset interfaces
> that is generic enough to accommodate the various use cases, and extensible
> to potential future use cases.

Doesn't DAMON already provide this information?

CCing SJ.

> Use cases
> ==
> Job scheduling
> On overcommitted hosts, workingset information improves efficiency and
> reliability by allowing the job scheduler to have better stats on the
> exact memory requirements of each job. This can manifest in efficiency by
> landing more jobs on the same host or NUMA node. On the other hand, the
> job scheduler can also ensure each node has a sufficient amount of memory
> and does not enter direct reclaim or the kernel OOM path. With workingset
> information and job priority, the userspace OOM killing or proactive
> reclaim policy can kick in before the system is under memory pressure.
> If the job shape is very different from the machine shape, knowing the
> workingset per-node can also help inform page allocation policies.
> 
> Proactive reclaim
> Workingset information allows a container manager to proactively
> reclaim memory while not impacting a job's performance. While PSI may
> provide a reactive measure of when a proactive reclaim has reclaimed too
> much, workingset reporting allows the policy to be more accurate and
> flexible.

I'm not sure about more accurate.

Access frequency is only half the picture. Whether you need to keep
memory with a given frequency resident depends on the speed of the
backing device.

There is memory compression; there is swap on flash; swap on crappy
flash; swapfiles that share IOPS with co-located filesystems. There is
zswap+writeback, where avg refault speed can vary dramatically.

You can of course offload much more to a fast zswap backend than to a
swapfile on a struggling flashdrive, with comparable app performance.

So I think you'd be hard pressed to achieve a high level of accuracy
in the usecases you list without taking the (often highly dynamic)
cost of paging / memory transfer into account.

There is a more detailed discussion of this in a paper we wrote on
proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:

https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf

> Ballooning (similar to proactive reclaim)
> The last patch of the series extends the virtio-balloon device to report
> the guest workingset.
> Balloon policies benefit from workingset information to more precisely
> determine the size of the memory balloon. On end-user devices where
> memory is scarce and
> overcommitted, the balloon sizing in multiple VMs running on the same
> device can be orchestrated with workingset reports from each one.
> On the server side, workingset reporting allows the balloon controller to
> inflate the balloon without causing too much file cache to be reclaimed in
> the guest.
> 
> Promotion/Demotion
> If different mechanisms are used for promotion and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, consider a promotion hot page threshold defined as a
> reaccess distance of N seconds (promote pages accessed more often than
> every N seconds). The threshold N should be set so that ~80% (e.g.) of
> pages on the fast memory node pass the threshold. This calculation can
> be done with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
> 
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
>
> Sysfs and Cgroup Interfaces
> ==
> The interfaces are detailed in the patches that introduce them. The main
> idea here is we break down the workingset per-node per-memcg into time
> intervals (ms), e.g.
> 
> 1000 anon=137368 file=24530
> 2 anon=34342 file=0
> 3 anon=353232 file=333608
> 4 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
>
> Implementation
> ==
> The reporting of user pages is based off of MGLRU, and therefore requires
> CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
> fine-grained workingset rep

Re: [PATCH v4 1/9] mm: aggregate workingset information into histograms

2024-11-26 Thread Matthew Wilcox
On Tue, Nov 26, 2024 at 06:57:20PM -0800, Yuanchu Xie wrote:
> diff --git a/mm/internal.h b/mm/internal.h
> index 64c2eb0b160e..bbd3c1501bac 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn;
>  /*
>   * in mm/vmscan.c:
>   */
> +struct scan_control;
> +bool isolate_lru_page(struct page *page);

Is this a mismerge?  It doesn't exist any more.



[PATCH v4 1/9] mm: aggregate workingset information into histograms

2024-11-26 Thread Yuanchu Xie
Hierarchically aggregate all memcgs' MGLRU generations and their
page counts into working set page age histograms.
The histograms break down the system's workingset per-node,
per-anon/file.

The sysfs interfaces are as follows:
/sys/devices/system/node/nodeX/workingset_report/page_age
A per-node page age histogram, showing an aggregate of the
node's lruvecs. The information is extracted from MGLRU's
per-generation page counters. Reading this file causes a
hierarchical aging of all lruvecs, scanning pages and creating a
new generation in each lruvec.
For example:
1000 anon=0 file=0
2000 anon=0 file=0
10 anon=5533696 file=5566464
18446744073709551615 anon=0 file=0

/sys/devices/system/node/nodeX/workingset_report/page_age_interval
A comma-separated list of times, in milliseconds, that configures
the intervals the page age histogram uses for aggregation.

Signed-off-by: Yuanchu Xie 
---
 drivers/base/node.c   |   6 +
 include/linux/mmzone.h|   9 +
 include/linux/workingset_report.h |  79 ++
 mm/Kconfig|   9 +
 mm/Makefile   |   1 +
 mm/internal.h |   5 +
 mm/memcontrol.c   |   2 +
 mm/mm_init.c  |   2 +
 mm/mmzone.c   |   2 +
 mm/vmscan.c   |  10 +-
 mm/workingset_report.c| 451 ++
 11 files changed, 572 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c

diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..ba5b8720dbfa 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static const struct bus_type node_subsys = {
.name = "node",
@@ -626,6 +628,7 @@ static int register_node(struct node *node, int num)
} else {
hugetlb_register_node(node);
compaction_register_node(node);
+   wsr_init_sysfs(node);
}
 
return error;
@@ -642,6 +645,9 @@ void unregister_node(struct node *node)
 {
hugetlb_unregister_node(node);
compaction_unregister_node(node);
+   wsr_remove_sysfs(node);
+   wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id)));
+   wsr_destroy_pgdat(NODE_DATA(node->dev.id));
node_remove_accesses(node);
node_remove_caches(node);
device_unregister(&node->dev);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 80bc5640bb60..ee728c0c5a3b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Free memory management - zoned buddy allocator.  */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
@@ -630,6 +631,9 @@ struct lruvec {
struct lru_gen_mm_state mm_state;
 #endif
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_WORKINGSET_REPORT
+   struct wsr_statewsr;
+#endif /* CONFIG_WORKINGSET_REPORT */
 #ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
 #endif
@@ -1424,6 +1428,11 @@ typedef struct pglist_data {
struct lru_gen_memcg memcg_lru;
 #endif
 
+#ifdef CONFIG_WORKINGSET_REPORT
+   struct mutex wsr_update_mutex;
+   struct wsr_report_bins __rcu *wsr_page_age_bins;
+#endif
+
CACHELINE_PADDING(_pad2_);
 
/* Per-node vmstats */
diff --git a/include/linux/workingset_report.h 
b/include/linux/workingset_report.h
new file mode 100644
index ..d7c2ee14ec87
--- /dev/null
+++ b/include/linux/workingset_report.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WORKINGSET_REPORT_H
+#define _LINUX_WORKINGSET_REPORT_H
+
+#include 
+#include 
+
+struct mem_cgroup;
+struct pglist_data;
+struct node;
+struct lruvec;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+
+#define WORKINGSET_REPORT_MIN_NR_BINS 2
+#define WORKINGSET_REPORT_MAX_NR_BINS 32
+
+#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+#define ANON_AND_FILE 2
+
+struct wsr_report_bin {
+   unsigned long idle_age;
+   unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr_report_bins {
+   /* excludes the WORKINGSET_INTERVAL_MAX bin */
+   unsigned long nr_bins;
+   /* last bin contains WORKINGSET_INTERVAL_MAX */
+   unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS];
+   struct rcu_head rcu;
+};
+
+struct wsr_page_age_histo {
+   unsigned long timestamp;
+   struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct wsr_state {
+   /* breakdown of workingset by page age */
+   struct mutex page_age_lock;
+   struct wsr_page_age_histo *page_age;
+};
+
+void wsr_init_lruvec(struct lruvec *lruvec);
+void wsr_destroy_lruvec(struct lruvec *lruvec);
+void wsr_init_pgdat(struct pglist_data *pgdat);
+void wsr_destroy_pgdat(struct pglist_dat

[PATCH v4 2/9] mm: use refresh interval to rate-limit workingset report aggregation

2024-11-26 Thread Yuanchu Xie
The refresh interval is a rate-limiting factor for workingset page age
histogram reads. When a workingset report is generated, the oldest
timestamp of all the lruvecs is stored as the timestamp of the report.
The same report is returned until it expires beyond the refresh
interval, at which point a new report is generated.

Sysfs interface
/sys/devices/system/node/nodeX/workingset_report/refresh_interval
time in milliseconds specifying how long the report is valid for

Signed-off-by: Yuanchu Xie 
---
 include/linux/workingset_report.h |   1 +
 mm/workingset_report.c| 101 --
 2 files changed, 83 insertions(+), 19 deletions(-)

diff --git a/include/linux/workingset_report.h 
b/include/linux/workingset_report.h
index d7c2ee14ec87..8bae6a600410 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -37,6 +37,7 @@ struct wsr_page_age_histo {
 };
 
 struct wsr_state {
+   unsigned long refresh_interval;
/* breakdown of workingset by page age */
struct mutex page_age_lock;
struct wsr_page_age_histo *page_age;
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index a4dcf62fcd96..8678536ccfc7 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -174,9 +174,11 @@ static void collect_page_age_type(const struct 
lru_gen_folio *lrugen,
  * Assume the heuristic that pages are in the MGLRU generation
  * through uniform accesses, so we can aggregate them
  * proportionally into bins.
+ *
+ * Returns the timestamp of the youngest gen in this lruvec.
  */
-static void collect_page_age(struct wsr_page_age_histo *page_age,
-const struct lruvec *lruvec)
+static unsigned long collect_page_age(struct wsr_page_age_histo *page_age,
+ const struct lruvec *lruvec)
 {
int type;
const struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -191,11 +193,14 @@ static void collect_page_age(struct wsr_page_age_histo 
*page_age,
for (type = 0; type < ANON_AND_FILE; type++)
collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
  curr_timestamp, type);
+
+   return READ_ONCE(lruvec->lrugen.timestamps[lru_gen_from_seq(max_seq)]);
 }
 
 /* First step: hierarchically scan child memcgs. */
 static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
-struct pglist_data *pgdat)
+struct pglist_data *pgdat,
+unsigned long refresh_interval)
 {
struct mem_cgroup *memcg;
unsigned int flags;
@@ -208,12 +213,15 @@ static void refresh_scan(struct wsr_state *wsr, struct 
mem_cgroup *root,
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+   int gen = lru_gen_from_seq(max_seq);
+   unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
 
/*
 * setting can_swap=true and force_scan=true ensures
 * proper workingset stats when the system cannot swap.
 */
-   try_to_inc_max_seq(lruvec, max_seq, true, true);
+   if (time_is_before_jiffies(birth + refresh_interval))
+   try_to_inc_max_seq(lruvec, max_seq, true, true);
cond_resched();
} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 
@@ -228,6 +236,7 @@ static void refresh_aggregate(struct wsr_page_age_histo 
*page_age,
 {
struct mem_cgroup *memcg;
struct wsr_report_bin *bin;
+   unsigned long oldest_lruvec_time = jiffies;
 
for (bin = page_age->bins;
 bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
@@ -241,11 +250,15 @@ static void refresh_aggregate(struct wsr_page_age_histo 
*page_age,
memcg = mem_cgroup_iter(root, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+   unsigned long lruvec_time =
+   collect_page_age(page_age, lruvec);
+
+   if (time_before(lruvec_time, oldest_lruvec_time))
+   oldest_lruvec_time = lruvec_time;
 
-   collect_page_age(page_age, lruvec);
cond_resched();
} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
-   WRITE_ONCE(page_age->timestamp, jiffies);
+   WRITE_ONCE(page_age->timestamp, oldest_lruvec_time);
 }
 
 static void copy_node_bins(struct pglist_data *pgdat,
@@ -270,17 +283,25 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct 
mem_cgroup *root,
struct pglist_data *pgdat)
 {
struct wsr_page_age_histo *page_age;
+   unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
 
if (!READ_ONCE(wsr->page_age))
return false;
 
-  

[PATCH v4 3/9] mm: report workingset during memory pressure driven scanning

2024-11-26 Thread Yuanchu Xie
When a node reaches its low watermarks and wakes up kswapd, notify all
userspace programs waiting on the workingset page age histogram of the
memory pressure, so a userspace agent can read the workingset report in
time and make policy decisions, such as logging, oom-killing, or
migration.

Sysfs interface:
/sys/devices/system/node/nodeX/workingset_report/report_threshold
time in milliseconds that specifies how often the userspace
agent can be notified of node memory pressure.

Signed-off-by: Yuanchu Xie 
---
 include/linux/workingset_report.h |  4 +++
 mm/internal.h | 12 
 mm/vmscan.c   | 46 +++
 mm/workingset_report.c| 43 -
 4 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/workingset_report.h 
b/include/linux/workingset_report.h
index 8bae6a600410..2ec8b927b200 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -37,7 +37,11 @@ struct wsr_page_age_histo {
 };
 
 struct wsr_state {
+   unsigned long report_threshold;
unsigned long refresh_interval;
+
+   struct kernfs_node *page_age_sys_file;
+
/* breakdown of workingset by page age */
struct mutex page_age_lock;
struct wsr_page_age_histo *page_age;
diff --git a/mm/internal.h b/mm/internal.h
index bbd3c1501bac..508b7d9937d6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -479,6 +479,18 @@ bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned 
long seq, bool can_swap,
bool force_scan);
 void set_task_reclaim_state(struct task_struct *task, struct reclaim_state 
*rs);
 
+#ifdef CONFIG_WORKINGSET_REPORT
+/*
+ * in mm/wsr.c
+ */
+void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat);
+#else
+static inline void notify_workingset(struct mem_cgroup *memcg,
+struct pglist_data *pgdat)
+{
+}
+#endif
+
 /*
  * in mm/rmap.c:
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 89da4d8dfb5f..2bca81271d15 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2578,6 +2578,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
return can_demote(pgdat->node_id, sc);
 }
 
+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat, struct 
scan_control *sc);
+#else
+static inline void try_to_report_workingset(struct pglist_data *pgdat,
+   struct scan_control *sc)
+{
+}
+#endif
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -4004,6 +4013,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, 
struct scan_control *sc)
 
set_initial_priority(pgdat, sc);
 
+   try_to_report_workingset(pgdat, sc);
+
memcg = mem_cgroup_iter(NULL, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -5649,6 +5660,38 @@ static int __init init_lru_gen(void)
 };
 late_initcall(init_lru_gen);
 
+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat,
+struct scan_control *sc)
+{
+   struct mem_cgroup *memcg = sc->target_mem_cgroup;
+   struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
+   unsigned long threshold = READ_ONCE(wsr->report_threshold);
+
+   if (sc->priority == DEF_PRIORITY)
+   return;
+
+   if (!threshold)
+   return;
+
+   if (!mutex_trylock(&wsr->page_age_lock))
+   return;
+
+   if (!wsr->page_age) {
+   mutex_unlock(&wsr->page_age_lock);
+   return;
+   }
+
+   if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) {
+   mutex_unlock(&wsr->page_age_lock);
+   return;
+   }
+
+   mutex_unlock(&wsr->page_age_lock);
+   notify_workingset(memcg, pgdat);
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
 #else /* !CONFIG_LRU_GEN */
 
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control 
*sc)
@@ -6200,6 +6243,9 @@ static void shrink_zones(struct zonelist *zonelist, 
struct scan_control *sc)
if (zone->zone_pgdat == last_pgdat)
continue;
last_pgdat = zone->zone_pgdat;
+
+   if (!sc->proactive)
+   try_to_report_workingset(zone->zone_pgdat, sc);
shrink_node(zone->zone_pgdat, sc);
}
 
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 8678536ccfc7..bbefb0046669 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -320,6 +320,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
 }
 
+static ssize_t report_threshold_show(struct kobject *kobj,
+struct kobj_attribute *attr, char *buf)
+{
+   struct wsr_s

[PATCH v4 6/9] selftest: test system-wide workingset reporting

2024-11-26 Thread Yuanchu Xie
A basic test that verifies the working set size of a simple memory
accessor. It should work with or without the aging thread.

When running tests with run_vmtests.sh, file workingset report testing
requires an environment variable WORKINGSET_REPORT_TEST_FILE_PATH to
store a temporary file, which is passed into the test invocation as a
parameter.

Signed-off-by: Yuanchu Xie 
---
 tools/testing/selftests/mm/.gitignore |   1 +
 tools/testing/selftests/mm/Makefile   |   3 +
 tools/testing/selftests/mm/run_vmtests.sh |   5 +
 .../testing/selftests/mm/workingset_report.c  | 306 
 .../testing/selftests/mm/workingset_report.h  |  39 +++
 .../selftests/mm/workingset_report_test.c | 330 ++
 6 files changed, 684 insertions(+)
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

diff --git a/tools/testing/selftests/mm/.gitignore 
b/tools/testing/selftests/mm/.gitignore
index da030b43e43b..e5cd0085ab74 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -51,3 +51,4 @@ hugetlb_madv_vs_map
 mseal_test
 seal_elf
 droppable
+workingset_report_test
diff --git a/tools/testing/selftests/mm/Makefile 
b/tools/testing/selftests/mm/Makefile
index 0f8c110e0805..5c6a7464da6e 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
 TEST_GEN_FILES += hugetlb_madv_vs_map
 TEST_GEN_FILES += hugetlb_dio
 TEST_GEN_FILES += droppable
+TEST_GEN_FILES += workingset_report_test
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
@@ -138,6 +139,8 @@ $(TEST_GEN_FILES): vm_util.c thp_settings.c
 $(OUTPUT)/uffd-stress: uffd-common.c
 $(OUTPUT)/uffd-unit-tests: uffd-common.c
 
+$(OUTPUT)/workingset_report_test: workingset_report.c
+
 ifeq ($(ARCH),x86_64)
 BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
 BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))
diff --git a/tools/testing/selftests/mm/run_vmtests.sh 
b/tools/testing/selftests/mm/run_vmtests.sh
index c5797ad1d37b..63782667381a 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -75,6 +75,8 @@ separated by spaces:
read-only VMAs
 - mdwe
test prctl(PR_SET_MDWE, ...)
+- workingset_report
+   test workingset reporting
 
 example: ./run_vmtests.sh -t "hmm mmap ksm"
 EOF
@@ -456,6 +458,9 @@ CATEGORY="mkdirty" run_test ./mkdirty
 
 CATEGORY="mdwe" run_test ./mdwe_test
 
+CATEGORY="workingset_report" run_test ./workingset_report_test \
+  "${WORKINGSET_REPORT_TEST_FILE_PATH}"
+
 echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | 
tap_prefix
 echo "1..${count_total}" | tap_output
 
diff --git a/tools/testing/selftests/mm/workingset_report.c 
b/tools/testing/selftests/mm/workingset_report.c
new file mode 100644
index ..ee4dda5c371d
--- /dev/null
+++ b/tools/testing/selftests/mm/workingset_report.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "workingset_report.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+#define SYSFS_NODE_ONLINE "/sys/devices/system/node/online"
+#define PROC_DROP_CACHES "/proc/sys/vm/drop_caches"
+
+/* Returns read len on success, or -errno on failure. */
+static ssize_t read_text(const char *path, char *buf, size_t max_len)
+{
+   ssize_t len;
+   int fd, err;
+   size_t bytes_read = 0;
+
+   if (!max_len)
+   return -EINVAL;
+
+   fd = open(path, O_RDONLY);
+   if (fd < 0)
+   return -errno;
+
+   while (bytes_read < max_len - 1) {
+   len = read(fd, buf + bytes_read, max_len - 1 - bytes_read);
+
+   if (len <= 0)
+   break;
+   bytes_read += len;
+   }
+
+   buf[bytes_read] = '\0';
+
+   err = -errno;
+   close(fd);
+   return len < 0 ? err : bytes_read;
+}
+
+/* Returns written len on success, or -errno on failure. */
+static ssize_t write_text(const char *path, const char *buf, ssize_t max_len)
+{
+   int fd, len, err;
+   size_t bytes_written = 0;
+
+   fd = open(path, O_WRONLY | O_APPEND);
+   if (fd < 0)
+   return -errno;
+
+   while (bytes_written < max_len) {
+   len = write(fd, buf + bytes_written, max_len - bytes_written);
+
+   if (len < 0)
+   break;
+   bytes_written += len;
+   }
+
+   err = -errno;
+   close(fd);
+   return len < 0 ? err : bytes_written;
+}
+
+static long read_num(const char *path)
+{
+   char buf[21];
+
+   if (read_text(path, buf, sizeof(buf)) <= 0)
+   return -1;
+   return (long)strtoul(buf, NULL, 10);
+}
+
+static int writ

[PATCH v4 4/9] mm: extend workingset reporting to memcgs

2024-11-26 Thread Yuanchu Xie
Break down the system-wide workingset report into per-memcg reports,
which aggregate their children hierarchically.

The per-node workingset reporting histograms and refresh/report
threshold files are presented as memcg files, showing a report
containing all the nodes.

The per-node page age interval is configurable in sysfs and not
available per-memcg, while the refresh interval and report threshold are
configured per-memcg.

Memcg interface:
/sys/fs/cgroup/.../memory.workingset.page_age
The memcg equivalent of the sysfs workingset page age histogram
breaks down the workingset of this memcg and its children into
page age intervals. Each node is prefixed with a node header and
a newline. Non-proactive direct reclaim on this memcg can also
wake up userspace agents that are waiting on this file.
e.g.
N0
1000 anon=0 file=0
2000 anon=0 file=0
3000 anon=0 file=0
4000 anon=0 file=0
5000 anon=0 file=0
18446744073709551615 anon=0 file=0

/sys/fs/cgroup/.../memory.workingset.refresh_interval
The memcg equivalent of the sysfs refresh interval. A per-node
number of how long a page age histogram is valid for, in
milliseconds.
e.g.
echo N0=2000 > memory.workingset.refresh_interval

/sys/fs/cgroup/.../memory.workingset.report_threshold
The memcg equivalent of the sysfs report threshold. A per-node
number of how often a userspace agent waiting on the page age
histogram can be woken up, in milliseconds.
e.g.
echo N0=1000 > memory.workingset.report_threshold

Signed-off-by: Yuanchu Xie 
---
 include/linux/memcontrol.h|  21 
 include/linux/workingset_report.h |  15 ++-
 mm/internal.h |   2 +
 mm/memcontrol.c   | 160 +-
 mm/workingset_report.c|  50 +++---
 5 files changed, 230 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1b41554a5fb..fd595b33a54d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -323,6 +323,11 @@ struct mem_cgroup {
spinlock_t event_list_lock;
 #endif /* CONFIG_MEMCG_V1 */
 
+#ifdef CONFIG_WORKINGSET_REPORT
+   /* memory.workingset.page_age file */
+   struct cgroup_file workingset_page_age_file;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
 };
 
@@ -1094,6 +1099,16 @@ static inline void memcg_memory_event_mm(struct 
mm_struct *mm,
 
 void split_page_memcg(struct page *head, int old_order, int new_order);
 
+static inline struct cgroup_file *
+mem_cgroup_page_age_file(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_WORKINGSET_REPORT
+   return &memcg->workingset_page_age_file;
+#else
+   return NULL;
+#endif
+}
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT0
@@ -1511,6 +1526,12 @@ void count_memcg_event_mm(struct mm_struct *mm, enum 
vm_event_item idx)
 static inline void split_page_memcg(struct page *head, int old_order, int 
new_order)
 {
 }
+
+static inline struct cgroup_file *
+mem_cgroup_page_age_file(struct mem_cgroup *memcg)
+{
+   return NULL;
+}
 #endif /* CONFIG_MEMCG */
 
 /*
diff --git a/include/linux/workingset_report.h 
b/include/linux/workingset_report.h
index 2ec8b927b200..616be6469768 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -9,6 +9,8 @@ struct mem_cgroup;
 struct pglist_data;
 struct node;
 struct lruvec;
+struct cgroup_file;
+struct wsr_state;
 
 #ifdef CONFIG_WORKINGSET_REPORT
 
@@ -40,7 +42,10 @@ struct wsr_state {
unsigned long report_threshold;
unsigned long refresh_interval;
 
-   struct kernfs_node *page_age_sys_file;
+   union {
+   struct kernfs_node *page_age_sys_file;
+   struct cgroup_file *page_age_cgroup_file;
+   };
 
/* breakdown of workingset by page age */
struct mutex page_age_lock;
@@ -60,6 +65,9 @@ void wsr_remove_sysfs(struct node *node);
  */
 bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
struct pglist_data *pgdat);
+
+int wsr_set_refresh_interval(struct wsr_state *wsr,
+unsigned long refresh_interval);
 #else
 static inline void wsr_init_lruvec(struct lruvec *lruvec)
 {
@@ -79,6 +87,11 @@ static inline void wsr_init_sysfs(struct node *node)
 static inline void wsr_remove_sysfs(struct node *node)
 {
 }
+static inline int wsr_set_refresh_interval(struct wsr_state *wsr,
+  unsigned long refresh_interval)
+{
+   return 0;
+}
 #endif /* CONFIG_WORKINGSET_REPORT */
 
 #endif /* _LINUX_WORKINGSET_REPORT_H */
diff --git a/mm/internal.h b/mm/internal.h
index 508b7d9937d6..50ca0c6e651c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -484,6 +484,8 @@ void set_task_reclaim_state(struct task_struct *task, 
struct reclaim_state *rs)

[PATCH v4 5/9] mm: add kernel aging thread for workingset reporting

2024-11-26 Thread Yuanchu Xie
For reliable and timely aging of memcgs, the page age histograms have to
be read on time. A kernel thread makes this easier by aging memcgs with a
valid refresh_interval when they can be refreshed, and also reduces the
latency for any userspace consumers of the page age histogram.

The kernel aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING.
Debugging stats may be added in the future for when aging cannot
keep up with the configured refresh_interval.

Signed-off-by: Yuanchu Xie 
---
 include/linux/workingset_report.h |  10 ++-
 mm/Kconfig|   6 ++
 mm/Makefile   |   1 +
 mm/memcontrol.c   |   2 +-
 mm/workingset_report.c|  13 ++-
 mm/workingset_report_aging.c  | 127 ++
 6 files changed, 154 insertions(+), 5 deletions(-)
 create mode 100644 mm/workingset_report_aging.c

diff --git a/include/linux/workingset_report.h 
b/include/linux/workingset_report.h
index 616be6469768..f6bbde2a04c3 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -64,7 +64,15 @@ void wsr_remove_sysfs(struct node *node);
  * The next refresh time is stored in refresh_time.
  */
 bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
-   struct pglist_data *pgdat);
+   struct pglist_data *pgdat, unsigned long *refresh_time);
+
+#ifdef CONFIG_WORKINGSET_REPORT_AGING
+void wsr_wakeup_aging_thread(void);
+#else /* CONFIG_WORKINGSET_REPORT_AGING */
+static inline void wsr_wakeup_aging_thread(void)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT_AGING */
 
 int wsr_set_refresh_interval(struct wsr_state *wsr,
 unsigned long refresh_interval);
diff --git a/mm/Kconfig b/mm/Kconfig
index be949786796d..a8def8c65610 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1310,6 +1310,12 @@ config WORKINGSET_REPORT
  This option exports stats and events giving the user more insight
  into its memory working set.
 
+config WORKINGSET_REPORT_AGING
+   bool "Workingset report kernel aging thread"
+   depends on WORKINGSET_REPORT
+   help
+ Performs aging on memcgs with their configured refresh intervals.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index f5ef0768253a..3a282510f960 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
+obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1032c6efc66..ea83f10b22a1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4462,7 +4462,7 @@ static int memory_ws_page_age_show(struct seq_file *m, 
void *v)
if (!READ_ONCE(wsr->page_age))
continue;
 
-   wsr_refresh_report(wsr, memcg, NODE_DATA(nid));
+   wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL);
mutex_lock(&wsr->page_age_lock);
if (!wsr->page_age)
goto unlock;
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 1e1bdb8bf75b..dad539e602bb 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -283,7 +283,7 @@ static void copy_node_bins(struct pglist_data *pgdat,
 }
 
 bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
-   struct pglist_data *pgdat)
+   struct pglist_data *pgdat, unsigned long *refresh_time)
 {
struct wsr_page_age_histo *page_age;
unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
@@ -300,10 +300,14 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct 
mem_cgroup *root,
goto unlock;
if (page_age->timestamp &&
time_is_after_jiffies(page_age->timestamp + refresh_interval))
-   goto unlock;
+   goto time;
refresh_scan(wsr, root, pgdat, refresh_interval);
copy_node_bins(pgdat, page_age);
refresh_aggregate(page_age, root, pgdat);
+
+time:
+   if (refresh_time)
+   *refresh_time = page_age->timestamp + refresh_interval;
 unlock:
mutex_unlock(&wsr->page_age_lock);
return !!page_age;
@@ -386,6 +390,9 @@ int wsr_set_refresh_interval(struct wsr_state *wsr,
WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(refresh_interval));
 unlock:
mutex_unlock(&wsr->page_age_lock);
+   if (!err && refresh_interval &&
+   (!old_interval || jiffies_to_msecs(old_interval) > 
refresh_interval))
+   wsr_wakeup_aging_thread();
return err;
 }
 
@@ -491,7 +498,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct 
kobj_attribute *attr,
int re

[PATCH v4 9/9] virtio-balloon: add workingset reporting

2024-11-26 Thread Yuanchu Xie
Ballooning is a way to dynamically size a VM, and it requires guest
collaboration. The amount to balloon without adversely affecting guest
performance is hard to compute without clear metrics from the guest.

Workingset reporting can provide guidance to the host to allow better
collaborative ballooning, such that the host balloon controller can
properly gauge the amount of memory the guest is actively using, i.e.,
the working set.

A draft QEMU series [1] is being worked on. Currently it is able to
configure the workingset reporting bins, refresh_interval, and report
threshold. Through QMP or HMP, a balloon controller can request a
workingset report. There is also a script [2] exercising the QMP
interface with a visual breakdown of the guest's workingset size.

According to the OASIS VIRTIO v1.3 spec, there's a new balloon device in
the works, and the one I'm adding to is the "traditional" balloon. If the
existing balloon device is not the right place for new features, I'm more
than happy to add them to the new one as well.

For technical details, this patch adds a generic mechanism to the
workingset reporting infrastructure to allow other parts of the kernel
to receive workingset reports. Two virtqueues are added to the
virtio-balloon device, notification_vq and report_vq. The notification
virtqueue allows the host to configure the guest workingset reporting
parameters and request a report. The report virtqueue sends a working
set report to the host when one is requested or due to memory pressure.

The workingset reporting feature is gated by the compilation flag
CONFIG_WORKINGSET_REPORT and the balloon feature flag
VIRTIO_BALLOON_F_WS_REPORTING.

[1] https://github.com/Dummyc0m/qemu/tree/wsr
[2] https://gist.github.com/Dummyc0m/d45b4e1b0dda8f2bc6cd8cfb37cc7e34

Signed-off-by: Yuanchu Xie 
---
 drivers/virtio/virtio_balloon.c | 390 +++-
 include/linux/balloon_compaction.h  |   1 +
 include/linux/mmzone.h  |   4 +
 include/linux/workingset_report.h   |  66 -
 include/uapi/linux/virtio_balloon.h |  30 +++
 mm/workingset_report.c  |  89 ++-
 6 files changed, 566 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index b36d2803674e..8eb300653dd8 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,8 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
VIRTIO_BALLOON_VQ_REPORTING,
+   VIRTIO_BALLOON_VQ_WORKING_SET,
+   VIRTIO_BALLOON_VQ_NOTIFY,
VIRTIO_BALLOON_VQ_MAX
 };
 
@@ -124,6 +127,23 @@ struct virtio_balloon {
spinlock_t wakeup_lock;
bool processing_wakeup_event;
u32 wakeup_signal_mask;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+   struct virtqueue *working_set_vq, *notification_vq;
+
+   /* Protects node_id, wsr_receiver, and report_buf */
+   struct mutex wsr_report_lock;
+   int wsr_node_id;
+   struct wsr_receiver wsr_receiver;
+   /* Buffer to report to host */
+   struct virtio_balloon_working_set_report *report_buf;
+
+   /* Buffer to hold incoming notification from the host. */
+   struct virtio_balloon_working_set_notify *notify_buf;
+
+   struct work_struct update_balloon_working_set_work;
+   struct work_struct update_balloon_notification_work;
+#endif
 };
 
 #define VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST (1 << 0)
@@ -339,8 +359,352 @@ static unsigned int leak_balloon(struct virtio_balloon 
*vb, size_t num)
return num_freed_pages;
 }
 
-static inline void update_stat(struct virtio_balloon *vb, int idx,
-  u16 tag, u64 val)
+#ifdef CONFIG_WORKINGSET_REPORT
+static bool wsr_is_configured(struct virtio_balloon *vb)
+{
+   if (node_online(READ_ONCE(vb->wsr_node_id)) &&
+   READ_ONCE(vb->wsr_receiver.wsr.refresh_interval) > 0 &&
+   READ_ONCE(vb->wsr_receiver.wsr.page_age) != NULL)
+   return true;
+   return false;
+}
+
+/* wsr_receiver callback */
+static void wsr_receiver_notify(struct wsr_receiver *receiver)
+{
+   int bin;
+   struct virtio_balloon *vb =
+   container_of(receiver, struct virtio_balloon, wsr_receiver);
+
+   /* if we fail to acquire the locks, send stale report */
+   if (!mutex_trylock(&vb->wsr_report_lock))
+   goto out;
+   if (!mutex_trylock(&receiver->wsr.page_age_lock))
+   goto out_unlock_report_buf;
+   if (!READ_ONCE(receiver->wsr.page_age))
+   goto out_unlock_page_age;
+
+   vb->report_buf->error = cpu_to_le32(0);
+   vb->report_buf->node_id = cpu_to_le32(vb->wsr_node_id);
+   for (bin = 0; bin < WORKINGSET_REPORT_MAX_NR_BINS; ++bin) {
+   struct virtio_balloon_working_set_report_bin

[PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces

2024-11-26 Thread Yuanchu Xie
Add workingset reporting documentation for better discoverability of
its sysfs and memcg interfaces. Also document the required kernel
config to enable workingset reporting.

Signed-off-by: Yuanchu Xie 
---
 Documentation/admin-guide/mm/index.rst|   1 +
 .../admin-guide/mm/workingset_report.rst  | 105 ++
 2 files changed, 106 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/workingset_report.rst

diff --git a/Documentation/admin-guide/mm/index.rst 
b/Documentation/admin-guide/mm/index.rst
index 8b35795b664b..61a2a347fc91 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -41,4 +41,5 @@ the Linux memory management.
swap_numa
transhuge
userfaultfd
+   workingset_report
zswap
diff --git a/Documentation/admin-guide/mm/workingset_report.rst 
b/Documentation/admin-guide/mm/workingset_report.rst
new file mode 100644
index ..0969513705c4
--- /dev/null
+++ b/Documentation/admin-guide/mm/workingset_report.rst
@@ -0,0 +1,105 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=
+Workingset Report
+=
+Workingset report provides a view of memory coldness in user-defined
+time intervals, e.g. X bytes are Y milliseconds cold. It breaks down
+the user pages in the system per-NUMA node, per-memcg, for both
+anonymous and file pages into histograms that look like:
+::
+
+1000 anon=137368 file=24530
+2 anon=34342 file=0
+3 anon=353232 file=333608
+4 anon=407198 file=206052
+9223372036854775807 anon=4925624 file=892892
+
+The workingset reports can be used to drive proactive reclaim, by
+identifying the number of cold bytes in a memcg, then writing to
+``memory.reclaim``.
+
+Quick start
+===
+Build the kernel with the following configurations. The report relies
+on Multi-gen LRU for page coldness.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+* ``CONFIG_WORKINGSET_REPORT=y``
+
+Optionally, the aging kernel daemon can be enabled with the following
+configuration:
+
+* ``CONFIG_WORKINGSET_REPORT_AGING=y``
+
+Sysfs interfaces
+
+``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides
+a per-node page age histogram, showing an aggregate of the node's lruvecs.
+Reading this file causes a hierarchical aging of all lruvecs, scanning
+pages and creating a new Multi-gen LRU generation in each lruvec.
+For example:
+::
+
+1000 anon=0 file=0
+2000 anon=0 file=0
+10 anon=5533696 file=5566464
+18446744073709551615 anon=0 file=0
+
+``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals``
+is a comma-separated list of time in milliseconds that configures what
+the page age histogram uses for aggregation. For the above histogram,
+the intervals are::
+
+1000,2000,10
+
+``/sys/devices/system/node/nodeX/workingset_report/refresh_interval``
+defines the amount of time the report is valid for in milliseconds.
+When a report is still valid, reading the ``page_age`` file shows
+the existing valid report, instead of generating a new one.
+
+``/sys/devices/system/node/nodeX/workingset_report/report_threshold``
+specifies how often the userspace agent can be notified of node
+memory pressure, in milliseconds. When a node reaches its low
+watermarks and wakes up kswapd, programs waiting on ``page_age`` are
+woken up so they can read the histogram and make policy decisions.
+
+Memcg interface
+===
+While ``page_age_interval`` is defined per-node in sysfs, ``page_age``,
+``refresh_interval`` and ``report_threshold`` are available per-memcg.
+
+``/sys/fs/cgroup/.../memory.workingset.page_age``
+The memcg equivalent of the sysfs workingset page age histogram
+breaks down the workingset of this memcg and its children into
+page age intervals. Each node is prefixed with a node header and
+a newline. Non-proactive direct reclaim on this memcg can also
+wake up userspace agents that are waiting on this file.
+E.g.
+::
+
+N0
+1000 anon=0 file=0
+2000 anon=0 file=0
+3000 anon=0 file=0
+4000 anon=0 file=0
+5000 anon=0 file=0
+18446744073709551615 anon=0 file=0
+
+``/sys/fs/cgroup/.../memory.workingset.refresh_interval``
+The memcg equivalent of the sysfs refresh interval. A per-node
+value specifying how long a page age histogram remains valid, in
+milliseconds.
+E.g.
+::
+
+echo N0=2000 > memory.workingset.refresh_interval
+
+``/sys/fs/cgroup/.../memory.workingset.report_threshold``
+The memcg equivalent of the sysfs report threshold. A per-node
+value specifying how often a userspace agent waiting on the page
+age histogram can be woken up, in milliseconds.
+E.g.
+::
+
+echo N0=1000 > memory.workingset.report_threshold
-- 
2.47.0.338.g60cca15819-goog




[PATCH v4 8/9] Docs/admin-guide/cgroup-v2: document workingset reporting

2024-11-26 Thread Yuanchu Xie
Add workingset reporting documentation for better discoverability of
its memcg interfaces. Point the memcg documentation to
Documentation/admin-guide/mm/workingset_report.rst for more details.

Signed-off-by: Yuanchu Xie 
---
 Documentation/admin-guide/cgroup-v2.rst | 35 +
 1 file changed, 35 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 2cb58daf3089..67a183f08245 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1784,6 +1784,41 @@ The following nested keys are defined.
Shows pressure stall information for memory. See
:ref:`Documentation/accounting/psi.rst ` for details.
 
+  memory.workingset.page_age
+   A read-only histogram which exists on non-root cgroups.
+
+   This breaks down the cgroup's memory footprint into different
+   types of memory and groups them per-node into user-defined coldness
+   bins.
+
+   The output format of memory.workingset.page_age is::
+
+ N0
+  type=
+  type=
+ ...
+ 18446744073709551615 type=
+
+   The type of memory can be anon, file, or new types added later.
+   Don't rely on the types remaining fixed.  See
+   :ref:`Documentation/admin-guide/mm/workingset_report.rst 
`
+   for details.
+
+  memory.workingset.refresh_interval
+   A read-write nested-keyed file which exists on non-root cgroups.
+
+   Setting it to a non-zero value for any node enables working set
+   reporting for that node.  The default is 0 for each node.   See
+   :ref:`Documentation/admin-guide/mm/workingset_report.rst 
`
+   for details.
+
+  memory.workingset.report_threshold
+   A read-write nested-keyed file which exists on non-root cgroups.
+
+   The number of milliseconds to wait before reporting the working
+   set again.  The default is 0 for each node.  See
+   :ref:`Documentation/admin-guide/mm/workingset_report.rst 
`
+   for details.
 
 Usage Guidelines
 
-- 
2.47.0.338.g60cca15819-goog




[PATCH v4 0/9] mm: workingset reporting

2024-11-26 Thread Yuanchu Xie
This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or userspace.
Another interesting idea might be hugepage workingset, so that we can
measure the proportion of hugepages backing cold memory. However, with
architectures like arm, there may be too many hugepage sizes leading to
a combinatorial explosion when exporting stats to the userspace.
Nonetheless, the kernel should provide a set of workingset interfaces
that is generic enough to accommodate the various use cases, and extensible
to potential future use cases.

Use cases
==
Job scheduling
On overcommitted hosts, workingset information improves efficiency and
reliability by allowing the job scheduler to have better stats on the
exact memory requirements of each job. This can manifest in efficiency by
landing more jobs on the same host or NUMA node. On the other hand, the
job scheduler can also ensure each node has a sufficient amount of memory
and does not enter direct reclaim or the kernel OOM path. With workingset
information and job priority, the userspace OOM killing or proactive
reclaim policy can kick in before the system is under memory pressure.
If the job shape is very different from the machine shape, knowing the
workingset per-node can also help inform page allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.
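A proactive reclaim policy of this kind boils down to summing the cold tail
of the per-memcg histogram and writing that many bytes to ``memory.reclaim``.
The helper below sketches only the parsing and summing step; the histogram
line format is taken from the examples in this series, the reclaim write is
omitted, and the function itself is illustrative, not part of the patches.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum the anon+file bytes in all page_age bins whose age label is at
 * least @min_age_ms. Histogram lines look like
 * "1000 anon=137368 file=24530"; node header lines like "N0" are
 * skipped because they fail to parse. Illustrative helper only.
 */
static unsigned long long cold_bytes(const char *histo,
				     unsigned long long min_age_ms)
{
	unsigned long long total = 0, age, anon, file;
	const char *line = histo;

	while (line && *line) {
		if (sscanf(line, "%llu anon=%llu file=%llu",
			   &age, &anon, &file) == 3 && age >= min_age_ms)
			total += anon + file;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return total;
}
```

A policy daemon would then write the result (or a fraction of it) to the
memcg's ``memory.reclaim`` file.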

Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to report
the guest workingset.
Balloon policies benefit from workingset to more precisely determine the
size of the memory balloon. On end-user devices where memory is scarce and
overcommitted, the balloon sizing in multiple VMs running on the same
device can be orchestrated with workingset reports from each one.
On the server side, workingset reporting allows the balloon controller to
inflate the balloon without causing too much file cache to be reclaimed in
the guest.

Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, consider a promotion hot page threshold defined as a reaccess
distance of N seconds (promote pages accessed more often than every N
seconds). The threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node pass the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].

[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
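The ~80% threshold calculation above can be sketched as a cumulative sum over
the histogram bins: pick the smallest bin bound N such that the bins up to
and including N hold at least the target share of bytes. The function and the
percentages are illustrative assumptions, not part of this series.

```c
#include <stddef.h>

/*
 * bounds[] are page age bin upper bounds in ms (ascending), bytes[] the
 * bytes in each bin. Returns the smallest bound N such that the bins up
 * to and including N hold at least @pct percent of all bytes, i.e. the
 * reaccess-distance threshold covering ~pct% of the node's pages.
 */
static unsigned long long
promotion_threshold(const unsigned long long *bounds,
		    const unsigned long long *bytes, size_t n, unsigned pct)
{
	unsigned long long total = 0, cum = 0;
	size_t i;

	if (!n)
		return 0;
	for (i = 0; i < n; i++)
		total += bytes[i];
	for (i = 0; i < n; i++) {
		cum += bytes[i];
		if (cum * 100 >= total * pct)
			return bounds[i];
	}
	return bounds[n - 1];
}
```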

Sysfs and Cgroup Interfaces
==
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.

1000 anon=137368 file=24530
2 anon=34342 file=0
3 anon=353232 file=333608
4 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

Implementation
==
The reporting of user pages is based on MGLRU, and therefore requires
CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
fine-grained workingset report, but we can already gather a lot of data
with just four generations. The workingset reporting mechanism is gated
behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==
Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
compile and redis benchmarks from openbenchmarking.org. The policy and
runner are referred to as WMO (Workload Memory Optimization).
The results were based on v3 of the series, but v4 doesn't change the core
of the working set reporting and just adds the ballooning counterpart.

The timed Linux kernel compilation benchmark shows improvements in peak
memory usage with a policy of "swap out all bytes colder than 10 seconds
every 40 seconds". A swapfile is configured on SSD.

peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%

Benchmark   | Experimental |Control 
| Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.48

Re: [PATCH v6 00/15] integrity: Introduce the Integrity Digest Cache

2024-11-26 Thread Roberto Sassu
On Tue, 2024-11-26 at 00:13 +, Eric Snowberg wrote:
> 
> > On Nov 19, 2024, at 3:49 AM, Roberto Sassu  
> > wrote:
> > 
> > From: Roberto Sassu 
> > 
> > The Integrity Digest Cache can also help IMA for appraisal. IMA can simply
> > lookup the calculated digest of an accessed file in the list of digests
> > extracted from package headers, after verifying the header signature. It is
> > sufficient to verify only one signature for all files in the package, as
> > opposed to verifying a signature for each file.
> 
> Is there a way to maintain integrity over time?  Today if a CVE is discovered 
> in a signed program, the program hash can be added to the blacklist keyring. 
> Later if IMA appraisal is used, the signature validation will fail just for 
> that 
> program.  With the Integrity Digest Cache, is there a way to do this?  

As far as I can see, the ima_check_blacklist() call is before
ima_appraise_measurement(). If it fails, appraisal with the Integrity
Digest Cache will not be done.

In the future, we might use the Integrity Digest Cache for blacklists
too. Since a digest cache is reset on a file/directory change, IMA
would have to revalidate the program digest against a new digest cache.

Thanks

Roberto




Re: [PATCH v6 07/15] digest_cache: Allow registration of digest list parsers

2024-11-26 Thread Roberto Sassu
On Mon, 2024-11-25 at 15:53 -0800, Luis Chamberlain wrote:
> On Tue, Nov 19, 2024 at 11:49:14AM +0100, Roberto Sassu wrote:
> > From: Roberto Sassu 
> > Introduce load_parser() to load a kernel module containing a
> > parser for the requested digest list format (compressed kernel modules are
> > supported). Kernel modules are searched in the
> > /lib/modules//security/integrity/digest_cache directory.
> > 
> > load_parser() calls ksys_finit_module() to load a kernel module directly
> > from the kernel. request_module() cannot be used at this point, since the
> > reference digests of modprobe and the linked libraries (required for IMA
> > appraisal) might not be yet available, resulting in modprobe execution
> > being denied.
> 
> You are doing a full solution implementation of loading modules in-kernel.
> Appraisals of modules is just part of the boot process, some module
> loading may need firmware to loading to get some functinality to work
> for example some firmware to get a network device up or a GPU driver.
> So module loading alone is not the only thing which may require
> IMA appraisal, and this solution only addresses modules. There are other
> things which may be needed other than firmware, eBPF programs are
> another example.

Firmware, eBPF programs and so on are supposed to be verified with
digest lists (or alternative methods, such as file signatures), once
the required digest list parsers are loaded.

The parser is an exceptional case, because user space cannot be
executed at this point. Once the parsers are loaded, verification of
everything else proceeds as normal. Fortunately, in most cases kernel
modules are signed, so digest lists are not required to verify them.

> It sounds more like you want to provide or extend LSM hooks fit your
> architecture and make kernel_read_file() LSM hooks optionally use it to
> fit this model.

As far as the LSM infrastructure is concerned, I'm not adding new LSM
hooks, nor extending/modifying the existing ones. The operations the
Integrity Digest Cache is doing match the usage expectation by LSM (not
denying access, as discussed with Paul Moore).

The Integrity Digest Cache is supposed to be used as a supporting tool
for other LSMs to do regular access control based on file data and
metadata integrity. In doing that, it still needs the LSM
infrastructure to notify about filesystem changes, and to store
additional information in the inode and file descriptor security blobs.

The kernel_post_read_file LSM hook should be implemented by another LSM
to verify the integrity of a digest list, when the Integrity Digest
Cache calls kernel_read_file() to read that digest list. That LSM is
also responsible to provide the result of the integrity verification to
the Integrity Digest Cache, so that the latter can give this
information back to whoever wants to do a digest lookup from that
digest list and also wants to know whether or not the digest list was
authentic.

> Because this is just for a *phase* in boot, which you've caught because
> a catch-22 situaton, where you didn't have your parsers loaded. Which is
> just a reflection that you hit that snag. It doesn't prove all snags
> will be caught yet.

Yes, that didn't happen earlier, since all the parsers were compiled
built-in in the kernel. The Integrity Digest Cache already has a
deadlock avoidance mechanism for digest lists.

Supporting kernel modules opened the road for new deadlocks, since one
can ask a digest list to verify a kernel module, but that digest list
requires the same kernel module. That is why the in-kernel mechanism is
100% reliable, because the Integrity Digest Cache marks the file
descriptors it opens, and can recognize them, when those file
descriptors are passed back to it by other LSMs (e.g. through the
kernel_post_read_file LSM hook).

> And you only want to rely on this .. in-kernel loading solution only
> early on boot, is there a way to change this over to enable regular
> operation later?

User space can voluntarily load new digest list parsers, but the
Integrity Digest Cache cannot rely on it to be done. Also, using
request_module() does not seem like a good idea, since it wouldn't allow the
Integrity Digest Cache to mark the file descriptor of kernel modules,
and thus the Integrity Digest Cache could not determine whether or not
a deadlock is happening.

Thanks

Roberto