Re: [PATCH v6 07/15] digest_cache: Allow registration of digest list parsers
On Tue, Nov 26, 2024 at 11:25:07AM +0100, Roberto Sassu wrote:
> On Mon, 2024-11-25 at 15:53 -0800, Luis Chamberlain wrote:
> > Firmware, eBPF programs and so on are supposed

Keyword: "supposed".

> As far as the LSM infrastructure is concerned, I'm not adding new LSM
> hooks, nor extending/modifying the existing ones. The operations the
> Integrity Digest Cache is doing match the usage expectation by LSM (net
> denying access, as discussed with Paul Moore).

If modules are the only proven exception to your security model, you are
not making the case for it clearly.

> The Integrity Digest Cache is supposed to be used as a supporting tool
> for other LSMs to do regular access control based on file data and
> metadata integrity. In doing that, it still needs the LSM
> infrastructure to notify about filesystem changes, and to store
> additional information in the inode and file descriptor security blobs.
> 
> The kernel_post_read_file LSM hook should be implemented by another LSM
> to verify the integrity of a digest list, when the Integrity Digest
> Cache calls kernel_read_file() to read that digest list.

If LSM folks *do* agree that this work is *supplementing* LSMs then sure,
it was not clear from the commit logs. But then you need to ensure the
parsers are special snowflakes which won't ever incur other additional
kernel_read_file() calls.

> Supporting kernel modules opened the road for new deadlocks, since one
> can ask a digest list to verify a kernel module, but that digest list
> requires the same kernel module. That is why the in-kernel mechanism is
> 100% reliable,

Are users of this infrastructure really in need of modules for these
parsers?

  Luis
Re: [PATCH] docs: media: document media multi-committers rules and process
Hi Mauro and Hans, On Mon, Nov 25, 2024 at 02:28:58PM +0100, Mauro Carvalho Chehab wrote: > As the media subsystem will experiment with a multi-committers model, > update the Maintainer's entry profile to the new rules, and add a file > documenting the process to become a committer and to maintain such > rights. > > Signed-off-by: Mauro Carvalho Chehab > Signed-off-by: Hans Verkuil > --- > Documentation/driver-api/media/index.rst | 1 + > .../media/maintainer-entry-profile.rst| 193 ++ > .../driver-api/media/media-committer.rst | 252 ++ > .../process/maintainer-pgp-guide.rst | 2 + > 4 files changed, 398 insertions(+), 50 deletions(-) > create mode 100644 Documentation/driver-api/media/media-committer.rst > > diff --git a/Documentation/driver-api/media/index.rst > b/Documentation/driver-api/media/index.rst > index d5593182a3f9..d0c725fcbc67 100644 > --- a/Documentation/driver-api/media/index.rst > +++ b/Documentation/driver-api/media/index.rst > @@ -26,6 +26,7 @@ Documentation/userspace-api/media/index.rst > :numbered: > > maintainer-entry-profile > +media-committer > > v4l2-core > dtv-core > diff --git a/Documentation/driver-api/media/maintainer-entry-profile.rst > b/Documentation/driver-api/media/maintainer-entry-profile.rst > index ffc712a5f632..90c6c0d9cf17 100644 > --- a/Documentation/driver-api/media/maintainer-entry-profile.rst > +++ b/Documentation/driver-api/media/maintainer-entry-profile.rst > @@ -27,19 +27,128 @@ It covers, mainly, the contents of those directories: > Both media userspace and Kernel APIs are documented and the documentation > must be kept in sync with the API changes. It means that all patches that > add new features to the subsystem must also bring changes to the > -corresponding API files. > +corresponding API documentation files. I would have split this kind of small changes to a separate patch to make reviews easier, but that's not a big deal. > > -Due to the size and wide scope of the media subsystem, media's > -maintainership model is to have sub-maintainers that have a broad > -knowledge of a specific aspect of the subsystem. It is the sub-maintainers' > -task to review the patches, providing feedback to users if the patches are > +Due to the size and wide scope of the media subsystem, the media's > +maintainership model is to have committers that have a broad knowledge of > +a specific aspect of the subsystem. It is the committers' task to > +review the patches, providing feedback to users if the patches are > following the subsystem rules and are properly using the media kernel and > userspace APIs. This sounds really like a maintainer definition. I won't bikeshed too much on the wording though, we will always be able to adjust it later to reflect the reality of the situation as it evolves. I do like the removal of the "sub-maintainer" term though, as I've always found it demeaning. > > -Patches for the media subsystem must be sent to the media mailing list > -at linux-me...@vger.kernel.org as plain text only e-mail. Emails with > -HTML will be automatically rejected by the mail server. It could be wise > -to also copy the sub-maintainer(s). > +Media committers > + > + > +In the media subsystem, there are experienced developers that can commit s/that/who/ s/commit/push/ to standardize the vocabulary (below you use "upload" to mean the same thing) > +patches directly on a development tree. 
These developers are called s/on a/to the/ > +Media committers and are divided into the following categories: > + > +- Committers: responsible for one or more drivers within the media subsystem. > + They can upload changes to the tree that do not affect the core or ABI. s/upload/push/ > + > +- Core committers: responsible for part of the media core. They are typically > + responsible for one or more drivers within the media subsystem, but, > besides > + that, they can also merge patches that change the code common to multiple > + drivers, including the kernel internal API/ABI. I would write "API" only here. Neither the kernel internal API nor its internal ABI are stable, and given that lack of stability, the ABI concept doesn't really apply within the kernel. > + > +- Subsystem maintainers: responsible for the subsystem as a whole, with > + access to the entire subsystem. > + > + Only subsystem maintainers can change the userspace API/ABI. This can give the impression that only subsystem maintainers are allowed to work on the API. I would write Only subsystem maintainers change push changes that affect the userspace API/ABI. > + > +Media committers shall explicitly agree with the Kernel development process Do we have to capitalize "Kernel" everywhere ? There are way more occurrences of "kernel" than "Kernel" in Documentation/ (even excluding the lower case occurrences in e-mail addresses, file paths, ...). > +as described at Documentation
Re: [PATCH v4 0/9] mm: workingset reporting
On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote: > This patch series provides workingset reporting of user pages in > lruvecs, of which coldness can be tracked by accessed bits and fd > references. However, the concept of workingset applies generically to > all types of memory, which could be kernel slab caches, discardable > userspace caches (databases), or CXL.mem. Therefore, data sources might > come from slab shrinkers, device drivers, or the userspace. > Another interesting idea might be hugepage workingset, so that we can > measure the proportion of hugepages backing cold memory. However, with > architectures like arm, there may be too many hugepage sizes leading to > a combinatorial explosion when exporting stats to the userspace. > Nonetheless, the kernel should provide a set of workingset interfaces > that is generic enough to accommodate the various use cases, and extensible > to potential future use cases. Doesn't DAMON already provide this information? CCing SJ. > Use cases > == > Job scheduling > On overcommitted hosts, workingset information improves efficiency and > reliability by allowing the job scheduler to have better stats on the > exact memory requirements of each job. This can manifest in efficiency by > landing more jobs on the same host or NUMA node. On the other hand, the > job scheduler can also ensure each node has a sufficient amount of memory > and does not enter direct reclaim or the kernel OOM path. With workingset > information and job priority, the userspace OOM killing or proactive > reclaim policy can kick in before the system is under memory pressure. > If the job shape is very different from the machine shape, knowing the > workingset per-node can also help inform page allocation policies. > > Proactive reclaim > Workingset information allows the a container manager to proactively > reclaim memory while not impacting a job's performance. While PSI may > provide a reactive measure of when a proactive reclaim has reclaimed too > much, workingset reporting allows the policy to be more accurate and > flexible. I'm not sure about more accurate. Access frequency is only half the picture. Whether you need to keep memory with a given frequency resident depends on the speed of the backing device. There is memory compression; there is swap on flash; swap on crappy flash; swapfiles that share IOPS with co-located filesystems. There is zswap+writeback, where avg refault speed can vary dramatically. You can of course offload much more to a fast zswap backend than to a swapfile on a struggling flashdrive, with comparable app performance. So I think you'd be hard pressed to achieve a high level of accuracy in the usecases you list without taking the (often highly dynamic) cost of paging / memory transfer into account. There is a more detailed discussion of this in a paper we wrote on proactive reclaim/offloading - in 2.5 Hardware Heterogeneity: https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf > Ballooning (similar to proactive reclaim) > The last patch of the series extends the virtio-balloon device to report > the guest workingset. > Balloon policies benefit from workingset to more precisely determine the > size of the memory balloon. On end-user devices where memory is scarce and > overcommitted, the balloon sizing in multiple VMs running on the same > device can be orchestrated with workingset reports from each one. 
> On the server side, workingset reporting allows the balloon controller to > inflate the balloon without causing too much file cache to be reclaimed in > the guest. > > Promotion/Demotion > If different mechanisms are used for promition and demotion, workingset > information can help connect the two and avoid pages being migrated back > and forth. > For example, given a promotion hot page threshold defined in reaccess > distance of N seconds (promote pages accessed more often than every N > seconds). The threshold N should be set so that ~80% (e.g.) of pages on > the fast memory node passes the threshold. This calculation can be done > with workingset reports. > To be directly useful for promotion policies, the workingset report > interfaces need to be extended to report hotness and gather hotness > information from the devices[1]. > > [1] > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1 > > Sysfs and Cgroup Interfaces > == > The interfaces are detailed in the patches that introduce them. The main > idea here is we break down the workingset per-node per-memcg into time > intervals (ms), e.g. > > 1000 anon=137368 file=24530 > 2 anon=34342 file=0 > 3 anon=353232 file=333608 > 4 anon=407198 file=206052 > 9223372036854775807 anon=4925624 file=892892 > > Implementation > == > The reporting of user pages is based off of MGLRU, and therefore requires > CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more > fine-grained workingset rep
Re: [PATCH v4 1/9] mm: aggregate workingset information into histograms
On Tue, Nov 26, 2024 at 06:57:20PM -0800, Yuanchu Xie wrote: > diff --git a/mm/internal.h b/mm/internal.h > index 64c2eb0b160e..bbd3c1501bac 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -470,9 +470,14 @@ extern unsigned long highest_memmap_pfn; > /* > * in mm/vmscan.c: > */ > +struct scan_control; > +bool isolate_lru_page(struct page *page); Is this a mismerge? It doesn't exist any more.
[PATCH v4 1/9] mm: aggregate workingset information into histograms
Hierarchically aggregate all memcgs' MGLRU generations and their page counts into working set page age histograms. The histograms break down the system's workingset per-node, per-anon/file. The sysfs interfaces are as follows: /sys/devices/system/node/nodeX/workingset_report/page_age A per-node page age histogram, showing an aggregate of the node's lruvecs. The information is extracted from MGLRU's per-generation page counters. Reading this file causes a hierarchical aging of all lruvecs, scanning pages and creates a new generation in each lruvec. For example: 1000 anon=0 file=0 2000 anon=0 file=0 10 anon=5533696 file=5566464 18446744073709551615 anon=0 file=0 /sys/devices/system/node/nodeX/workingset_report/page_age_interval A comma separated list of time in milliseconds that configures what the page age histogram uses for aggregation. Signed-off-by: Yuanchu Xie --- drivers/base/node.c | 6 + include/linux/mmzone.h| 9 + include/linux/workingset_report.h | 79 ++ mm/Kconfig| 9 + mm/Makefile | 1 + mm/internal.h | 5 + mm/memcontrol.c | 2 + mm/mm_init.c | 2 + mm/mmzone.c | 2 + mm/vmscan.c | 10 +- mm/workingset_report.c| 451 ++ 11 files changed, 572 insertions(+), 4 deletions(-) create mode 100644 include/linux/workingset_report.h create mode 100644 mm/workingset_report.c diff --git a/drivers/base/node.c b/drivers/base/node.c index eb72580288e6..ba5b8720dbfa 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -20,6 +20,8 @@ #include #include #include +#include +#include static const struct bus_type node_subsys = { .name = "node", @@ -626,6 +628,7 @@ static int register_node(struct node *node, int num) } else { hugetlb_register_node(node); compaction_register_node(node); + wsr_init_sysfs(node); } return error; @@ -642,6 +645,9 @@ void unregister_node(struct node *node) { hugetlb_unregister_node(node); compaction_unregister_node(node); + wsr_remove_sysfs(node); + wsr_destroy_lruvec(mem_cgroup_lruvec(NULL, NODE_DATA(node->dev.id))); + wsr_destroy_pgdat(NODE_DATA(node->dev.id)); node_remove_accesses(node); node_remove_caches(node); device_unregister(&node->dev); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 80bc5640bb60..ee728c0c5a3b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -24,6 +24,7 @@ #include #include #include +#include /* Free memory management - zoned buddy allocator. 
*/ #ifndef CONFIG_ARCH_FORCE_MAX_ORDER @@ -630,6 +631,9 @@ struct lruvec { struct lru_gen_mm_state mm_state; #endif #endif /* CONFIG_LRU_GEN */ +#ifdef CONFIG_WORKINGSET_REPORT + struct wsr_statewsr; +#endif /* CONFIG_WORKINGSET_REPORT */ #ifdef CONFIG_MEMCG struct pglist_data *pgdat; #endif @@ -1424,6 +1428,11 @@ typedef struct pglist_data { struct lru_gen_memcg memcg_lru; #endif +#ifdef CONFIG_WORKINGSET_REPORT + struct mutex wsr_update_mutex; + struct wsr_report_bins __rcu *wsr_page_age_bins; +#endif + CACHELINE_PADDING(_pad2_); /* Per-node vmstats */ diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h new file mode 100644 index ..d7c2ee14ec87 --- /dev/null +++ b/include/linux/workingset_report.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_WORKINGSET_REPORT_H +#define _LINUX_WORKINGSET_REPORT_H + +#include +#include + +struct mem_cgroup; +struct pglist_data; +struct node; +struct lruvec; + +#ifdef CONFIG_WORKINGSET_REPORT + +#define WORKINGSET_REPORT_MIN_NR_BINS 2 +#define WORKINGSET_REPORT_MAX_NR_BINS 32 + +#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1) +#define ANON_AND_FILE 2 + +struct wsr_report_bin { + unsigned long idle_age; + unsigned long nr_pages[ANON_AND_FILE]; +}; + +struct wsr_report_bins { + /* excludes the WORKINGSET_INTERVAL_MAX bin */ + unsigned long nr_bins; + /* last bin contains WORKINGSET_INTERVAL_MAX */ + unsigned long idle_age[WORKINGSET_REPORT_MAX_NR_BINS]; + struct rcu_head rcu; +}; + +struct wsr_page_age_histo { + unsigned long timestamp; + struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS]; +}; + +struct wsr_state { + /* breakdown of workingset by page age */ + struct mutex page_age_lock; + struct wsr_page_age_histo *page_age; +}; + +void wsr_init_lruvec(struct lruvec *lruvec); +void wsr_destroy_lruvec(struct lruvec *lruvec); +void wsr_init_pgdat(struct pglist_data *pgdat); +void wsr_destroy_pgdat(struct pglist_dat
[PATCH v4 2/9] mm: use refresh interval to rate-limit workingset report aggregation
The refresh interval is a rate limiting factor to workingset page age histogram reads. When a workingset report is generated, the oldest timestamp of all the lruvecs is stored as the timestamp of the report. The same report will be read until the report expires beyond the refresh interval, at which point a new report is generated. Sysfs interface /sys/devices/system/node/nodeX/workingset_report/refresh_interval time in milliseconds specifying how long the report is valid for Signed-off-by: Yuanchu Xie --- include/linux/workingset_report.h | 1 + mm/workingset_report.c| 101 -- 2 files changed, 83 insertions(+), 19 deletions(-) diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index d7c2ee14ec87..8bae6a600410 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -37,6 +37,7 @@ struct wsr_page_age_histo { }; struct wsr_state { + unsigned long refresh_interval; /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index a4dcf62fcd96..8678536ccfc7 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -174,9 +174,11 @@ static void collect_page_age_type(const struct lru_gen_folio *lrugen, * Assume the heuristic that pages are in the MGLRU generation * through uniform accesses, so we can aggregate them * proportionally into bins. + * + * Returns the timestamp of the youngest gen in this lruvec. */ -static void collect_page_age(struct wsr_page_age_histo *page_age, -const struct lruvec *lruvec) +static unsigned long collect_page_age(struct wsr_page_age_histo *page_age, + const struct lruvec *lruvec) { int type; const struct lru_gen_folio *lrugen = &lruvec->lrugen; @@ -191,11 +193,14 @@ static void collect_page_age(struct wsr_page_age_histo *page_age, for (type = 0; type < ANON_AND_FILE; type++) collect_page_age_type(lrugen, bin, max_seq, min_seq[type], curr_timestamp, type); + + return READ_ONCE(lruvec->lrugen.timestamps[lru_gen_from_seq(max_seq)]); } /* First step: hierarchically scan child memcgs. */ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, -struct pglist_data *pgdat) +struct pglist_data *pgdat, +unsigned long refresh_interval) { struct mem_cgroup *memcg; unsigned int flags; @@ -208,12 +213,15 @@ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root, do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq); + int gen = lru_gen_from_seq(max_seq); + unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]); /* * setting can_swap=true and force_scan=true ensures * proper workingset stats when the system cannot swap. 
*/ - try_to_inc_max_seq(lruvec, max_seq, true, true); + if (time_is_before_jiffies(birth + refresh_interval)) + try_to_inc_max_seq(lruvec, max_seq, true, true); cond_resched(); } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); @@ -228,6 +236,7 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age, { struct mem_cgroup *memcg; struct wsr_report_bin *bin; + unsigned long oldest_lruvec_time = jiffies; for (bin = page_age->bins; bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) { @@ -241,11 +250,15 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age, memcg = mem_cgroup_iter(root, NULL, NULL); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); + unsigned long lruvec_time = + collect_page_age(page_age, lruvec); + + if (time_before(lruvec_time, oldest_lruvec_time)) + oldest_lruvec_time = lruvec_time; - collect_page_age(page_age, lruvec); cond_resched(); } while ((memcg = mem_cgroup_iter(root, memcg, NULL))); - WRITE_ONCE(page_age->timestamp, jiffies); + WRITE_ONCE(page_age->timestamp, oldest_lruvec_time); } static void copy_node_bins(struct pglist_data *pgdat, @@ -270,17 +283,25 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, struct pglist_data *pgdat) { struct wsr_page_age_histo *page_age; + unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval); if (!READ_ONCE(wsr->page_age)) return false; -
[PATCH v4 3/9] mm: report workingset during memory pressure driven scanning
When a node reaches its low watermarks and wakes up kswapd, notify all userspace programs waiting on the workingset page age histogram of the memory pressure, so a userspace agent can read the workingset report in time and make policy decisions, such as logging, oom-killing, or migration. Sysfs interface: /sys/devices/system/node/nodeX/workingset_report/report_threshold time in milliseconds that specifies how often the userspace agent can be notified for node memory pressure. Signed-off-by: Yuanchu Xie --- include/linux/workingset_report.h | 4 +++ mm/internal.h | 12 mm/vmscan.c | 46 +++ mm/workingset_report.c| 43 - 4 files changed, 104 insertions(+), 1 deletion(-) diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 8bae6a600410..2ec8b927b200 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -37,7 +37,11 @@ struct wsr_page_age_histo { }; struct wsr_state { + unsigned long report_threshold; unsigned long refresh_interval; + + struct kernfs_node *page_age_sys_file; + /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; diff --git a/mm/internal.h b/mm/internal.h index bbd3c1501bac..508b7d9937d6 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -479,6 +479,18 @@ bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq, bool can_swap, bool force_scan); void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs); +#ifdef CONFIG_WORKINGSET_REPORT +/* + * in mm/wsr.c + */ +void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); +#else +static inline void notify_workingset(struct mem_cgroup *memcg, +struct pglist_data *pgdat) +{ +} +#endif + /* * in mm/rmap.c: */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 89da4d8dfb5f..2bca81271d15 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2578,6 +2578,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat, return can_demote(pgdat->node_id, sc); } +#ifdef CONFIG_WORKINGSET_REPORT +static void try_to_report_workingset(struct pglist_data *pgdat, struct scan_control *sc); +#else +static inline void try_to_report_workingset(struct pglist_data *pgdat, + struct scan_control *sc) +{ +} +#endif + #ifdef CONFIG_LRU_GEN #ifdef CONFIG_LRU_GEN_ENABLED @@ -4004,6 +4013,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) set_initial_priority(pgdat, sc); + try_to_report_workingset(pgdat, sc); + memcg = mem_cgroup_iter(NULL, NULL, NULL); do { struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); @@ -5649,6 +5660,38 @@ static int __init init_lru_gen(void) }; late_initcall(init_lru_gen); +#ifdef CONFIG_WORKINGSET_REPORT +static void try_to_report_workingset(struct pglist_data *pgdat, +struct scan_control *sc) +{ + struct mem_cgroup *memcg = sc->target_mem_cgroup; + struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; + unsigned long threshold = READ_ONCE(wsr->report_threshold); + + if (sc->priority == DEF_PRIORITY) + return; + + if (!threshold) + return; + + if (!mutex_trylock(&wsr->page_age_lock)) + return; + + if (!wsr->page_age) { + mutex_unlock(&wsr->page_age_lock); + return; + } + + if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) { + mutex_unlock(&wsr->page_age_lock); + return; + } + + mutex_unlock(&wsr->page_age_lock); + notify_workingset(memcg, pgdat); +} +#endif /* CONFIG_WORKINGSET_REPORT */ + #else /* !CONFIG_LRU_GEN */ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) @@ -6200,6 +6243,9 
@@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) if (zone->zone_pgdat == last_pgdat) continue; last_pgdat = zone->zone_pgdat; + + if (!sc->proactive) + try_to_report_workingset(zone->zone_pgdat, sc); shrink_node(zone->zone_pgdat, sc); } diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 8678536ccfc7..bbefb0046669 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -320,6 +320,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj) return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr; } +static ssize_t report_threshold_show(struct kobject *kobj, +struct kobj_attribute *attr, char *buf) +{ + struct wsr_s
[PATCH v4 6/9] selftest: test system-wide workingset reporting
A basic test that verifies the working set size of a simple memory accessor. It should work with or without the aging thread. When running tests with run_vmtests.sh, file workingset report testing requires an environment variable WORKINGSET_REPORT_TEST_FILE_PATH to store a temporary file, which is passed into the test invocation as a parameter. Signed-off-by: Yuanchu Xie --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + tools/testing/selftests/mm/run_vmtests.sh | 5 + .../testing/selftests/mm/workingset_report.c | 306 .../testing/selftests/mm/workingset_report.h | 39 +++ .../selftests/mm/workingset_report_test.c | 330 ++ 6 files changed, 684 insertions(+) create mode 100644 tools/testing/selftests/mm/workingset_report.c create mode 100644 tools/testing/selftests/mm/workingset_report.h create mode 100644 tools/testing/selftests/mm/workingset_report_test.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index da030b43e43b..e5cd0085ab74 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -51,3 +51,4 @@ hugetlb_madv_vs_map mseal_test seal_elf droppable +workingset_report_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 0f8c110e0805..5c6a7464da6e 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv TEST_GEN_FILES += hugetlb_madv_vs_map TEST_GEN_FILES += hugetlb_dio TEST_GEN_FILES += droppable +TEST_GEN_FILES += workingset_report_test ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty @@ -138,6 +139,8 @@ $(TEST_GEN_FILES): vm_util.c thp_settings.c $(OUTPUT)/uffd-stress: uffd-common.c $(OUTPUT)/uffd-unit-tests: uffd-common.c +$(OUTPUT)/workingset_report_test: workingset_report.c + ifeq ($(ARCH),x86_64) BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32)) BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64)) diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh index c5797ad1d37b..63782667381a 100755 --- a/tools/testing/selftests/mm/run_vmtests.sh +++ b/tools/testing/selftests/mm/run_vmtests.sh @@ -75,6 +75,8 @@ separated by spaces: read-only VMAs - mdwe test prctl(PR_SET_MDWE, ...) +- workingset_report + test workingset reporting example: ./run_vmtests.sh -t "hmm mmap ksm" EOF @@ -456,6 +458,9 @@ CATEGORY="mkdirty" run_test ./mkdirty CATEGORY="mdwe" run_test ./mdwe_test +CATEGORY="workingset_report" run_test ./workingset_report_test \ + "${WORKINGSET_REPORT_TEST_FILE_PATH}" + echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix echo "1..${count_total}" | tap_output diff --git a/tools/testing/selftests/mm/workingset_report.c b/tools/testing/selftests/mm/workingset_report.c new file mode 100644 index ..ee4dda5c371d --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report.c @@ -0,0 +1,306 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#define SYSFS_NODE_ONLINE "/sys/devices/system/node/online" +#define PROC_DROP_CACHES "/proc/sys/vm/drop_caches" + +/* Returns read len on success, or -errno on failure. 
*/ +static ssize_t read_text(const char *path, char *buf, size_t max_len) +{ + ssize_t len; + int fd, err; + size_t bytes_read = 0; + + if (!max_len) + return -EINVAL; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + while (bytes_read < max_len - 1) { + len = read(fd, buf + bytes_read, max_len - 1 - bytes_read); + + if (len <= 0) + break; + bytes_read += len; + } + + buf[bytes_read] = '\0'; + + err = -errno; + close(fd); + return len < 0 ? err : bytes_read; +} + +/* Returns written len on success, or -errno on failure. */ +static ssize_t write_text(const char *path, const char *buf, ssize_t max_len) +{ + int fd, len, err; + size_t bytes_written = 0; + + fd = open(path, O_WRONLY | O_APPEND); + if (fd < 0) + return -errno; + + while (bytes_written < max_len) { + len = write(fd, buf + bytes_written, max_len - bytes_written); + + if (len < 0) + break; + bytes_written += len; + } + + err = -errno; + close(fd); + return len < 0 ? err : bytes_written; +} + +static long read_num(const char *path) +{ + char buf[21]; + + if (read_text(path, buf, sizeof(buf)) <= 0) + return -1; + return (long)strtoul(buf, NULL, 10); +} + +static int writ
[PATCH v4 4/9] mm: extend workingset reporting to memcgs
Break down the system-wide workingset report into per-memcg reports, which aggregages its children hierarchically. The per-node workingset reporting histograms and refresh/report threshold files are presented as memcg files, showing a report containing all the nodes. The per-node page age interval is configurable in sysfs and not available per-memcg, while the refresh interval and report threshold are configured per-memcg. Memcg interface: /sys/fs/cgroup/.../memory.workingset.page_age The memcg equivalent of the sysfs workingset page age histogram breaks down the workingset of this memcg and its children into page age intervals. Each node is prefixed with a node header and a newline. Non-proactive direct reclaim on this memcg can also wake up userspace agents that are waiting on this file. e.g. N0 1000 anon=0 file=0 2000 anon=0 file=0 3000 anon=0 file=0 4000 anon=0 file=0 5000 anon=0 file=0 18446744073709551615 anon=0 file=0 /sys/fs/cgroup/.../memory.workingset.refresh_interval The memcg equivalent of the sysfs refresh interval. A per-node number of how much time a page age histogram is valid for, in milliseconds. e.g. echo N0=2000 > memory.workingset.refresh_interval /sys/fs/cgroup/.../memory.workingset.report_threshold The memcg equivalent of the sysfs report threshold. A per-node number of how often userspace agent waiting on the page age histogram can be woken up, in milliseconds. e.g. echo N0=1000 > memory.workingset.report_threshold Signed-off-by: Yuanchu Xie --- include/linux/memcontrol.h| 21 include/linux/workingset_report.h | 15 ++- mm/internal.h | 2 + mm/memcontrol.c | 160 +- mm/workingset_report.c| 50 +++--- 5 files changed, 230 insertions(+), 18 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index e1b41554a5fb..fd595b33a54d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -323,6 +323,11 @@ struct mem_cgroup { spinlock_t event_list_lock; #endif /* CONFIG_MEMCG_V1 */ +#ifdef CONFIG_WORKINGSET_REPORT + /* memory.workingset.page_age file */ + struct cgroup_file workingset_page_age_file; +#endif + struct mem_cgroup_per_node *nodeinfo[]; }; @@ -1094,6 +1099,16 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm, void split_page_memcg(struct page *head, int old_order, int new_order); +static inline struct cgroup_file * +mem_cgroup_page_age_file(struct mem_cgroup *memcg) +{ +#ifdef CONFIG_WORKINGSET_REPORT + return &memcg->workingset_page_age_file; +#else + return NULL; +#endif +} + #else /* CONFIG_MEMCG */ #define MEM_CGROUP_ID_SHIFT0 @@ -1511,6 +1526,12 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx) static inline void split_page_memcg(struct page *head, int old_order, int new_order) { } + +static inline struct cgroup_file * +mem_cgroup_page_age_file(struct mem_cgroup *memcg) +{ + return NULL; +} #endif /* CONFIG_MEMCG */ /* diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 2ec8b927b200..616be6469768 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -9,6 +9,8 @@ struct mem_cgroup; struct pglist_data; struct node; struct lruvec; +struct cgroup_file; +struct wsr_state; #ifdef CONFIG_WORKINGSET_REPORT @@ -40,7 +42,10 @@ struct wsr_state { unsigned long report_threshold; unsigned long refresh_interval; - struct kernfs_node *page_age_sys_file; + union { + struct kernfs_node *page_age_sys_file; + struct cgroup_file *page_age_cgroup_file; + }; /* breakdown of workingset by page age */ struct mutex page_age_lock; 
@@ -60,6 +65,9 @@ void wsr_remove_sysfs(struct node *node); */ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, struct pglist_data *pgdat); + +int wsr_set_refresh_interval(struct wsr_state *wsr, +unsigned long refresh_interval); #else static inline void wsr_init_lruvec(struct lruvec *lruvec) { @@ -79,6 +87,11 @@ static inline void wsr_init_sysfs(struct node *node) static inline void wsr_remove_sysfs(struct node *node) { } +static inline int wsr_set_refresh_interval(struct wsr_state *wsr, + unsigned long refresh_interval) +{ + return 0; +} #endif /* CONFIG_WORKINGSET_REPORT */ #endif /* _LINUX_WORKINGSET_REPORT_H */ diff --git a/mm/internal.h b/mm/internal.h index 508b7d9937d6..50ca0c6e651c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -484,6 +484,8 @@ void set_task_reclaim_state(struct task_struct *task, struct reclaim_state *rs)
[PATCH v4 5/9] mm: add kernel aging thread for workingset reporting
For reliable and timely aging on memcgs, one has to read the page age histograms on time. A kernel thread makes it easier by aging memcgs with valid refresh_interval when they can be refreshed, and also reduces the latency of any userspace consumers of the page age histogram. The kerne aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING. Debugging stats may be added in the future for when aging cannot keep up with the configured refresh_interval. Signed-off-by: Yuanchu Xie --- include/linux/workingset_report.h | 10 ++- mm/Kconfig| 6 ++ mm/Makefile | 1 + mm/memcontrol.c | 2 +- mm/workingset_report.c| 13 ++- mm/workingset_report_aging.c | 127 ++ 6 files changed, 154 insertions(+), 5 deletions(-) create mode 100644 mm/workingset_report_aging.c diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 616be6469768..f6bbde2a04c3 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -64,7 +64,15 @@ void wsr_remove_sysfs(struct node *node); * The next refresh time is stored in refresh_time. */ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat); + struct pglist_data *pgdat, unsigned long *refresh_time); + +#ifdef CONFIG_WORKINGSET_REPORT_AGING +void wsr_wakeup_aging_thread(void); +#else /* CONFIG_WORKINGSET_REPORT_AGING */ +static inline void wsr_wakeup_aging_thread(void) +{ +} +#endif /* CONFIG_WORKINGSET_REPORT_AGING */ int wsr_set_refresh_interval(struct wsr_state *wsr, unsigned long refresh_interval); diff --git a/mm/Kconfig b/mm/Kconfig index be949786796d..a8def8c65610 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1310,6 +1310,12 @@ config WORKINGSET_REPORT This option exports stats and events giving the user more insight into its memory working set. +config WORKINGSET_REPORT_AGING + bool "Workingset report kernel aging thread" + depends on WORKINGSET_REPORT + help + Performs aging on memcgs with their configured refresh intervals. 
+ source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index f5ef0768253a..3a282510f960 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -99,6 +99,7 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o +obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o ifdef CONFIG_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d1032c6efc66..ea83f10b22a1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4462,7 +4462,7 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v) if (!READ_ONCE(wsr->page_age)) continue; - wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL); mutex_lock(&wsr->page_age_lock); if (!wsr->page_age) goto unlock; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 1e1bdb8bf75b..dad539e602bb 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -283,7 +283,7 @@ static void copy_node_bins(struct pglist_data *pgdat, } bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, - struct pglist_data *pgdat) + struct pglist_data *pgdat, unsigned long *refresh_time) { struct wsr_page_age_histo *page_age; unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval); @@ -300,10 +300,14 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, goto unlock; if (page_age->timestamp && time_is_after_jiffies(page_age->timestamp + refresh_interval)) - goto unlock; + goto time; refresh_scan(wsr, root, pgdat, refresh_interval); copy_node_bins(pgdat, page_age); refresh_aggregate(page_age, root, pgdat); + +time: + if (refresh_time) + *refresh_time = page_age->timestamp + refresh_interval; unlock: mutex_unlock(&wsr->page_age_lock); return !!page_age; @@ -386,6 +390,9 @@ int wsr_set_refresh_interval(struct wsr_state *wsr, WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(refresh_interval)); unlock: mutex_unlock(&wsr->page_age_lock); + if (!err && refresh_interval && + (!old_interval || jiffies_to_msecs(old_interval) > refresh_interval)) + wsr_wakeup_aging_thread(); return err; } @@ -491,7 +498,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr, int re
[PATCH v4 9/9] virtio-balloon: add workingset reporting
Ballooning is a way to dynamically size a VM, and it requires guest collaboration. The amount to balloon without adversely affecting guest performance is hard to compute without clear metrics from the guest. Workingset reporting can provide guidance to the host to allow better collaborative ballooning, such that the host balloon controller can properly gauge the amount of memory the guest is actively using, i.e., the working set. A draft QEMU series [1] is being worked on. Currently it is able to configure the workingset reporting bins, refresh_interval, and report threshold. Through QMP or HMP, a balloon controller can request a workingset report. There is also a script [2] exercising the QMP interface with a visual breakdown of the guest's workingset size. According to the OASIS VIRTIO v1.3, there's a new balloon device in the works and this one I'm adding to is the "traditional" balloon. If the existing balloon device is not the right place for new features. I'm more than happy to add it to the new one as well. For technical details, this patch adds the a generic mechanism into workingset reporting infrastructure to allow other parts of the kernel to receive workingset reports. Two virtqueues are added to the virtio-balloon device, notification_vq and report_vq. The notification virtqueue allows the host to configure the guest workingset reporting parameters and request a report. The report virtqueue sends a working set report to the host when one is requested or due to memory pressure. The workingset reporting feature is gated by the compilation flag CONFIG_WORKINGSET_REPORT and the balloon feature flag VIRTIO_BALLOON_F_WS_REPORTING. [1] https://github.com/Dummyc0m/qemu/tree/wsr [2] https://gist.github.com/Dummyc0m/d45b4e1b0dda8f2bc6cd8cfb37cc7e34 Signed-off-by: Yuanchu Xie --- drivers/virtio/virtio_balloon.c | 390 +++- include/linux/balloon_compaction.h | 1 + include/linux/mmzone.h | 4 + include/linux/workingset_report.h | 66 - include/uapi/linux/virtio_balloon.h | 30 +++ mm/workingset_report.c | 89 ++- 6 files changed, 566 insertions(+), 14 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index b36d2803674e..8eb300653dd8 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -18,6 +18,7 @@ #include #include #include +#include /* * Balloon device works in 4K page units. So each page is pointed to by @@ -45,6 +46,8 @@ enum virtio_balloon_vq { VIRTIO_BALLOON_VQ_STATS, VIRTIO_BALLOON_VQ_FREE_PAGE, VIRTIO_BALLOON_VQ_REPORTING, + VIRTIO_BALLOON_VQ_WORKING_SET, + VIRTIO_BALLOON_VQ_NOTIFY, VIRTIO_BALLOON_VQ_MAX }; @@ -124,6 +127,23 @@ struct virtio_balloon { spinlock_t wakeup_lock; bool processing_wakeup_event; u32 wakeup_signal_mask; + +#ifdef CONFIG_WORKINGSET_REPORT + struct virtqueue *working_set_vq, *notification_vq; + + /* Protects node_id, wsr_receiver, and report_buf */ + struct mutex wsr_report_lock; + int wsr_node_id; + struct wsr_receiver wsr_receiver; + /* Buffer to report to host */ + struct virtio_balloon_working_set_report *report_buf; + + /* Buffer to hold incoming notification from the host. 
*/ + struct virtio_balloon_working_set_notify *notify_buf; + + struct work_struct update_balloon_working_set_work; + struct work_struct update_balloon_notification_work; +#endif }; #define VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST (1 << 0) @@ -339,8 +359,352 @@ static unsigned int leak_balloon(struct virtio_balloon *vb, size_t num) return num_freed_pages; } -static inline void update_stat(struct virtio_balloon *vb, int idx, - u16 tag, u64 val) +#ifdef CONFIG_WORKINGSET_REPORT +static bool wsr_is_configured(struct virtio_balloon *vb) +{ + if (node_online(READ_ONCE(vb->wsr_node_id)) && + READ_ONCE(vb->wsr_receiver.wsr.refresh_interval) > 0 && + READ_ONCE(vb->wsr_receiver.wsr.page_age) != NULL) + return true; + return false; +} + +/* wsr_receiver callback */ +static void wsr_receiver_notify(struct wsr_receiver *receiver) +{ + int bin; + struct virtio_balloon *vb = + container_of(receiver, struct virtio_balloon, wsr_receiver); + + /* if we fail to acquire the locks, send stale report */ + if (!mutex_trylock(&vb->wsr_report_lock)) + goto out; + if (!mutex_trylock(&receiver->wsr.page_age_lock)) + goto out_unlock_report_buf; + if (!READ_ONCE(receiver->wsr.page_age)) + goto out_unlock_page_age; + + vb->report_buf->error = cpu_to_le32(0); + vb->report_buf->node_id = cpu_to_le32(vb->wsr_node_id); + for (bin = 0; bin < WORKINGSET_REPORT_MAX_NR_BINS; ++bin) { + struct virtio_balloon_working_set_report_bin
[PATCH v4 7/9] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces
Add workingset reporting documentation for better discoverability of its sysfs and memcg interfaces. Also document the required kernel config to enable workingset reporting. Signed-off-by: Yuanchu Xie --- Documentation/admin-guide/mm/index.rst| 1 + .../admin-guide/mm/workingset_report.rst | 105 ++ 2 files changed, 106 insertions(+) create mode 100644 Documentation/admin-guide/mm/workingset_report.rst diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..61a2a347fc91 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -41,4 +41,5 @@ the Linux memory management. swap_numa transhuge userfaultfd + workingset_report zswap diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst new file mode 100644 index ..0969513705c4 --- /dev/null +++ b/Documentation/admin-guide/mm/workingset_report.rst @@ -0,0 +1,105 @@ +.. SPDX-License-Identifier: GPL-2.0 + += +Workingset Report += +Workingset report provides a view of memory coldness in user-defined +time intervals, e.g. X bytes are Y milliseconds cold. It breaks down +the user pages in the system per-NUMA node, per-memcg, for both +anonymous and file pages into histograms that look like: +:: + +1000 anon=137368 file=24530 +2 anon=34342 file=0 +3 anon=353232 file=333608 +4 anon=407198 file=206052 +9223372036854775807 anon=4925624 file=892892 + +The workingset reports can be used to drive proactive reclaim, by +identifying the number of cold bytes in a memcg, then writing to +``memory.reclaim``. + +Quick start +=== +Build the kernel with the following configurations. The report relies +on Multi-gen LRU for page coldness. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` +* ``CONFIG_WORKINGSET_REPORT=y`` + +Optionally, the aging kernel daemon can be enabled with the following +configuration. +* ``CONFIG_WORKINGSET_REPORT_AGING=y`` + +Sysfs interfaces + +``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides +a per-node page age histogram, showing an aggregate of the node's lruvecs. +Reading this file causes a hierarchical aging of all lruvecs, scanning +pages and creates a new Multi-gen LRU generation in each lruvec. +For example: +:: + +1000 anon=0 file=0 +2000 anon=0 file=0 +10 anon=5533696 file=5566464 +18446744073709551615 anon=0 file=0 + +``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals`` +is a comma-separated list of time in milliseconds that configures what +the page age histogram uses for aggregation. For the above histogram, +the intervals are:: + +1000,2000,10 + +``/sys/devices/system/node/nodeX/workingset_report/refresh_interval`` +defines the amount of time the report is valid for in milliseconds. +When a report is still valid, reading the ``page_age`` file shows +the existing valid report, instead of generating a new one. + +``/sys/devices/system/node/nodeX/workingset_report/report_threshold`` +specifies how often the userspace agent can be notified for node +memory pressure, in milliseconds. When a node reaches its low +watermarks and wakes up kswapd, programs waiting on ``page_age`` are +woken up so they can read the histogram and make policy decisions. + +Memcg interface +=== +While ``page_age_interval`` is defined per-node in sysfs, ``page_age``, +``refresh_interval`` and ``report_threshold`` are available per-memcg. 
+ +``/sys/fs/cgroup/.../memory.workingset.page_age`` +The memcg equivalent of the sysfs workingset page age histogram +breaks down the workingset of this memcg and its children into +page age intervals. Each node is prefixed with a node header and +a newline. Non-proactive direct reclaim on this memcg can also +wake up userspace agents that are waiting on this file. +E.g. +:: + +N0 +1000 anon=0 file=0 +2000 anon=0 file=0 +3000 anon=0 file=0 +4000 anon=0 file=0 +5000 anon=0 file=0 +18446744073709551615 anon=0 file=0 + +``/sys/fs/cgroup/.../memory.workingset.refresh_interval`` +The memcg equivalent of the sysfs refresh interval. A per-node +number of how much time a page age histogram is valid for, in +milliseconds. +E.g. +:: + +echo N0=2000 > memory.workingset.refresh_interval + +``/sys/fs/cgroup/.../memory.workingset.report_threshold`` +The memcg equivalent of the sysfs report threshold. A per-node +number of how often userspace agent waiting on the page age +histogram can be woken up, in milliseconds. +E.g. +:: + +echo N0=1000 > memory.workingset.report_threshold -- 2.47.0.338.g60cca15819-goog
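For illustration, a minimal userspace consumer of the interfaces documented
above might look roughly like the sketch below: it waits for the wakeup
described for ``page_age``/``report_threshold`` and re-reads the histogram.
The node path, the poll()-based wakeup on the sysfs file, and the buffer
size are assumptions made for the example, not something the series
mandates.

/*
 * Illustrative sketch only (not part of the series): wait for workingset
 * notifications on one node's page_age file and print the refreshed
 * histogram.  Assumes the "<age_ms> anon=<bytes> file=<bytes>" lines
 * documented above.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path =
		"/sys/devices/system/node/node0/workingset_report/page_age";
	char buf[8192];
	ssize_t n;
	int fd = open(path, O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };

	if (fd < 0)
		return 1;

	/* an initial read generates a report and arms the notification */
	n = read(fd, buf, sizeof(buf) - 1);

	for (;;) {
		/* sysfs notifications show up as POLLPRI | POLLERR */
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (lseek(fd, 0, SEEK_SET) < 0)
			break;
		n = read(fd, buf, sizeof(buf) - 1);
		if (n <= 0)
			break;
		buf[n] = '\0';
		fputs(buf, stdout);	/* one "<age> anon=... file=..." line per bin */
	}

	close(fd);
	return 0;
}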
[PATCH v4 8/9] Docs/admin-guide/cgroup-v2: document workingset reporting
Add workingset reporting documentation for better discoverability of its memcg interfaces. Point the memcg documentation to Documentation/admin-guide/mm/workingset_report.rst for more details. Signed-off-by: Yuanchu Xie --- Documentation/admin-guide/cgroup-v2.rst | 35 + 1 file changed, 35 insertions(+) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 2cb58daf3089..67a183f08245 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1784,6 +1784,41 @@ The following nested keys are defined. Shows pressure stall information for memory. See :ref:`Documentation/accounting/psi.rst ` for details. + memory.workingset.page_age + A read-only histogram which exists on non-root cgroups. + + This breaks down the cgroup's memory footprint into different + types of memory and groups them per-node into user-defined coldness + bins. + + The output format of memory.workingset.page_age is:: + + N0 + type= + type= + ... + 18446744073709551615 type= + + The type of memory can be anon, file, or new types added later. + Don't rely on the types remaining fixed. See + :ref:`Documentation/admin-guide/mm/workingset_report.rst ` + for details. + + memory.workingset.refresh_interval + A read-write nested-keyed file which exists on non-root cgroups. + + Setting it to a non-zero value for any node enables working set + reporting for that node. The default is 0 for each node. See + :ref:`Documentation/admin-guide/mm/workingset_report.rst ` + for details. + + memory.workingset.report_threshold + A read-write nested-keyed file which exists on non-root cgroups. + + The amount of milliseconds to wait before reporting the working + set again. The default is 0 for each node. See + :ref:`Documentation/admin-guide/mm/workingset_report.rst ` + for details. Usage Guidelines -- 2.47.0.338.g60cca15819-goog
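To connect this with the proactive reclaim use case mentioned in the cover
letter and in Documentation/admin-guide/mm/workingset_report.rst, a rough
(hypothetical) agent could sum the bins colder than some cutoff and feed the
result to ``memory.reclaim``. The cgroup path, the cutoff, and the
assumption that the per-bin counters are in bytes are illustrative only:

/*
 * Hypothetical example, not part of the series: read one cgroup's
 * memory.workingset.page_age, add up anon+file counters in bins older
 * than a cutoff, and request that much reclaim via memory.reclaim.
 * Per-node "N<id>" headers are skipped; a real agent would track them.
 */
#include <stdio.h>

#define CGROUP		"/sys/fs/cgroup/workload"
#define COLD_CUTOFF_MS	10000ULL	/* treat pages idle for >10s as cold */

int main(void)
{
	char line[256], path[512];
	unsigned long long age, anon, file, cold = 0;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.workingset.page_age", CGROUP);
	f = fopen(path, "r");
	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		if (line[0] == 'N')		/* per-node header, e.g. "N0" */
			continue;
		if (sscanf(line, "%llu anon=%llu file=%llu",
			   &age, &anon, &file) != 3)
			continue;
		if (age > COLD_CUTOFF_MS)
			cold += anon + file;
	}
	fclose(f);

	snprintf(path, sizeof(path), "%s/memory.reclaim", CGROUP);
	f = fopen(path, "w");
	if (!f)
		return 1;
	/* memory.reclaim interprets a bare number as bytes to reclaim */
	fprintf(f, "%llu\n", cold);
	fclose(f);
	return 0;
}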
[PATCH v4 0/9] mm: workingset reporting
This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace.
Another interesting idea might be hugepage workingset, so that we can
measure the proportion of hugepages backing cold memory. However, with
architectures like arm, there may be too many hugepage sizes leading to
a combinatorial explosion when exporting stats to the userspace.
Nonetheless, the kernel should provide a set of workingset interfaces
that is generic enough to accommodate the various use cases, and
extensible to potential future use cases.

Use cases
=========
Job scheduling
On overcommitted hosts, workingset information improves efficiency and
reliability by allowing the job scheduler to have better stats on the
exact memory requirements of each job. This can manifest in efficiency
by landing more jobs on the same host or NUMA node. On the other hand,
the job scheduler can also ensure each node has a sufficient amount of
memory and does not enter direct reclaim or the kernel OOM path. With
workingset information and job priority, the userspace OOM killing or
proactive reclaim policy can kick in before the system is under memory
pressure. If the job shape is very different from the machine shape,
knowing the workingset per-node can also help inform page allocation
policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed
too much, workingset reporting allows the policy to be more accurate
and flexible.

Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to
report the guest workingset.
Balloon policies benefit from workingset to more precisely determine
the size of the memory balloon. On end-user devices where memory is
scarce and overcommitted, the balloon sizing in multiple VMs running on
the same device can be orchestrated with workingset reports from each
one. On the server side, workingset reporting allows the balloon
controller to inflate the balloon without causing too much file cache
to be reclaimed in the guest.

Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset
information can help connect the two and avoid pages being migrated
back and forth.
For example, given a promotion hot page threshold defined in reaccess
distance of N seconds (promote pages accessed more often than every N
seconds), the threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node pass the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].

[1] https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

Sysfs and Cgroup Interfaces
===========================
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.
1000 anon=137368 file=24530
2 anon=34342 file=0
3 anon=353232 file=333608
4 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

Implementation
==============
The reporting of user pages is based off of MGLRU, and therefore
requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations
for a more fine-grained workingset report, but we can already gather a
lot of data with just four generations. The workingset reporting
mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the aging
thread is behind CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==========
Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran
Linux compile and redis benchmarks from openbenchmarking.org. The
policy and runner is referred to as WMO (Workload Memory Optimization).
The results were based on v3 of the series, but v4 doesn't change the
core of the working set reporting and just adds the ballooning
counterpart.

The timed Linux kernel compilation benchmark shows improvements in peak
memory usage with a policy of "swap out all bytes colder than 10
seconds every 40 seconds". A swapfile is configured on SSD.

peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%

Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.48
Re: [PATCH v6 00/15] integrity: Introduce the Integrity Digest Cache
On Tue, 2024-11-26 at 00:13 +, Eric Snowberg wrote:
> 
> > On Nov 19, 2024, at 3:49 AM, Roberto Sassu
> > wrote:
> > 
> > From: Roberto Sassu
> > 
> > The Integrity Digest Cache can also help IMA for appraisal. IMA can simply
> > lookup the calculated digest of an accessed file in the list of digests
> > extracted from package headers, after verifying the header signature. It is
> > sufficient to verify only one signature for all files in the package, as
> > opposed to verifying a signature for each file.
> 
> Is there a way to maintain integrity over time? Today if a CVE is discovered
> in a signed program, the program hash can be added to the blacklist keyring.
> Later if IMA appraisal is used, the signature validation will fail just for that
> program. With the Integrity Digest Cache, is there a way to do this?

As far as I can see, the ima_check_blacklist() call is before
ima_appraise_measurement(). If it fails, appraisal with the Integrity
Digest Cache will not be done.

In the future, we might use the Integrity Digest Cache for blacklists
too. Since a digest cache is reset on a file/directory change, IMA would
have to revalidate the program digest against a new digest cache.

Thanks

Roberto
Re: [PATCH v6 07/15] digest_cache: Allow registration of digest list parsers
On Mon, 2024-11-25 at 15:53 -0800, Luis Chamberlain wrote: > On Tue, Nov 19, 2024 at 11:49:14AM +0100, Roberto Sassu wrote: > > From: Roberto Sassu > > Introduce load_parser() to load a kernel module containing a > > parser for the requested digest list format (compressed kernel modules are > > supported). Kernel modules are searched in the > > /lib/modules//security/integrity/digest_cache directory. > > > > load_parser() calls ksys_finit_module() to load a kernel module directly > > from the kernel. request_module() cannot be used at this point, since the > > reference digests of modprobe and the linked libraries (required for IMA > > appraisal) might not be yet available, resulting in modprobe execution > > being denied. > > You are doing a full solution implementation of loading modules in-kernel. > Appraisals of modules is just part of the boot process, some module > loading may need firmware to loading to get some functinality to work > for example some firmware to get a network device up or a GPU driver. > So module loading alone is not the only thing which may require > IMA appraisal, and this solution only addresses modules. There are other > things which may be needed other than firmware, eBPF programs are > another example. Firmware, eBPF programs and so on are supposed to be verified with digest lists (or alternative methods, such as file signatures), once the required digest list parsers are loaded. The parser is an exceptional case, because user space cannot be executed at this point. Once the parsers are loaded, verification of everything else proceeds as normal. Fortunately, in most cases kernel modules are signed, so digest lists are not required to verify them. > It sounds more like you want to provide or extend LSM hooks fit your > architecture and make kernel_read_file() LSM hooks optionally use it to > fit this model. As far as the LSM infrastructure is concerned, I'm not adding new LSM hooks, nor extending/modifying the existing ones. The operations the Integrity Digest Cache is doing match the usage expectation by LSM (net denying access, as discussed with Paul Moore). The Integrity Digest Cache is supposed to be used as a supporting tool for other LSMs to do regular access control based on file data and metadata integrity. In doing that, it still needs the LSM infrastructure to notify about filesystem changes, and to store additional information in the inode and file descriptor security blobs. The kernel_post_read_file LSM hook should be implemented by another LSM to verify the integrity of a digest list, when the Integrity Digest Cache calls kernel_read_file() to read that digest list. That LSM is also responsible to provide the result of the integrity verification to the Integrity Digest Cache, so that the latter can give this information back to whoever wants to do a digest lookup from that digest list and also wants to know whether or not the digest list was authentic. > Because this is just for a *phase* in boot, which you've caught because > a catch-22 situaton, where you didn't have your parsers loaded. Which is > just a reflection that you hit that snag. It doesn't prove all snags > will be caught yet. Yes, that didn't happen earlier, since all the parsers were compiled built-in in the kernel. The Integrity Digest Cache already has a deadlock avoidance mechanism for digest lists. Supporting kernel modules opened the road for new deadlocks, since one can ask a digest list to verify a kernel module, but that digest list requires the same kernel module. 
That is why the in-kernel mechanism is 100% reliable, because the
Integrity Digest Cache marks the file descriptors it opens, and can
recognize them, when those file descriptors are passed back to it by
other LSMs (e.g. through the kernel_post_read_file LSM hook).

> And you only want to rely on this .. in-kernel loading solution only
> early on boot, is there a way to change this over to enable regular
> operation later?

User space can voluntarily load new digest list parsers, but the
Integrity Digest Cache cannot rely on it to be done.

Also, using request_module() does not seem a good idea, since it
wouldn't allow the Integrity Digest Cache to mark the file descriptor
of kernel modules, and thus the Integrity Digest Cache could not
determine whether or not a deadlock is happening.

Thanks

Roberto