[Background]
================
Currently, reading files with different paths (or names) but identical content consumes multiple copies of the page cache, even though the contents of those pages are the same. For example, reading identical files (e.g., *.so files) from two different minor versions of a container image costs multiple copies of the same page cache, since different containers have different mount points. Sharing the page cache among files with the same content can therefore save memory.
[Implementation]
================
During the mkfs phase, the content of each file is hashed and the hash value is stored in the `user.fingerprint` extended attribute. At runtime, inodes of files with the same `user.fingerprint` are mapped to an anonymous inode, whose page cache stores the actual content. In this way, a single copy of the anonymous inode's page cache can serve read requests from all files mapped to it. The diagram below shows the relationship between the anonymous inode and the inodes sharing the same content:

                         page cache
                  ┌────┬────┬────┬──────┐
             ┌────►│    │    │ ...│      │
             │     └────┴────┴────┴──────┘
             │
             │         i_private
          ┌──┴────────┬───┐
       ┌─►│ ano_inode │   │
       │  └───────────┴─┬─┘
       │                │
       │       ┌────────┘
    mapped     │
      to       ▼
       │  ┌──────────┬───┬─────┐
       │  │erofs_pcs │cur│ list│
       │  └──────────┴─┬─┴───┬─┘
       │               │     │
       │     ┌─────────┘     │
       │     │               │
       │     │    ┌──────────┘
       │     │    │
       │     ▼    ▼
       │  ┌────────┐       ┌────────┐           ┌────────┐
       └──┤        │ ────► │        │ ───►  ──► │        │
          │        │       │        │    ...    │        │
          └────────┘ ◄──── └────────┘ ◄───  ◄── └────────┘
           inode_1          inode_2              inode_n

In the above diagram, the `i_private` field (protected by `i_lock`) of the anonymous inode points to a `struct erofs_pcs`:

struct erofs_pcs {
	struct erofs_inode	*cur;
	struct rw_semaphore	rw_sem;
	struct mutex		list_mutex;
	struct list_head	list;
};

The `list` field anchors the list of inodes that are mapped to the anonymous inode and have the same `user.fingerprint`. The `cur` field points to the first inode in that list, which is used for I/O mapping (iomap) related operations. When an inode is created, it is added to the inode list of the `erofs_pcs` structure corresponding to the anonymous inode; likewise, when the inode is destroyed, it is removed from that list. Note that if the inode being removed is the one pointed to by `cur`, the read-write semaphore `rw_sem` must be acquired to maintain synchronization, in case the inode is concurrently being used for iomap operations elsewhere.
[Effect]
================
I conducted experiments on two aspects across two different minor versions of container images:

1. reading all files in the two images;
2. running workloads or the default entrypoint within the containers [1].

Below is the memory usage for reading all files in two different minor versions of container images:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     241     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     872     |       -       |
|     postgres      +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     630     |      28%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    2771     |       -       |
|    tensorflow     +------------------+-------------+---------------+
|  1.11.0 & 2.11.1  |        Yes       |    2340     |      16%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     926     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     390     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
+-------------------+------------------+-------------+---------------+
|      tomcat       |        No        |     924     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |     474     |      49%      |
+-------------------+------------------+-------------+---------------+

Additionally, the table below shows the runtime memory usage of the containers:

+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    34.9     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |    33.6     |       4%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    149.1    |       -       |
|     postgres      +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     95      |      37%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |   1027.9    |       -       |
|    tensorflow     +------------------+-------------+---------------+
|  1.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    155.0    |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    25.4     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |    18.8     |      26%      |
+-------------------+------------------+-------------+---------------+
|      tomcat       |        No        |     186     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |      99     |      47%      |
+-------------------+------------------+-------------+---------------+

It can be observed that when reading all files in the images, memory usage is reduced by 16% to 49%, depending on the specific image. The containers' runtime memory usage is reduced by 4% to 47%.
[1] Below are the workloads for these images:
  - redis: redis-benchmark
  - postgres: sysbench
  - tensorflow: app.py of tensorflow.python.platform
  - mysql: sysbench
  - nginx: wrk
  - tomcat: default entrypoint

Hongzhen Luo (4):
  erofs: move `struct erofs_anon_fs_type` to super.c
  erofs: expose erofs_iomap_{begin, end}
  erofs: introduce page cache share feature
  erofs: apply the page cache share feature

 fs/erofs/Kconfig           |  10 ++
 fs/erofs/Makefile          |   1 +
 fs/erofs/data.c            |   9 +-
 fs/erofs/fscache.c         |  13 +-
 fs/erofs/inode.c           |  17 ++
 fs/erofs/internal.h        |   8 +
 fs/erofs/pagecache_share.c | 318 +++++++++++++++++++++++++++++++++++++
 fs/erofs/pagecache_share.h |  23 +++
 fs/erofs/super.c           |  40 +++++
 9 files changed, 425 insertions(+), 14 deletions(-)
 create mode 100644 fs/erofs/pagecache_share.c
 create mode 100644 fs/erofs/pagecache_share.h

--
2.43.5