Background
==========

Currently, reading files with different paths (or names) but identical content consumes multiple copies of the page cache, even though the cached data is the same. For example, reading identical files (e.g., *.so files) from two different minor versions of a container image produces multiple copies of the same page cache, since different containers have different mount points. Sharing the page cache among files with the same content can therefore save memory.

Proposal
========

1. determining file identity
----------------------------

First, a way is needed to check whether the contents of two files are identical. Here, the xattr values carrying the file fingerprints are compared for equality. When creating the EROFS image, users can specify the name of the xattr used for file fingerprints, and that name is stored in the packfile. The on-disk `ishare_key_start` field records the offset of the xattr's name within the packfile:

```
struct erofs_super_block {
	...
	__le32 xattr_prefix_start;	/* start of long xattr prefixes */
	__le64 packed_nid;		/* nid of the special packed inode */
	__u8 xattr_filter_reserved;	/* reserved for xattr name filter */
-	__u8 reserved2[23];
+	__le32 ishare_key_start;	/* start of ishare key */
+	__u8 reserved2[19];
 };
```

For example, users can specify the first long prefix as the name of the file fingerprint as follows:

```
mkfs.erofs --ishare_key=trusted.erofs.fingerprint erofs.img ./dir
```

In this way, `trusted.erofs.fingerprint` serves as the name of the xattr holding the file fingerprint. The relevant patches for erofs-utils will be released later.

At the same time, for security reasons, this patch series only shares page cache between files within the same domain, which is selected by adding "-o domain_id=xxx" during mounting:

```
mount -t erofs -o domain_id=xxx erofs.img /mnt
```

If no domain ID is specified, the mount falls back to the mode without page cache sharing.
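For illustration, a minimal sketch of this identity check in the kernel might look as follows. erofs_getxattr() and EROFS_XATTR_INDEX_TRUSTED are existing in-tree interfaces; the `erofs_ishare_*` helpers, the fixed-size fingerprint buffer, and the assumption that the key parsed from `ishare_key_start` uses the `trusted.` prefix are illustrative, not the series' actual code:

```c
/*
 * Sketch only: decide whether two inodes may share page cache by
 * comparing their fingerprint xattrs. The "ishare_*" names and the
 * fingerprint size limit are hypothetical.
 */
#include <linux/string.h>
#include "internal.h"
#include "xattr.h"	/* erofs_getxattr() */

#define EROFS_ISHARE_FP_MAX	64	/* assumed fingerprint size limit */

static int erofs_ishare_get_fingerprint(struct inode *inode,
					const char *key, u8 *buf)
{
	/*
	 * `key` is the name parsed from ishare_key_start with the
	 * "trusted." prefix stripped, e.g. "erofs.fingerprint".
	 */
	return erofs_getxattr(inode, EROFS_XATTR_INDEX_TRUSTED,
			      key, buf, EROFS_ISHARE_FP_MAX);
}

static bool erofs_ishare_same_content(struct inode *a, struct inode *b,
				      const char *key)
{
	u8 fp_a[EROFS_ISHARE_FP_MAX], fp_b[EROFS_ISHARE_FP_MAX];
	int len_a = erofs_ishare_get_fingerprint(a, key, fp_a);
	int len_b = erofs_ishare_get_fingerprint(b, key, fp_b);

	/* identical content iff both fingerprints exist and match */
	return len_a > 0 && len_a == len_b && !memcmp(fp_a, fp_b, len_a);
}
```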
2. whose page cache is shared?
------------------------------

2.1. share the page cache of inode_A or inode_B
-----------------------------------------------

For example, we can share the page cache of inode_A, referred to as PGCache_A. When reading file B, we read the contents from PGCache_A to achieve memory savings. Furthermore, if we need to read another file C with the same content, we still read from PGCache_A. In this way, we fulfill multiple read requests with just a single page cache.

2.2. share the de-duplicated inode's page cache
-----------------------------------------------

Unlike in 2.1, we allocate an internal deduplicated inode and share its page cache. Reads of files with identical content are ultimately routed to the page cache of the deduplicated inode. In this way, a single page cache satisfies multiple read requests for different files with the same content.

2.3. discussion of the two solutions
------------------------------------

Although the solution in 2.1 allows page cache sharing, it has an inherent drawback. Inodes are created and destroyed over the lifetime of the filesystem, so when inode_A is destroyed, PGCache_A is released with it. Consequently, any later read of the same file content must fetch the data from disk again. This conflicts with the design philosophy of the page cache (caching contents from the disk). Therefore, I chose the solution in 2.2: allocate an internal deduplicated inode and share its page cache.

3. Implementation
=================

3.1. file open & close
----------------------

When a file is opened, the ->private_data field of file A or file B is set to point to an internal deduplicated file. When an actual read occurs, the page cache of this deduplicated file is accessed.

When the file is opened, if the corresponding erofs inode is newly created, the following actions are performed:

1. add the erofs inode to the backing list of the deduplicated inode;
2. increase the reference count of the deduplicated inode.

The purpose of step 1 is to ensure that when a real I/O operation occurs, the deduplicated inode can locate one of the disk devices (as the deduplicated inode itself is not bound to a specific device). Step 2 manages the lifecycle of the deduplicated inode. When the erofs inode is destroyed, the opposite actions are taken.
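The sketch below illustrates this open/teardown pairing. `EROFS_I()` is the existing erofs inode container helper; `struct erofs_ishare_inode`, the `ishare`/`ishare_entry` members assumed inside `struct erofs_inode`, and the `erofs_ishare_*` helpers are illustrative names only (the actual series may split per-open and per-inode state differently):

```c
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/refcount.h>
#include <linux/spinlock.h>
#include "internal.h"		/* EROFS_I() */

/* Hypothetical dedup-inode bookkeeping; not the series' real layout. */
struct erofs_ishare_inode {
	struct inode *inode;		/* the internal deduplicated inode */
	struct list_head backing;	/* erofs inodes sharing this content */
	refcount_t ref;			/* lifecycle of the dedup inode */
	spinlock_t lock;
};

/* hypothetical fingerprint-keyed lookup within the mount's domain */
static struct erofs_ishare_inode *erofs_ishare_lookup(struct inode *inode);

static int erofs_ishare_file_open(struct inode *inode, struct file *file)
{
	struct erofs_ishare_inode *dedup = erofs_ishare_lookup(inode);

	if (!dedup)
		return -ENOENT;	/* the real code would fall back to normal reads */

	if (!EROFS_I(inode)->ishare) {	/* erofs inode newly created */
		spin_lock(&dedup->lock);
		list_add(&EROFS_I(inode)->ishare_entry, &dedup->backing); /* 1 */
		spin_unlock(&dedup->lock);
		refcount_inc(&dedup->ref);				  /* 2 */
		EROFS_I(inode)->ishare = dedup;
	}
	file->private_data = dedup;	/* reads are redirected through this */
	return 0;
}

/* When the erofs inode is destroyed, undo both steps above. */
static void erofs_ishare_inode_free(struct inode *inode)
{
	struct erofs_ishare_inode *dedup = EROFS_I(inode)->ishare;

	spin_lock(&dedup->lock);
	list_del(&EROFS_I(inode)->ishare_entry);
	spin_unlock(&dedup->lock);
	if (refcount_dec_and_test(&dedup->ref))
		iput(dedup->inode);	/* release the deduplicated inode */
}
```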
3.2. file reading
-----------------

Let PGCache_dedup denote the deduplicated inode's page cache. There are two possible scenarios when reading a file: 1) the content being read is already present in PGCache_dedup; 2) the content being read is not yet present in PGCache_dedup. The second scenario involves an iomap operation to read from the disk.

3.2.1. reading existing data in PGCache_dedup
---------------------------------------------

In this case, the overall read flow is as follows (taking ksys_read() as an example):

```
ksys_read
    │
    │
    ▼
   ...
    │
    │
    ▼
erofs_ishare_file_read_iter (switch to backing deduplicated file)
    │
    │
    ▼
read PGCache_dedup & return
```

At this point, the content in PGCache_dedup is read directly and returned.

3.2.2. reading non-existent content in PGCache_dedup
----------------------------------------------------

In this case, disk I/O is involved. Taking the reading of an uncompressed file as an example, the read path is:

```
ksys_read
    │
    │
    ▼
   ...
    │
    │
    ▼
erofs_ishare_file_read_iter (switch to backing deduplicated file)
    │
    │
    ▼
   ... (allocate pages)
    │
    │
    ▼
erofs_read_folio/erofs_readahead
    │
    │
    ▼
   ... (iomap)
    │
    │
    ▼
erofs_iomap_begin
    │
    │
    ▼
   ...
```

iomap and the layers below it perform the disk I/O. As described in 3.1, the deduplicated inode itself is not bound to a specific device, so it selects an erofs inode from its backing list (by default, the first one) to complete the corresponding iomap operation.

3.2.3. optimized inode selection
--------------------------------

The selection method described in 3.2.2 may pick an "inactive" inode, i.e. one whose device may have seen no read operations for a long time and is likely about to be unmounted. In that case, unmounting the device may be slightly delayed because other read requests are still routed to it. Therefore, "active" inodes should be preferred for the iomap operation.

To achieve this, an additional `processing` list is introduced. At the beginning of erofs_{read_folio,readahead}(), the corresponding erofs inode is added to the `processing` list (because it is active), and it is removed at the end of erofs_{read_folio,readahead}(). In erofs_iomap_begin(), the selected erofs inode's reference count is incremented, and in erofs_iomap_end() it is decremented. In this way, even after the erofs inode has been removed from the `processing` list, the elevated reference count keeps the data-reading process intact. This is somewhat similar in spirit to RCU (not identical, but similar).
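Continuing the sketch from 3.1 (with an assumed `struct list_head processing;` head added to `struct erofs_ishare_inode` and an `ishare_proc` entry in `struct erofs_inode`), here is one way this active-inode preference could look. The helper is hypothetical; it only captures the idea that erofs_iomap_begin() prefers an inode with a read already in flight and pins whichever inode it picks until erofs_iomap_end():

```c
/*
 * Sketch: pick an "active" backing inode for the iomap operation.
 * Assumes dedup->processing lists inodes currently inside
 * erofs_read_folio()/erofs_readahead(), linked via vi->ishare_proc.
 */
static struct erofs_inode *
erofs_ishare_pick_backing(struct erofs_ishare_inode *dedup)
{
	struct erofs_inode *vi;

	spin_lock(&dedup->lock);
	/* prefer an inode with a read in flight on its device */
	vi = list_first_entry_or_null(&dedup->processing,
				      struct erofs_inode, ishare_proc);
	if (!vi)	/* otherwise fall back to any backing inode */
		vi = list_first_entry_or_null(&dedup->backing,
					      struct erofs_inode, ishare_entry);
	if (vi)
		ihold(&vi->vfs_inode);	/* dropped again in erofs_iomap_end() */
	spin_unlock(&dedup->lock);
	return vi;
}
```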
3.3. release page cache
-----------------------

Similar to overlayfs, when dropping the page cache via ->fadvise(), erofs locates the deduplicated file and applies vfs_fadvise() to that file.

Effect
======

I conducted experiments on two aspects across two different minor versions of container images:

1. reading all files in the two image versions;
2. running workloads or the default entrypoint within the containers [1].

Below is the memory usage for reading all files in two different minor versions of container images:

```
+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     241     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     163     |      33%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     872     |       -       |
|     postgres      +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     630     |      28%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    2771     |       -       |
|    tensorflow     +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |    2340     |      16%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     926     |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |     735     |      21%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |     390     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |     219     |      44%      |
+-------------------+------------------+-------------+---------------+
|      tomcat       |        No        |     924     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |     474     |      49%      |
+-------------------+------------------+-------------+---------------+
```

Additionally, the table below shows the runtime memory usage of the containers:

```
+-------------------+------------------+-------------+---------------+
|       Image       | Page Cache Share | Memory (MB) |    Memory     |
|                   |                  |             | Reduction (%) |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    34.9     |       -       |
|       redis       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |    33.6     |      4%       |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    149.1    |       -       |
|     postgres      +------------------+-------------+---------------+
|    16.1 & 16.2    |        Yes       |     95      |      37%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |   1027.9    |       -       |
|    tensorflow     +------------------+-------------+---------------+
|  2.11.0 & 2.11.1  |        Yes       |    934.3    |      10%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    155.0    |       -       |
|       mysql       +------------------+-------------+---------------+
|  8.0.11 & 8.0.12  |        Yes       |    139.1    |      11%      |
+-------------------+------------------+-------------+---------------+
|                   |        No        |    25.4     |       -       |
|       nginx       +------------------+-------------+---------------+
|   7.2.4 & 7.2.5   |        Yes       |    18.8     |      26%      |
+-------------------+------------------+-------------+---------------+
|      tomcat       |        No        |     186     |       -       |
| 10.1.25 & 10.1.26 +------------------+-------------+---------------+
|                   |        Yes       |     99      |      47%      |
+-------------------+------------------+-------------+---------------+
```

When reading all files in the images, memory usage is reduced by 16% to 49%, depending on the specific image. Additionally, the containers' runtime memory usage is reduced by 4% to 47%.
[1] Below are the workloads for these images:
    - redis: redis-benchmark
    - postgres: sysbench
    - tensorflow: app.py of tensorflow.python.platform
    - mysql: sysbench
    - nginx: wrk
    - tomcat: default entrypoint

Compared to the previous version (v5), this series makes the following changes:
- support a user-defined fingerprint name;
- support domain-specific page cache sharing;
- code style adjustments;
- adjustments to the code implementation, etc.

v5: https://lore.kernel.org/all/20250105151208.3797385-1-hongz...@linux.alibaba.com/
v4: https://lore.kernel.org/all/20240902110620.2202586-1-hongz...@linux.alibaba.com/
v3: https://lore.kernel.org/all/20240828111959.3677011-1-hongz...@linux.alibaba.com/
v2: https://lore.kernel.org/all/20240731080704.678259-1-hongz...@linux.alibaba.com/
v1: https://lore.kernel.org/all/20240722065355.1396365-1-hongz...@linux.alibaba.com/

Hongzhen Luo (7):
  erofs: move `struct erofs_anon_fs_type` to super.c
  erofs: support user-defined fingerprint name
  erofs: support domain-specific page cache share
  erofs: introduce the page cache share feature
  erofs: support unencoded inodes for page cache share
  erofs: support compressed inodes for page cache share
  erofs: implement .fadvise for page cache share

 fs/erofs/Kconfig    |  10 ++
 fs/erofs/Makefile   |   1 +
 fs/erofs/data.c     |  82 ++++++++++-
 fs/erofs/erofs_fs.h |   9 +-
 fs/erofs/fscache.c  |  13 --
 fs/erofs/inode.c    |   5 +
 fs/erofs/internal.h |  29 ++++
 fs/erofs/ishare.c   | 330 ++++++++++++++++++++++++++++++++++++++++++++
 fs/erofs/ishare.h   |  55 ++++++++
 fs/erofs/super.c    |  66 ++++++++-
 fs/erofs/xattr.c    |  49 +++++++
 fs/erofs/xattr.h    |   6 +
 fs/erofs/zdata.c    |  57 ++++++--
 13 files changed, 675 insertions(+), 37 deletions(-)
 create mode 100644 fs/erofs/ishare.c
 create mode 100644 fs/erofs/ishare.h

--
2.43.5