Hi
On Thu, 5 Jun 2025, Dongsheng Yang wrote:

> Hi Mikulas and all,
>
> This is *RFC v2* of the *pcache* series, a persistent-memory backed
> cache. Compared with *RFC v1*
> <https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.y...@linux.dev/>
> the most important change is that the whole cache has been *ported to
> the Device-Mapper framework* and is now exposed as a regular DM target.
>
> Code:
> https://github.com/DataTravelGuide/linux/tree/dm-pcache
>
> Full RFC v2 test results:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
>
> All 962 xfstests cases passed successfully under four different
> pcache configurations.
>
> One of the detailed xfstests runs:
> https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
>
> Below is a quick tour through the three layers of the implementation,
> followed by an example invocation.
>
> ----------------------------------------------------------------------
> 1. pmem access layer
> ----------------------------------------------------------------------
>
> * All reads use *copy_mc_to_kernel()* so that uncorrectable media
>   errors are detected and reported.
> * All writes go through *memcpy_flushcache()* to guarantee durability
>   on real persistent memory.

You could also try to use normal writes and clflushopt for big writes -
I found that it is better for larger regions - see the function
memcpy_flushcache_optimized in dm-writecache (sketched below, after the
quoted description). Test which way is better.

> ----------------------------------------------------------------------
> 2. cache-logic layer (segments / keys / workers)
> ----------------------------------------------------------------------
>
> Main features
> - 16 MiB pmem segments, log-structured allocation.
> - Multi-subtree RB-tree index for high parallelism.
> - Optional per-entry *CRC32* on cached data.

Would it be better to use crc32c, because it has hardware support in
the SSE4.2 instruction set?

> - Background *write-back* worker and watermark-driven *GC*.
> - Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
>
> Current limitations
> - Only *write-back* mode implemented.
> - Only FIFO cache invalidation; others (LRU, ARC, ...) planned.
>
> ----------------------------------------------------------------------
> 3. dm-pcache target integration
> ----------------------------------------------------------------------
>
> * Table line
>   `pcache <pmem_dev> <origin_dev> writeback <true|false>`
> * Features advertised to DM:
>   - `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
>     (they force all open key-sets to close and data to be durable).
> * Not yet supported:
>   - Discard / TRIM.
>   - dynamic `dmsetup reload`.

If you don't support it, you should at least try to detect that the
user did a reload and return an error, so that there is no data
corruption in this case. But it would be better to support table
reload. You can support it with a mechanism similar to
"__handover_exceptions" in the dm-snap.c driver.

> Runtime controls
> - `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
>
> Status line reports super-block flags, segment counts, GC threshold
> and the three tail/head pointers (see the RST document for details).
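For reference, memcpy_flushcache_optimized in dm-writecache (mentioned
above) looks roughly like this - quoted from memory from
drivers/md/dm-writecache.c, so check the current tree for the exact
code:

static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
{
	/*
	 * clflushopt performs better with larger block sizes (1024,
	 * 2048, 4096); non-temporal stores win for smaller regions.
	 */
#ifdef CONFIG_X86
	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
	    likely(boot_cpu_data.x86_clflush_size == 64) &&
	    likely(size >= 768)) {
		do {
			/* ordinary cached copy of one cache line ... */
			memcpy((void *)dest, (void *)source, 64);
			/* ... then flush it out to persistence */
			clflushopt((void *)dest);
			dest += 64;
			source += 64;
			size -= 64;
		} while (size >= 64);
		return;
	}
#endif
	memcpy_flushcache(dest, source, size);
}

The 768-byte threshold was tuned for dm-writecache's access pattern;
dm-pcache may want a different cutoff, so benchmark it.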
Perhaps these are not real bugs (I didn't analyze it thoroughly), but
there are some GFP_NOWAIT and GFP_KERNEL allocations.

GFP_NOWAIT can fail anytime (for example, if the machine receives too
many network packets), so you must handle the error gracefully.
GFP_KERNEL allocations may recurse back into the I/O path through
swapping or file writeback, thus they may cause deadlocks. You can use
GFP_KERNEL in the target constructor or destructor, because no I/O is
being processed at that time, but it shouldn't be used in the I/O
processing path.

I see that when you get ENOMEM, you retry the request in 100ms.
Putting arbitrary waits in the code is generally bad practice - this
won't work if the user is swapping to the dm-pcache device. It may be
possible that there is no memory free, so retrying won't help and it
will deadlock. I suggest using mempools to guarantee forward progress
in out-of-memory situations. A mempool_alloc(GFP_NOIO) will never
return NULL, it will just wait until some other process frees an entry
into the mempool.

Generally, the convention among device mapper targets is that they
have a few fixed parameters first, then a number that says how many
optional parameters follow, and then the optional parameters (either
in "parameter:123" or "parameter 123" format). You should follow this
convention, so that the table line can be easily extended with new
parameters later.

The __packed attribute causes performance degradation on RISC machines
without hardware support for unaligned accesses - the compiler will
generate byte-by-byte accesses. I suggest not using it and instead
making sure that the members in the structures are naturally aligned
(inserting explicit padding if needed).

The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h
is optimized for 4-, 8- and 16-byte accesses (because that's what
dm-writecache uses) - I suggest adding more optimizations to it for
constant sizes that fit the usage pattern of dm-pcache.

I see that you are using both
"queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, 0);"
and
"queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, delay);"
- the problem here is that if the work item is already queued with a
delay and you attempt to queue it again with a zero delay, the new
queue attempt will be ignored - I'm not sure if this is intended
behavior or not.

req_complete_fn: this will never run with interrupts disabled, so you
can replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq.

backing_dev_bio_end: there's a bug in this function - it may be called
both with interrupts disabled and with interrupts enabled, so you
should not use spin_lock/spin_unlock. You should use
spin_lock_irqsave/spin_unlock_irqrestore.

queue_work(BACKING_DEV_TO_PCACHE...) - I would move it inside the
spinlock - see the commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for
a similar problem.

bio_map: bio vectors can hold arbitrarily long entries - if the "base"
variable is not from vmalloc, you can just add it as one bvec entry.

"backing_req->kmem.bvecs = kcalloc": you can use kmalloc_array instead
of kcalloc, there's no need to zero the memory.

> +	if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
> +		return NULL;
> +
> +	udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
> +	goto again;

It is not good practice to insert arbitrary waits (and here the wait
is burning CPU power, which makes it even worse). You should add the
process to a wait queue and wake the queue up. See the functions
writecache_wait_on_freelist and writecache_free_entry for an example
of how to wait correctly.
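Roughly like this - a sketch only, modelled on
writecache_wait_on_freelist()/writecache_free_entry(); the names
(pcache_wait_for_segment, seg_wait, free_segs and the lock) are made
up for illustration, they are not from your code:

/* needs <linux/wait.h>, <linux/sched.h>, <linux/spinlock.h> */

/* called with cache->lock held; sleeps until a segment is freed */
static void pcache_wait_for_segment(struct pcache_cache *cache)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&cache->seg_wait, &wait, TASK_UNINTERRUPTIBLE);
	spin_unlock_irq(&cache->lock);
	io_schedule();			/* sleep instead of burning CPU */
	finish_wait(&cache->seg_wait, &wait);
	spin_lock_irq(&cache->lock);
}

/* called with cache->lock held; the freeing side wakes the waiters
 * instead of letting them poll */
static void pcache_segment_freed(struct pcache_cache *cache)
{
	cache->free_segs++;
	if (waitqueue_active(&cache->seg_wait))
		wake_up(&cache->seg_wait);
}

The allocation path then loops - take the lock, and while there is no
free segment, call pcache_wait_for_segment() and re-check - instead of
udelay() + goto again.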
> +static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
> +{
> +	struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
> +	struct dm_pcache *pcache = ti->private;
> +	int ret;
> +
> +	pcache_req->pcache = pcache;
> +	kref_init(&pcache_req->ref);
> +	pcache_req->ret = 0;
> +	pcache_req->bio = bio;
> +	pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
> +	pcache_req->data_len = bio->bi_iter.bi_size;
> +	INIT_LIST_HEAD(&pcache_req->list_node);
> +	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);

This looks suspicious, because you store the original bi_sector in
pcache_req->off and only afterwards subtract the target offset from
bi_sector. Shouldn't
"bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
come before
"pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;"?

Generally, the code doesn't seem bad. After reworking the
out-of-memory handling and replacing the arbitrary waits with wait
queues, I can merge it.

Mikulas