Hi Mikulas:
I will send dm-pcache V1 soon; below are my responses to your comments.
On 6/13/2025 12:57 AM, Mikulas Patocka wrote:
Hi
On Thu, 5 Jun 2025, Dongsheng Yang wrote:
Hi Mikulas and all,
This is *RFC v2* of the *pcache* series, a persistent-memory backed cache.
Compared with *RFC v1*
<https://lore.kernel.org/lkml/20250414014505.20477-1-dongsheng.y...@linux.dev/>
the most important change is that the whole cache has been *ported to
the Device-Mapper framework* and is now exposed as a regular DM target.
Code:
https://github.com/DataTravelGuide/linux/tree/dm-pcache
Full RFC v2 test results:
https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/results.html
All 962 xfstests cases passed successfully under four different
pcache configurations.
One of the detailed xfstests run:
https://datatravelguide.github.io/dtg-blog/pcache/pcache_rfc_v2_result/test-results/02-._pcache.py_PcacheTest.test_run-crc-enable-gc-gc0-test_script-xfstests-a515/debug.log
Below is a quick tour through the three layers of the implementation,
followed by an example invocation.
----------------------------------------------------------------------
1. pmem access layer
----------------------------------------------------------------------
* All reads use *copy_mc_to_kernel()* so that uncorrectable media
errors are detected and reported.
* All writes go through *memcpy_flushcache()* to guarantee durability
on real persistent memory.
You could also try to use normal write and clflushopt for big writes - I
found out that for larger regions it is better - see the function
memcpy_flushcache_optimized in dm-writecache. Test, which way is better.
I did a test with fio on /dev/pmem0, with an attached patch on nd_pmem.ko:
when I use memmap pmem device, I got a similar result with the comment
in memcpy_flushcache_optimized():
Test (memmap pmem)      clflushopt    flushcache
-------------------------------------------------
test_randwrite_512       200 MiB/s     228 MiB/s
test_randwrite_1024      378 MiB/s     431 MiB/s
test_randwrite_2K        773 MiB/s     769 MiB/s
test_randwrite_4K       1364 MiB/s    1272 MiB/s
test_randwrite_8K       2078 MiB/s    1817 MiB/s
test_randwrite_16K      2745 MiB/s    2098 MiB/s
test_randwrite_32K      3232 MiB/s    2231 MiB/s
test_randwrite_64K      3660 MiB/s    2411 MiB/s
test_randwrite_128K     3922 MiB/s    2513 MiB/s
test_randwrite_1M       3824 MiB/s    2537 MiB/s
test_write_512           228 MiB/s     228 MiB/s
test_write_1024          439 MiB/s     423 MiB/s
test_write_2K            841 MiB/s     800 MiB/s
test_write_4K           1364 MiB/s    1308 MiB/s
test_write_8K           2107 MiB/s    1838 MiB/s
test_write_16K          2752 MiB/s    2166 MiB/s
test_write_32K          3213 MiB/s    2247 MiB/s
test_write_64K          3661 MiB/s    2415 MiB/s
test_write_128K         3902 MiB/s    2514 MiB/s
test_write_1M           3808 MiB/s    2529 MiB/s
But I got a different result when I use Optane pmem100:
Test (Optane pmem100)   clflushopt    flushcache
-------------------------------------------------
test_randwrite_512       167 MiB/s     226 MiB/s
test_randwrite_1024      301 MiB/s     420 MiB/s
test_randwrite_2K        615 MiB/s     639 MiB/s
test_randwrite_4K        967 MiB/s    1024 MiB/s
test_randwrite_8K       1047 MiB/s    1314 MiB/s
test_randwrite_16K      1096 MiB/s    1377 MiB/s
test_randwrite_32K      1155 MiB/s    1382 MiB/s
test_randwrite_64K      1184 MiB/s    1452 MiB/s
test_randwrite_128K     1199 MiB/s    1488 MiB/s
test_randwrite_1M       1178 MiB/s    1499 MiB/s
test_write_512           233 MiB/s     233 MiB/s
test_write_1024          424 MiB/s     391 MiB/s
test_write_2K            706 MiB/s     760 MiB/s
test_write_4K            978 MiB/s    1076 MiB/s
test_write_8K           1059 MiB/s    1296 MiB/s
test_write_16K          1119 MiB/s    1380 MiB/s
test_write_32K          1158 MiB/s    1387 MiB/s
test_write_64K          1184 MiB/s    1448 MiB/s
test_write_128K         1198 MiB/s    1481 MiB/s
test_write_1M           1178 MiB/s    1486 MiB/s
So for now I'd rather keep using flushcache in pcache. In the future, once
we've come up with a general-purpose optimization, we can switch to that.
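For reference, the pmem access layer described in section 1 boils down to two
small helpers; the following is a simplified sketch (the function names are
illustrative, not the actual dm-pcache symbols):

#include <linux/uaccess.h>	/* copy_mc_to_kernel() */
#include <linux/string.h>	/* memcpy_flushcache() */
#include <linux/errno.h>

/* Illustrative sketch of the pmem access helpers. */
static int pcache_pmem_read(void *dst, const void *pmem_src, size_t len)
{
	/*
	 * copy_mc_to_kernel() returns the number of bytes NOT copied when
	 * an uncorrectable media error is hit, so any non-zero result is
	 * reported as -EIO.
	 */
	if (copy_mc_to_kernel(dst, pmem_src, len))
		return -EIO;
	return 0;
}

static void pcache_pmem_write(void *pmem_dst, const void *src, size_t len)
{
	/* flushed stores so the data is durable on real persistent memory */
	memcpy_flushcache(pmem_dst, src, len);
}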
----------------------------------------------------------------------
2. cache-logic layer (segments / keys / workers)
----------------------------------------------------------------------
Main features
- 16 MiB pmem segments, log-structured allocation.
- Multi-subtree RB-tree index for high parallelism.
- Optional per-entry *CRC32* on cached data.
Would it be better to use crc32c because it has hardware support in the
SSE4.2 instruction set?
Good suggestion, thanks. A crc32c sketch follows the limitations list below.
- Background *write-back* worker and watermark-driven *GC*.
- Crash-safe replay: key-sets are scanned from *key_tail* on start-up.
Current limitations
- Only *write-back* mode implemented.
- Only FIFO cache invalidation; others (LRU, ARC, ...) are planned.
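As mentioned above, the switch to crc32c would look roughly like this sketch
(the struct and field names are illustrative, not the real pcache layout):

#include <linux/crc32c.h>	/* crc32c(): uses the SSE4.2 instruction where available */
#include <linux/types.h>

/* Illustrative key layout, for the sketch only. */
struct cache_key_sketch {
	void	*data;
	u32	data_len;
	u32	data_crc;
};

static void cache_key_calc_crc(struct cache_key_sketch *key)
{
	/* the seed is arbitrary; it just has to stay stable on-media */
	key->data_crc = crc32c(0, key->data, key->data_len);
}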
----------------------------------------------------------------------
3. dm-pcache target integration
----------------------------------------------------------------------
* Table line
`pcache <pmem_dev> <origin_dev> writeback <true|false>`
* Features advertised to DM:
- `ti->flush_supported = true`, so *PREFLUSH* and *FUA* are honoured
(they force all open key-sets to close and data to be durable).
* Not yet supported:
- Discard / TRIM.
- dynamic `dmsetup reload`.
If you don't support it, you should at least try to detect that the user
did reload and return error - so that there won't be data corruption in
this case.
Yes, actually it does return -EOPNOTSUPP:
static int dm_pcache_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	struct mapped_device *md = ti->table->md;
	struct dm_pcache *pcache;
	int ret;

	if (md->map) {
		ti->error = "Don't support table loading for live md";
		return -EOPNOTSUPP;
	}
But it would be better to support table reload. You can support it by a
similar mechanism to "__handover_exceptions" in the dm-snap.c driver.
Of course, that's planned, but we don't think it's necessary in the initial
version.
Runtime controls
- `dmsetup message <dev> 0 gc_percent <0-90>` adjusts the GC trigger.
Status line reports super-block flags, segment counts, GC threshold and
the three tail/head pointers (see the RST document for details).
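(For illustration, the gc_percent message handler is roughly the sketch below;
the field placement and locking are simplified and not the real dm-pcache
code.)

static int dm_pcache_message(struct dm_target *ti, unsigned int argc,
			     char **argv, char *result, unsigned int maxlen)
{
	struct dm_pcache *pcache = ti->private;
	u32 val;

	/* expects exactly: gc_percent <0-90> */
	if (argc != 2 || strcmp(argv[0], "gc_percent"))
		return -EINVAL;

	if (kstrtou32(argv[1], 10, &val) || val > 90)
		return -EINVAL;

	pcache->cache.gc_percent = val;	/* assumed field location */
	return 0;
}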
Perhaps these are not real bugs (I didn't analyze it thoroughly), but
there are some GFP_NOWAIT and GFP_KERNEL allocations.
GFP_NOWAIT can fail anytime (for example, if the machine receives too many
network packets), so you must handle the error gracefully.
GFP_KERNEL allocation may recurse back into the I/O path through swapping
or file writeback, thus they may cause deadlocks. You should use
GFP_KERNEL in the target constructor or destructor because there is no I/O
to be processed in this time, but they shouldn't be used in the I/O
processing path.
**Yes**, I'm using the approach you described above.
- **GFP_KERNEL** is only used on the constructor path.
- **GFP_NOWAIT** is used in two cases: one for allocating `cache_key`,
and another for allocating `backing_dev_req` on a cache miss.
Both of these NOWAIT allocations happen while traversing the
`cache_subtree` under the `subtree_lock` spinlock—i.e., in contexts
where scheduling isn't allowed. That’s why we use `GFP_NOWAIT`.
I see that when you get ENOMEM, you retry the request in 100ms - putting
arbitrary waits in the code is generally bad practice - this won't work if
the user is swapping to the dm-pcache device. It may be possible that
there is no memory free, thus retrying won't help and it will deadlock.
In **V1**, I removed the retry-on-ENOMEM mechanism to make pcache more
coherent. Now it only retries when the cache is actually full.
- When the cache is full, the `pcache_req` is added to a `defer_list`.
- When cache invalidation occurs, we kick off all deferred requests to
retry.
This eliminates arbitrary waits.
I suggest to use mempools to guarantee forward progress in out-of-memory
situation. A mempool_alloc(GFP_IO) will never return NULL, it will just
wait until some other process frees some entry into the mempool.
Both `cache_key` and `backing_dev_req` now use **mempools**, but—as
explained—they may still be allocated under a spinlock, so they use
**GFP_NOWAIT**.
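Concretely, the allocation pattern looks roughly like the sketch below (type
and field names are illustrative): the mempool is pre-sized at constructor
time, and under `subtree_lock` we allocate with GFP_NOWAIT and let the caller
defer on failure rather than sleep.

#include <linux/mempool.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

/* Illustrative types; the real dm-pcache structures differ. */
struct cache_key_sketch {
	struct rb_node	rb_node;
	u64		off;
	u32		len;
};

struct cache_subtree_sketch {
	spinlock_t	subtree_lock;
	struct rb_root	root;
	mempool_t	key_pool;	/* e.g. mempool_init_slab_pool() at ctr time */
};

/*
 * Called with subtree_lock held, so sleeping is not allowed.  With
 * GFP_NOWAIT a mempool_alloc() can still return NULL; the caller then
 * defers the pcache_req instead of waiting.
 */
static struct cache_key_sketch *cache_key_alloc(struct cache_subtree_sketch *st)
{
	return mempool_alloc(&st->key_pool, GFP_NOWAIT);
}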
Generally, a convention among device mapper targets is that they have a few
fixed parameters first, then a count of optional parameters, and then the
optional parameters themselves (either in "parameter:123" or
"parameter 123" format). You should follow this convention, so that it can
be easily extended with new parameters later.
Thanks, that's a good suggestion.
In **V1**, we changed the parameter format to:
pcache <cache_dev> <backing_dev> [<number_of_optional_arguments>
<cache_mode writeback> <data_crc true|false>]
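The optional arguments are parsed with the usual DM helpers; a rough sketch
(option handling simplified, function name illustrative):

#include <linux/device-mapper.h>

static int pcache_parse_opt_args(struct dm_target *ti, struct dm_arg_set *as,
				 bool *writeback, bool *data_crc)
{
	static const struct dm_arg _args[] = {
		{ 0, 4, "invalid number of optional arguments" },
	};
	unsigned int num_opt;
	int ret;

	if (!as->argc)
		return 0;	/* no optional arguments given */

	/* reads and validates <number_of_optional_arguments> */
	ret = dm_read_arg_group(_args, as, &num_opt, &ti->error);
	if (ret)
		return ret;

	while (num_opt--) {
		const char *key = dm_shift_arg(as);

		if (!strcasecmp(key, "cache_mode") && num_opt) {
			*writeback = !strcasecmp(dm_shift_arg(as), "writeback");
			num_opt--;
		} else if (!strcasecmp(key, "data_crc") && num_opt) {
			*data_crc = !strcasecmp(dm_shift_arg(as), "true");
			num_opt--;
		} else {
			ti->error = "Unrecognised optional argument";
			return -EINVAL;
		}
	}
	return 0;
}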
The __packed attribute causes performance degradation on RISC machines
without hardware support for unaligned accesses - the compiler will
generate byte-by-byte accesses - I suggest not using it and instead making
sure that the members in the structures are naturally aligned (inserting
explicit padding if needed).
Thanks — we removed `__packed` in V1.
The function "memcpy_flushcache" in arch/x86/include/asm/string_64.h is
optimized for 4, 8 and 16-byte accesess (because that's what dm-writecache
uses) - I suggest to add more optimizations to it for constant sizes that
fit the usage pattern of dm-pcache,
Thanks again — as mentioned above, we plan to optimize later, possibly
with a more generic solution.
I see that you are using "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, 0);" and "queue_delayed_work(cache_get_wq(cache),
&cache->writeback_work, delay);" - the problem here is that if the entry
is already queued with a delay and you attempt to queue it again with zero
delay, this new queue attempt will be ignored - I'm not sure if this is
intended behavior or not.
Yes, in simple terms:
1. If after writeback completes the cache is clean, we should delay.
2. If it's not clean, set delay = 0 so the next writeback cycle can
begin immediately.
In theory these are two separate branches, even if a zero delay might be
ignored. Since writeback work isn't highly time-sensitive, I believe
this is harmless.
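For clarity, the end of the writeback worker looks roughly like this
(cache_is_clean() and the interval constant are assumed names here):

	/* end of one writeback cycle */
	if (cache_is_clean(cache))
		/* nothing dirty left: check again after the writeback interval */
		queue_delayed_work(cache_get_wq(cache), &cache->writeback_work,
				   PCACHE_CACHE_WRITEBACK_INTERVAL);
	else
		/* still dirty: ask for the next cycle immediately (delay = 0) */
		queue_delayed_work(cache_get_wq(cache), &cache->writeback_work, 0);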
req_complete_fn: this will never run with interrupts disabled, so you can
replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq.
Thanks, fixed in V1.
backing_dev_bio_end: there's a bug in this function - it may be called
both with interrupts disabled and interrupts enabled, so you should not
use spin_lock/spin_unlock. You should use
spin_lock_irqsave/spin_unlock_irqrestore instead.
Good catch, it's fixed in V1.
queue_work(BACKING_DEV_TO_PCACHE - I would move it inside the spinlock -
see commit 829451beaed6165eb11d7a9fb4e28eb17f489980 for a similar
problem.
Thanks, fixed in V1.
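A sketch of the combined V1 fix for both points above (struct and field names
are illustrative): the irqsave variant is used because the end_io callback can
run with interrupts either enabled or disabled, and the queue_work() call now
sits inside the lock.

static void backing_dev_bio_end(struct bio *bio)
{
	struct pcache_backing_dev_req *backing_req = bio->bi_private;
	struct pcache_backing_dev *backing_dev = backing_req->backing_dev;
	unsigned long flags;

	spin_lock_irqsave(&backing_dev->lock, flags);
	backing_req->ret = blk_status_to_errno(bio->bi_status);
	list_add_tail(&backing_req->node, &backing_dev->complete_list);
	/* queued while still holding the lock */
	queue_work(backing_dev->task_wq, &backing_dev->complete_work);
	spin_unlock_irqrestore(&backing_dev->lock, flags);
}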
bio_map - bio vectors can hold arbitrarily long entries - if the "base"
variable is not from vmalloc, you can just add it as one bvec entry.
Good idea — in V1 we introduced `inline_bvecs`, and when `base` isn’t
vmalloc, we use a single bvec entry.
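The idea behind the single-bvec path, roughly (function name illustrative):
a kmalloc'ed buffer is physically contiguous, so one bio_vec can describe the
whole range, while vmalloc memory still has to be mapped page by page.

#include <linux/bio.h>
#include <linux/mm.h>

static bool fill_inline_bvec(struct bio_vec *bv, void *base, unsigned int len)
{
	if (is_vmalloc_addr(base))
		return false;	/* fall back to per-page mapping via vmalloc_to_page() */

	/* one bvec entry covers the whole physically contiguous buffer */
	bv->bv_page   = virt_to_page(base);
	bv->bv_offset = offset_in_page(base);
	bv->bv_len    = len;
	return true;
}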
"backing_req->kmem.bvecs = kcalloc" - you can use kmalloc_array instead of
kcalloc, there's no need to zero the value.
Fixed in V1
+ if (++wait_count >= PCACHE_WAIT_NEW_CACHE_COUNT)
+ return NULL;
+
+ udelay(PCACHE_WAIT_NEW_CACHE_INTERVAL);
+ goto again;
It is not good practice to insert arbitrary waits (here, the wait is
burning CPU power, which makes it even worse). You should add the process
to a wait queue and wake up the queue.
See the functions writecache_wait_on_freelist and writecache_free_entry
for an example, how to wait correctly.
`get_cache_segment()` may be called under a spinlock, so I originally
designed a short busy-wait for a free segment: once wait_count reached
PCACHE_WAIT_NEW_CACHE_COUNT, it gave up so that dm-pcache's defer_list
could retry the request.
However, I later found that when the cache is full, that brief busy-wait
rarely gives enough time for a segment to be freed. Therefore, in V1 I
removed the busy-wait and instead return `-EBUSY` immediately, allowing
dm-pcache's retry mechanism to wait for a cache invalidation before
retrying.
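So the V1 flow is roughly the following sketch (names are illustrative): no
waiting under the spinlock at all; -EBUSY propagates up, the request is parked
on defer_list, and a later cache invalidation requeues it.

static int cache_data_alloc(struct pcache_cache *cache,
			    struct pcache_cache_segment **ret_seg)
{
	struct pcache_cache_segment *seg;

	seg = get_cache_segment(cache);	/* NULL when the cache is full */
	if (!seg)
		return -EBUSY;		/* caller parks the pcache_req on defer_list */

	*ret_seg = seg;
	return 0;
}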
+static int dm_pcache_map_bio(struct dm_target *ti, struct bio *bio)
+{
+	struct pcache_request *pcache_req = dm_per_bio_data(bio, sizeof(struct pcache_request));
+	struct dm_pcache *pcache = ti->private;
+	int ret;
+
+	pcache_req->pcache = pcache;
+	kref_init(&pcache_req->ref);
+	pcache_req->ret = 0;
+	pcache_req->bio = bio;
+	pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	pcache_req->data_len = bio->bi_iter.bi_size;
+	INIT_LIST_HEAD(&pcache_req->list_node);
+	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
This looks suspicious because you store the original bi_sector to
pcache_req->off and then subtract the target offset from it. Shouldn't
"bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);"
be before "pcache_req->off = (u64)bio->bi_iter.bi_sector <<
SECTOR_SHIFT;"?
Yes, that logic is indeed questionable, but it works in testing.
Since we define dm-pcache as a **singleton**, both behaviors should
effectively be equivalent, IIUC. Also, in V1 I moved the call to
`dm_target_offset()` so it runs before setting up `pcache_req->off`,
making the code logic correct.
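Concretely, the V1 ordering is (sketch based on the snippet above):

	bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
	pcache_req->off = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
	pcache_req->data_len = bio->bi_iter.bi_size;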
Thanks,
Dongsheng
Generally, the code doesn't seem bad. After reworking the out-of-memory
handling and replacing arbitrary waits with wait queues, I can merge it.
Mikulas
From 88476f95801f7177c3a8c30c663cdd788f73a3b5 Mon Sep 17 00:00:00 2001
From: Dongsheng Yang <dongsheng.y...@linux.dev>
Date: Mon, 23 Jun 2025 10:10:37 +0800
Subject: [PATCH] memcpy_flushcache_optimized on nd_pmem.ko
Signed-off-by: Dongsheng Yang <dongsheng.y...@linux.dev>
---
drivers/nvdimm/pmem.c | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index aa50006b7616..78c2e8481630 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -122,6 +122,42 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	return BLK_STS_OK;
 }
 
+static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
+{
+	/*
+	 * clflushopt performs better with block size 1024, 2048, 4096
+	 * non-temporal stores perform better with block size 512
+	 *
+	 * block size	512	1024	2048	4096
+	 * movnti	496 MB/s  642 MB/s  725 MB/s  744 MB/s
+	 * clflushopt	373 MB/s  688 MB/s  1.1 GB/s  1.2 GB/s
+	 *
+	 * We see that movnti performs better for 512-byte blocks, and
+	 * clflushopt performs better for 1024-byte and larger blocks. So, we
+	 * prefer clflushopt for sizes >= 768.
+	 *
+	 * NOTE: this happens to be the case now (with dm-writecache's single
+	 * threaded model) but re-evaluate this once memcpy_flushcache() is
+	 * enabled to use movdir64b which might invalidate this performance
+	 * advantage seen with cache-allocating-writes plus flushing.
+	 */
+#ifdef CONFIG_X86
+	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
+	    likely(boot_cpu_data.x86_clflush_size == 64) &&
+	    likely(size >= 0)) {
+		do {
+			memcpy((void *)dest, (void *)source, 64);
+			clflushopt((void *)dest);
+			dest += 64;
+			source += 64;
+			size -= 64;
+		} while (size >= 64);
+		return;
+	}
+#endif
+	memcpy_flushcache(dest, source, size);
+}
+
 static void write_pmem(void *pmem_addr, struct page *page,
 		unsigned int off, unsigned int len)
 {
@@ -131,7 +167,8 @@ static void write_pmem(void *pmem_addr, struct page *page,
 	while (len) {
 		mem = kmap_atomic(page);
 		chunk = min_t(unsigned int, len, PAGE_SIZE - off);
-		memcpy_flushcache(pmem_addr, mem + off, chunk);
+		//memcpy_flushcache(pmem_addr, mem + off, chunk);
+		memcpy_flushcache_optimized(pmem_addr, mem + off, chunk);
 		kunmap_atomic(mem);
 		len -= chunk;
 		off = 0;
--
2.43.0