Hi Hector,
Responses inline below.
On 6/24/25 10:11 PM, Hector Martin wrote:
Hi Mark,
Thanks a lot for the pointers and info, it's really helpful.
Glad to help, and thanks for looking into it. If we can figure out how
to disable bluefs_buffered_io without repercussions, I think it would be
a decent win overall.
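For anyone following along, the knob itself is just the usual config
dance (osd.0 is only an example target; double-check whether your
release picks the change up at runtime or wants an OSD restart):

  # check the configured value and what a specific daemon currently sees
  ceph config get osd bluefs_buffered_io
  ceph config show osd.0 bluefs_buffered_io
  # disable it for all OSDs to test the tradeoff
  ceph config set osd bluefs_buffered_io false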
Since the issue is happening in a live cluster (which is a homelab I can
screw around with to an extent, but not take down for very long periods
of time or lose data in), and since I don't have a lot of spare hours in
the coming weeks, I think this will probably be an investigation on the
longer side of things. So don't expect fast follow-ups, but rest assured
this is now on my radar and I'll be poking at it and seeing what I can
figure out.
Re the workload between snapshots, that's a bit hard to pin down right
now since I did some shuffling around and the snapshot deletions were
from different subtrees. At the time I saw this I had just wiped all my
snapshots in the cluster, ~2 weeks worth of daily snapshots in the
largest subtree, adding up to just over a terabyte of raw cluster
storage being freed going by stats graphs (5+4 EC, 8MB object CephFS,
probably mostly large files and full-size objects). In fact I just took
out the vast majority of that data from the tree that gets snapshotted
(that really shouldn't have been there); if it turns out I can't repro
after that I'm happy to revert it back though (but it would take another
2 weeks to recreate the exact scenario, if it comes to that).
A couple of potential hints here: First, 5+4 erasure coding. That means
you have 9 chunks per object, each of which will have its own metadata
in RocksDB. The larger 8MB CephFS objects may help offset this to some
extent, but EC does tend to have higher overhead, especially if you
have a lot of small partial writes and snapshots. The second hint is
having just wiped away all of your snapshots when you first saw the
issue. That would trigger the creation of lots of tombstones in rocksdb
that would only get cleaned up with subsequent compaction. This is just
conjecture, but I wonder if you could have ended up with a bunch of
tombstones in L0 that had to be read and iterated over if those SST
files were evicted from the block cache regularly before a full
compaction across the other levels took place.
I also just deleted a legacy pool, which at the time was involved in the
snaptrim but had only negligible data actually being snapshotted/deleted
as part of it. Again, if it comes to that to make things repro, I don't
mind recreating it and trying to force the same scenario.
Same as above, just more tombstone insertion into RocksDB until
compaction takes place.
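If tombstones are the culprit, one cheap experiment (just a suggestion,
not something we've verified fixes this) is to force a manual compaction
on an affected OSD right after a big snapshot wipe and see whether the
thrashing stops; osd.3 is a placeholder:

  # trigger a manual RocksDB compaction on one OSD (adds load, can take a while)
  ceph tell osd.3 compact
  # or via the admin socket on the host running that OSD
  ceph daemon osd.3 compact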
FWIW, the setup here is 3 nodes:
- M1 Mac Mini (4p4e), 16GB RAM, OSDs + Mon
- M2 Mac Mini (4p4e), 24GB RAM, OSDs + Mon + MDS + monitoring (prom)
- M2 Pro Mac Mini (6p4e), 32GB RAM, OSDs + Mon + MDS + a bunch of other
stuff
Each node has 4x 16TB HDDs and 1x 1TB NVMe SSD, attached via USB (10G,
except 5G for the HDDs on one node because reasons). The HDD block.db
volumes are on internal Mac NVMe storage (recent change, just before the
snapshot story; yes, I did force migrate the data over so there should
be no DB files left on the HDDs). SSDs don't have separate block.db.
10GbE networking for all nodes. I have no reason to suspect any
hardware/driver related issues relevant to the snaptrim story (I've had
a few of those but I keep an eye out for it, and the current setup
should be stable).
No real comment on the HW other than that it is what it is. The
behaviors are the interesting bit here imho!
One issue is that this may be hard to repro on demand. If I'm lucky I
can just instantly create and delete a pile of snapshots (without many
changes in between) and repro, but I might not be lucky. I'd usually
debug this by increasing logging verbosity and/or adding my own debug
logs, but if this ends up requiring days or weeks of real snapshots to
repro, that's going to get old fast. My best idea right now if it comes
to that is to attach gdb to an OSD currently experiencing the issue, let
the cluster mark it down in the meantime, and hope I can step through
the code and find out what the thrashing situation is before it notices
it got marked down and starts doing something else.
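If you do go the gdb route, it's probably worth setting noout first so
the cluster doesn't start backfilling while the OSD sits in the
debugger; a rough sketch, with the OSD id and pid lookup as placeholders
(adjust for your deployment, e.g. if the OSDs run in containers):

  ceph osd set noout                      # avoid rebalancing while the OSD is stopped
  gdb -p $(pgrep -f 'ceph-osd.*--id 3')   # attach to the OSD that is thrashing
  # ... poke around, detach ...
  ceph osd unset noout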
If you'd like, you can try my wallclock profiler. This was inspired by
poor man's profiling via gdb. Just be aware that the libdw backend is
kind of buggy and sometimes the callgraph ends up with misplaced bits.
Libunwind is slower but more reliable:
https://github.com/markhpc/uwpmp
Just attach it to one of your OSD processes and you'll get an ascii
callgraph of where time is being spent.
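Usage is roughly the following (flag spellings from memory, so check the
repo README; -p is the target pid and -n the number of samples):

  git clone https://github.com/markhpc/uwpmp.git
  # build per the README, then point it at the busy OSD process
  ./unwindpmp -n 1000 -p <pid of ceph-osd> > osd_callgraph.txt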
As I mentioned on Slack, one curious thing about my case is that it
seems to have been doing reads of around 100MB from BlueFS (probably
entire sst files?), but they *were* being satisfied from page cache, and
despite that the CPU time was almost all spent copying the data in RAM
in the kernel. That suggests RocksDB wasn't actually *doing* much of
anything with the data (other than checksumming it, the only other place
I saw CPU usage), not even parsing through it once. It was just reading
it and throwing it away repeatedly, or at least that's what it looks
like to me. So I hope higher debug + gdb will point at where this
behavior comes from, and whether it even makes any sense or it's just a
bug. Same with the block cache not working as intended.
This is consistent with the behavior we've observed during prefetching,
though usually a small amount of the data was used while the rest was
thrown away. We hypothesized at the time that perhaps RocksDB is
relying on some characteristic of the posix backend that hides this
behavior and that it's being exposed via BlueFS in BlueStore, but I
don't think we ever came to any conclusion there.
I do want to ask whether there are any debug/scaffolding tools to
possibly try to reproduce the RocksDB workload. I can't easily bring up
a full OSD outside the cluster with all the data (it's not impossible,
but a 16TB HDD is kind of a pain to copy around), but I do have block.db
in a separate volume. If there is some way to bring up that BlueFS
stand-alone, and then attempt to recreate the RocksDB workload (e.g. by
logging it from the real OSD), that might lead to a really handy
reproducer. I don't mind writing my own code, just want to know if
there's a good starting place for that or any existing tooling I could
jumpstart from.
I don't think there's anything quite like what you want, though Igor or
Adam might have better advice. ceph-bluestore-tool is probably the
closest. You can however test a ceph objectstore directly via fio:
https://github.com/ceph/ceph/blob/main/src/test/fio/README.md
It might be tough to replicate the kind of behavior you are observing
with CephFS snapshots using fio alone though. It's still an open
question whether the snapshots themselves really matter, or if it's just
a matter of throwing a bunch of metadata at RocksDB on top of BlueFS and
then deleting/fetching it aggressively.
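One possible starting point: ceph-bluestore-tool's bluefs-export can
dump the BlueFS contents (i.e. the RocksDB directory backed by your
block.db) out to a plain filesystem directory, and then the stock
RocksDB tooling (ldb, sst_dump) can be pointed at it. The OSD has to be
stopped first, and the paths below are placeholders:

  # export BlueFS (RocksDB files) from a stopped OSD to a regular directory
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-3 --out-dir /tmp/osd3-db
  ls -R /tmp/osd3-db
  # the fio objectstore engine from the README above can drive bluestore directly
  fio --enghelp=libfio_ceph_objectstore.so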
Mark
Cheers,
- Hector
On 2025/06/25 7:48, Mark Nelson wrote:
Hi Hector,
Just as a follow-up,
Here are the comments I mentioned in the ceph slack channel from the PR
where we re-enabled bluefs_buffered_io. I recorded these notes back
when aclamk and I were digging into the RocksDB code to see if there was
anything we could do to improve the situation. They may be valuable if
you want to dig into the RocksDB side of what is going on.
https://github.com/ceph/ceph/pull/38044#issuecomment-789192720
https://github.com/ceph/ceph/pull/38044#issuecomment-790157415
In the end we never solved the problem afaik. We still rely on
bluefs_buffered_io and the Linux page cache to hide excessive disk
reads (especially the repeated prefetch reads mentioned in the PR!) when
the block cache misses. In your case, with very limited memory and
potentially limited page cache, we are likely hitting RocksDB
exceptionally hard for metadata lookups and there's no guarantee that
the page cache itself isn't thrashing as well when the block cache
misses. If that's the case, the bluefs_buffered_io "fix" would not
be able to mitigate the repeated prefetch reads we observed. That may
be related to what you are observing in your setup as well.
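If you want to sanity-check the page-cache-thrashing theory on your
nodes, watching paging/page cache behavior while the problem is
happening is cheap; tool availability and paths vary by distro:

  # coarse paging statistics at 1s intervals
  sar -B 1
  # if bcc-tools is installed, cachestat shows page cache hits/misses per second
  /usr/share/bcc/tools/cachestat 1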
Mark
On 6/24/25 08:06, Mark Nelson wrote:
Hi Hector,
Sorry I'm a bit late to the party on this one. I wrote the OSD memory
autotuning code and am probably one of the most recent people to
really dig in and refactor bluestore's caches. I'll respond inline
below.
On 6/22/25 05:51, Hector Martin wrote:
Hi all,
I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
usage and cluster performance plummets. Most of the CPU usage was
accounted as "system", while actual disk I/O usage was low, so that
didn't sound right.
I perf traced the system, and found that most of the usage is in
__arch_copy_to_user in the kernel, in read() syscalls. That sounds like
the OSD is thrashing buffered block device reads which the kernel
satisfies from the page cache (hence no real I/O load, most of the CPU
usage is from data copies), so it's repeatedly reading the same disk
blocks, which doesn't sound right.
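(Side note: a trace like this can be captured with something along the
lines of the following; use -p <osd pid> instead of -a to target a
single OSD.)

  # sample all CPUs with call graphs for 30 seconds, then view the report
  perf record -a -g -- sleep 30
  perf report --stdio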
I suspect that what you are seeing here are RocksDB reads, since we
allow those to come from page cache. We have a long-standing issue
where RocksDB repeatedly re-reads the same blocks for readahead when
data isn't in the block cache, and buffered IO is the crutch that
basically allows it to perform well. Tyler Stachecki mentioned the
bluefs_buffered_io setting, and that is indeed directly related to this
issue and why we've flip-flopped on whether to leave it enabled.
This is a bit of a distraction imho though, see below.
I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
config`, and CPU usage immediately dropped to near zero. However, after
waiting a bit the CPU thrashing eventually returns as memory usage
increases. Restarting an OSD has a similar effect. I believe that
something is wrong with the OSD bluestore cache allocation/flush policy,
and when the cache becomes full it starts thrashing reads instead of
evicting colder cached data (or perhaps some cache bucket is starving
another cache bucket of space).
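(Side note: the live change is just the standard config knob, value in
bytes; osd.3 below is only an example for checking what a daemon
actually picked up.)

  ceph config set osd osd_memory_target 2400000000
  ceph config show osd.3 osd_memory_target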
In bluestore itself we have two primary caches: The onode cache and
the buffer cache. Onode cache stores the metadata about objects while
buffer cache stores object data. A cache miss in either of these will
result in data being read, but onode cache misses are far more
expensive because they require extra work to go fetch the data from
RocksDB (even if the data is stored in RocksDB's block cache!). They may
even result in extra reads of the SST files if the data can't easily
be found in early RocksDB levels.
To me, what you are seeing sounds like severe onode cache misses,
potentially combined with additional read throughput overhead due to
cache thrashing during compaction if snaptrimming is resulting in a
lot of writes or deletes (combined with listings).
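If you want to see how memory is actually being carved up between those
caches at runtime, the mempool dump on the admin socket is the quickest
view (pool names vary a bit between releases), and there are autotuning
knobs you can inspect or experiment with:

  # per-pool memory usage, including the bluestore onode/data caches
  ceph daemon osd.3 dump_mempools
  # cache autotuning and manual ratio knobs
  ceph config get osd bluestore_cache_autotune
  ceph config get osd bluestore_cache_meta_ratio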
I would appreciate some hints on how to debug this. Are there any cache
stats I should be looking at, or info on how the cache is partitioned?
The onode cache hit and miss rates in the OSD performance counters are
a good place to start. As Anthony mentioned, things tend not to work
very well once you drop below ~1.5-2GB of memory. There are some tricks
you can play that may help (reducing various buffer sizes, pglog
lengths, RocksDB WAL buffer sizes and counts, etc). The honest truth
though is that Ceph is incredibly sensitive to onode cache misses,
especially if the backend storage is not very fast.
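Concretely, the counters and knobs in question look roughly like this
(counter names can differ slightly by release, so treat the grep as
approximate):

  # onode hit/miss counters live in the bluestore section of the perf dump
  ceph daemon osd.3 perf dump | grep -i onode
  # a few of the knobs mentioned above
  ceph config get osd osd_min_pg_log_entries
  ceph config get osd osd_max_pg_log_entries
  ceph config get osd bluestore_rocksdb_options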
Here is a perf call trace of the thrashing:
- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
- 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
- OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
- 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
- PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
- ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
- 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
- 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
- BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
- 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
- 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
- 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
- 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
- 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
- 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
- rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
- 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
- 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
- rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
- 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
- rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
- 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
- rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
- rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
- rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
- 88.25% rocksdb::BlockFetcher::ReadBlockContents()
[ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
- 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
[ note: at this level the call trace splits into 4, but it all leads to the same place ]
- 35.99% 0xaaaac0992e40 BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
- KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
- 35.98% __libc_pread
- el0_svc
- invoke_syscall
- 35.97% __arm64_sys_pread64
- 35.96% vfs_read
- blkdev_read_iter
- 35.93% filemap_read
- 35.28% copy_page_to_iter
35.01% __arch_copy_to_user
Note how much of the perf call trace is spent in RocksDB fetching
blocks as a result of trying to load BlueStore shared_blobs via
onodes. Presumably your onode hit ratio is very low and bluestore is
fetching a significant fraction of the onodes from RocksDB. RocksDB
itself likely has very little block cache, so each of those reads means
fetching a block from disk, inserting it into the block cache, and then
potentially discarding it almost immediately to make room for the next
block when memory is that tight.
One thing that's missing here is a description of the workload that is
happening between snapshots. Are you doing a lot of IO that could
result in severe fragmentation between snapshots?
Thanks,
Mark
- Hector
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io