Hi Hector,

Just as a follow-up, here are the comments I mentioned in the Ceph Slack channel from the PR where we re-enabled bluefs_buffered_io.  I recorded these notes back when aclamk and I were digging into the RocksDB code to see if there was anything we could do to improve the situation.  They may be valuable if you want to dig into the RocksDB side of what is going on.

https://github.com/ceph/ceph/pull/38044#issuecomment-789192720

https://github.com/ceph/ceph/pull/38044#issuecomment-790157415

In the end we never solved the problem afaik.  We still rely on bluefs_buffered_io and the Linux page cache to hide excessive disk reads (especially the repeated prefetch reads mentioned in the PR!) when the block cache misses.  In your case, with very limited memory and potentially limited page cache, we are likely hitting RocksDB exceptionally hard for metadata lookups, and there's no guarantee that the page cache itself isn't thrashing as well when the block cache misses.  If that's the case, the bluefs_buffered_io "fix" can't mitigate the repeated prefetch reads we observed.  That may be related to what you are seeing in your setup as well.
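If you want to sanity-check whether the page cache on those nodes really is thrashing while this happens, the bcc-tools cachestat utility and sar are quick ways to look.  A rough sketch, assuming bcc-tools and sysstat are installed on the OSD host (the exact cachestat binary name varies by distro, and the 5-second interval is arbitrary):

  # page cache hit/miss counts and hit ratio, sampled every 5 seconds
  cachestat 5

  # paging activity (pgpgin/s, pgpgout/s, fault rates) every 5 seconds
  sar -B 5

A sustained low hit ratio in cachestat while the OSDs are spinning would suggest the page cache can't keep the hot RocksDB blocks resident either, in which case bluefs_buffered_io has nothing left to hide behind.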


Mark

On 6/24/25 08:06, Mark Nelson wrote:
Hi Hector,

Sorry I'm a bit late to the party on this one.  I wrote the OSD memory autotuning code and am probably one of the most recent people to really dig in and refactor bluestore's caches.  I'll respond inline below.

On 6/22/25 05:51, Hector Martin wrote:
Hi all,

I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
usage and cluster performance plummets. Most of the CPU usage was
accounted as "system", while actual disk I/O usage was low, so that
didn't sound right.

I perf traced the system, and found that most of the usage is in
__arch_copy_to_user in the kernel, in read() syscalls. That sounds like
the OSD is thrashing buffered block device reads which the kernel
satisfies from the page cache (hence no real I/O load, most of the CPU
usage is from data copies), so it's repeatedly reading the same disk
blocks, which doesn't sound right.
I suspect that what you are seeing here are RocksDB reads, since we allow those to come from the page cache.  We have a long-standing issue where RocksDB repeatedly re-reads the same blocks for readahead when data isn't in the block cache, and buffered IO is basically the crutch that lets it perform well anyway.  Tyler Stachecki mentioned the bluefs_buffered_io setting, and that is indeed directly related to this issue and to why we've flip-flopped on whether to leave it enabled.  This is a bit of a distraction imho though, see below.
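For reference, it's easy to double-check what your OSDs are actually running with.  A minimal sketch, with osd.0 standing in for one of your OSD ids:

  # what the cluster configuration database has for OSDs
  ceph config get osd bluefs_buffered_io

  # what one running OSD daemon is actually using
  ceph config show osd.0 bluefs_buffered_io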



I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
config`, and CPU usage immediately dropped to near zero. However, after
waiting a while, the CPU thrashing returns as memory usage increases.
Restarting an OSD has a similar effect. I believe that something is
wrong with the OSD bluestore cache allocation/flush policy, and when the
cache becomes full it starts thrashing reads instead of evicting colder
cached data (or perhaps some cache bucket is starving another cache
bucket of space).

In bluestore itself we have two primary caches: the onode cache and the buffer cache.  The onode cache stores metadata about objects while the buffer cache stores object data.  A cache miss in either of these will result in data being read, but onode cache misses are far more expensive because they require extra work to fetch the data from RocksDB (even if the data is sitting in RocksDB's block cache!).  They may even result in extra reads of the SST files if the data can't easily be found in the early RocksDB levels.

To me, what you are seeing sounds like severe onode cache misses, potentially combined with additional read throughput overhead due to cache thrashing during compaction if snaptrimming is resulting in a lot of writes or deletes (combined with listings).
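If you want to see how much memory those two caches are actually getting on one of the affected OSDs, dumping the mempools is the quickest way I know of.  A rough sketch, run on the OSD host with osd.0 as a placeholder (pool names vary a little between releases, but they all start with bluestore_cache):

  ceph daemon osd.0 dump_mempools | grep -A 2 bluestore_cache

bluestore_cache_onode and bluestore_cache_data roughly correspond to the onode and buffer caches; with a 1.2G osd_memory_target I'd expect both to be very small.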



I would appreciate some hints on how to debug this. Are there any cache
stats I should be looking at, or info on how the cache is partitioned?

Just to start, the onode cache hit and miss rates in the OSD performance counters are a good place to look.  As Anthony mentioned, things tend not to work very well once you drop below ~1.5-2GB of memory.  There are some tricks you can play that may help (reducing various buffer sizes, pglog lengths, RocksDB WAL buffer sizes and counts, etc).  The honest truth though is that Ceph is incredibly sensitive to onode cache misses, especially if the backend storage is not very fast.
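A minimal sketch of pulling those counters (osd.0 is a placeholder, and the exact counter names have shifted a bit between releases, but they all contain "onode"):

  # via the cluster (recent releases)
  ceph tell osd.0 perf dump | grep -i onode

  # or on the OSD host via the admin socket
  ceph daemon osd.0 perf dump bluestore | grep -i onode

If the miss counters climb in lockstep with the CPU thrashing, that's a pretty strong confirmation.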



Here is a perf call trace of the thrashing:

- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
- 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
- OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
- 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
- PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
- ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
- 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
- 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
- BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
- 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
- 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
- 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
- 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
- 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
- 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
- rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
- 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
- 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
- rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
- 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
- rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
- 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
- rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
- rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
- rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
- 88.25% rocksdb::BlockFetcher::ReadBlockContents()
[ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
- 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
[ note: at this level the call trace splits into 4, but it all leads to
the same place ]
- 35.99% 0xaaaac0992e40
BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
- KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
- 35.98% __libc_pread
- el0_svc
- invoke_syscall
- 35.97% __arm64_sys_pread64
- 35.96% vfs_read
- blkdev_read_iter
- 35.93% filemap_read
- 35.28% copy_page_to_iter
35.01% __arch_copy_to_user

Note how much of the perf call trace is spent in RocksDB fetching blocks as a result of trying to load BlueStore shared_blobs via onodes.  Presumably your onode hit ratio is very low, so bluestore has to fetch a significant fraction of the onodes from RocksDB.  RocksDB itself likely has very little block cache, so each of those lookups ends up fetching a block from disk and putting it in the block cache, only to potentially evict it almost immediately to make room for the next block.
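One way to see how the memory autotuner is actually splitting your osd_memory_target between the data, metadata, and kv (RocksDB block) caches is the prioritycache sections of the perf counters.  A rough sketch, assuming osd.0 and that your release exposes those sections under these names (the -A 12 is just enough context to see the byte counts):

  ceph tell osd.0 perf dump | grep -A 12 '"prioritycache'

At a 1.2G target I would expect the kv cache to be getting almost nothing, which would line up with the block cache churn in your trace.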

One thing that's missing here is a description of the workload that is happening between snapshots.  Are you doing a lot of IO that could result in severe fragmentation between snapshots?
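On the fragmentation angle, if you want a quick number, the allocator score admin socket command reports a rough fragmentation estimate between 0 (none) and 1 (very fragmented).  A sketch, run on the OSD host with osd.0 as a placeholder (older releases may not have the command):

  ceph daemon osd.0 bluestore allocator score block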


Thanks,

Mark


- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
