Hi Hector,

Just as a follow-up, here are the comments I mentioned in the Ceph Slack channel from the PR where we re-enabled bluefs_buffered_io.  I recorded these notes back when aclamk and I were digging into the RocksDB code to see if there was anything we could do to improve the situation.  They may be valuable if you want to dig into the RocksDB side of what is going on.

https://github.com/ceph/ceph/pull/38044#issuecomment-789192720

https://github.com/ceph/ceph/pull/38044#issuecomment-790157415

In the end we never solved the problem afaik.  We still rely on bluefs_buffered_io and the Linux page cache to hide excessive disk reads (especially the repeated prefetch reads mentioned in the PR!) when the block cache misses.  In your case, with very limited memory and potentially limited page cache, we are likely hitting RocksDB exceptionally hard for metadata lookups, and there's no guarantee that the page cache itself isn't thrashing as well when the block cache misses.  If that's the case, the bluefs_buffered_io "fix" can't mitigate the repeated prefetch reads we observed.  That may be related to what you are seeing in your setup as well.
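If you want to sanity-check whether the page cache on those nodes really is thrashing while this happens, the bcc-tools cachestat utility and sar are quick ways to look.  A rough sketch, assuming bcc-tools and sysstat are installed on the OSD host (the exact cachestat binary name varies by distro, and the 5-second interval is arbitrary):

  # page cache hit/miss counts and hit ratio, sampled every 5 seconds
  cachestat 5

  # paging activity (pgpgin/s, pgpgout/s, fault rates) every 5 seconds
  sar -B 5

A sustained low hit ratio in cachestat while the OSDs are spinning would suggest the page cache can't keep the hot RocksDB blocks resident either, in which case bluefs_buffered_io has nothing left to hide behind.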


Mark

On 6/24/25 08:06, Mark Nelson wrote:
Hi Hector,

Sorry I'm a bit late to the party on this one.  I wrote the OSD memory autotuning code and am probably one of the most recent people to really dig in and refactor bluestore's caches.  I'll respond inline below.

On 6/22/25 05:51, Hector Martin wrote:
Hi all,

I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
usage and cluster performance plummets. Most of the CPU usage was
accounted as "system", while actual disk I/O usage was low, so that
didn't sound right.

I perf traced the system, and found that most of the usage is in
__arch_copy_to_user in the kernel, in read() syscalls. That sounds like
the OSD is thrashing buffered block device reads which the kernel
satisfies from the page cache (hence no real I/O load, most of the CPU
usage is from data copies), so it's repeatedly reading the same disk
blocks, which doesn't sound right.
I suspect that what you are seeing here are RocksDB reads, since we allow those to come from the page cache.  We have a long-standing issue where RocksDB repeatedly re-reads the same blocks for readahead when data isn't in the block cache, and buffered IO is basically the crutch that lets it perform well anyway.  Tyler Stachecki mentioned the bluefs_buffered_io setting, and that is indeed directly related to this issue and to why we've flip-flopped on whether to leave it enabled.  This is a bit of a distraction imho though, see below.
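For reference, it's easy to double-check what your OSDs are actually running with.  A minimal sketch, with osd.0 standing in for one of your OSD ids:

  # what the cluster configuration database has for OSDs
  ceph config get osd bluefs_buffered_io

  # what one running OSD daemon is actually using
  ceph config show osd.0 bluefs_buffered_io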



I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
config`, and CPU usage immediately dropped to near zero. However, after
waiting a while, the CPU thrashing returns as memory usage increases.
Restarting an OSD has a similar effect. I believe that something is
wrong with the OSD bluestore cache allocation/flush policy, and when the
cache becomes full it starts thrashing reads instead of evicting colder
cached data (or perhaps some cache bucket is starving another cache
bucket of space).

In bluestore itself we have two primary caches: the onode cache and the buffer cache.  The onode cache stores metadata about objects while the buffer cache stores object data.  A cache miss in either of these will result in data being read, but onode cache misses are far more expensive because they require extra work to fetch the data from RocksDB (even if the data is sitting in RocksDB's block cache!).  They may even result in extra reads of the SST files if the data can't easily be found in the early RocksDB levels.

To me, what you are seeing sounds like severe onode cache misses, potentially combined with additional read throughput overhead due to cache thrashing during compaction if snaptrimming is resulting in a lot of writes or deletes (combined with listings).
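If you want to see how much memory those two caches are actually getting on one of the affected OSDs, dumping the mempools is the quickest way I know of.  A rough sketch, run on the OSD host with osd.0 as a placeholder (pool names vary a little between releases, but they all start with bluestore_cache):

  ceph daemon osd.0 dump_mempools | grep -A 2 bluestore_cache

bluestore_cache_onode and bluestore_cache_data roughly correspond to the onode and buffer caches; with a 1.2G osd_memory_target I'd expect both to be very small.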



I would appreciate some hints on how to debug this. Are there any cache
stats I should be looking at, or info on how the cache is partitioned?

Just to start, the onode cache hit and miss rates in the OSD performance counters are a good place to look.  As Anthony mentioned, things tend not to work very well once you drop below ~1.5-2GB of memory.  There are some tricks you can play that may help (reducing various buffer sizes, pglog lengths, RocksDB WAL buffer sizes and counts, etc).  The honest truth though is that Ceph is incredibly sensitive to onode cache misses, especially if the backend storage is not very fast.
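A minimal sketch of pulling those counters (osd.0 is a placeholder, and the exact counter names have shifted a bit between releases, but they all contain "onode"):

  # via the cluster (recent releases)
  ceph tell osd.0 perf dump | grep -i onode

  # or on the OSD host via the admin socket
  ceph daemon osd.0 perf dump bluestore | grep -i onode

If the miss counters climb in lockstep with the CPU thrashing, that's a pretty strong confirmation.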



Here is a perf call trace of the thrashing:

- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
- 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
- OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
- 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
- PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
- ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
- 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
- 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
- BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
- 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
- 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
- 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
- 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
- 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
- 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
- rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
- 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
- 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
- rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
- 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
- rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
- 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
- rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
- rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
- rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
- 88.25% rocksdb::BlockFetcher::ReadBlockContents()
[ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
- 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
[ note: at this level the call trace splits into 4, but it all leads to
the same place ]
- 35.99% 0xaaaac0992e40
BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
- KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
- 35.98% __libc_pread
- el0_svc
- invoke_syscall
- 35.97% __arm64_sys_pread64
- 35.96% vfs_read
- blkdev_read_iter
- 35.93% filemap_read
- 35.28% copy_page_to_iter
35.01% __arch_copy_to_user

Note how much of the perf call trace is spent in RocksDB fetching blocks as a result of trying to load BlueStore shared_blobs via onodes.  Presumably your onode hit ratio is very low, so bluestore has to fetch a significant fraction of the onodes from RocksDB.  RocksDB itself likely has very little block cache, so each of those lookups ends up fetching a block from disk and putting it in the block cache, only to potentially evict it almost immediately to make room for the next block.
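One way to see how the memory autotuner is actually splitting your osd_memory_target between the data, metadata, and kv (RocksDB block) caches is the prioritycache sections of the perf counters.  A rough sketch, assuming osd.0 and that your release exposes those sections under these names (the -A 12 is just enough context to see the byte counts):

  ceph tell osd.0 perf dump | grep -A 12 '"prioritycache'

At a 1.2G target I would expect the kv cache to be getting almost nothing, which would line up with the block cache churn in your trace.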

One thing that's missing here is a description of the workload that is happening between snapshots.  Are you doing a lot of IO that could result in severe fragmentation between snapshots?
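On the fragmentation angle, if you want a quick number, the allocator score admin socket command reports a rough fragmentation estimate between 0 (none) and 1 (very fragmented).  A sketch, run on the OSD host with osd.0 as a placeholder (older releases may not have the command):

  ceph daemon osd.0 bluestore allocator score block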


Thanks,

Mark


- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
