Hi Hector,
Just as a follow-up, here are the comments I mentioned in the Ceph Slack
channel from the PR where we re-enabled bluefs_buffered_io. I recorded
these notes back
when aclamk and I were digging into the RocksDB code to see if there was
anything we could do to improve the situation. They may be valuable if
you want to dig into the RocksDB side of what is going on.
https://github.com/ceph/ceph/pull/38044#issuecomment-789192720
https://github.com/ceph/ceph/pull/38044#issuecomment-790157415
In the end we never solved the problem afaik. We still rely on
bluefs_buffered_io and the Linux page cache to hide excessive disk
reads (especially the repeated prefetch reads mentioned in the PR!) when
the block cache misses. In your case, with very limited memory and
potentially limited page cache, we are likely hitting RocksDB
exceptionally hard for metadata lookups and there's no guarantee that
the page cache itself isn't thrashing as well when the block cache
misses. If that's the case, the bluefs_buffered_io "fix" would not
be able to mitigate the repeated prefetch reads we observed. That may
be related to what you are observing in your setup as well.
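If you want to check whether your OSDs are doing the same repeated
prefetch/random reads, the bluefs counters in a perf dump are a quick
sanity check. A rough sketch (osd.0 is just an example, the exact
counter names can differ a bit between releases, and this has to run on
the node hosting that OSD):

  # Dump the bluefs read counters for one OSD via its admin socket;
  # sampling this twice during snaptrim shows how quickly they grow.
  ceph daemon osd.0 perf dump | jq '.bluefs | with_entries(select(.key | startswith("read")))'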
Mark
On 6/24/25 08:06, Mark Nelson wrote:
Hi Hector,
Sorry I'm a bit late to the party on this one. I wrote the OSD memory
autotuning code and am probably one of the most recent people to
really dig in and refactor bluestore's caches. I'll respond inline
below.
On 6/22/25 05:51, Hector Martin wrote:
Hi all,
I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
usage and cluster performance plummets. Most of the CPU usage was
accounted as "system", while actual disk I/O usage was low, so that
didn't sound right.
I perf traced the system, and found that most of the usage is in
__arch_copy_to_user in the kernel, in read() syscalls. That sounds like
the OSD is thrashing buffered block device reads which the kernel
satisfies from the page cache (hence no real I/O load, most of the CPU
usage is from data copies), so it's repeatedly reading the same disk
blocks, which doesn't sound right.
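(For reference, a profile like the one quoted further down can be
captured with something along these lines; the sampling rate, duration,
and the choice of system-wide vs. per-process attach are arbitrary:)

  # System-wide sampled profile with call graphs for 30 seconds,
  # then a text report of where CPU time is spent.
  perf record -a -g -F 99 -- sleep 30
  perf report --stdio | less

  # Or attach to a single busy OSD process instead (substitute its pid):
  perf record -g -p <osd-pid> -- sleep 30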
I suspect that what you are seeing here are RocksDB reads, since we
allow those to come from page cache. We have a long standing issue
where RocksDB was repeatedly re-reading the same blocks for readahead
when data isn't in the block cache, and buffered IO is basically the
crutch that allows it to perform well. Tyler Stachecki mentioned the
bluefs_buffered_io setting; that is indeed directly related to this
issue and is why we've flip-flopped on whether to leave it enabled.
This is a bit of a distraction imho though, see below.
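Still, if you want to rule it in or out on your cluster, it is easy to
see what the OSDs are actually running with and to flip the setting
live; osd.0 below is just an example:

  # What one running OSD is currently using
  ceph config show osd.0 bluefs_buffered_io
  # What is configured for the osd class as a whole
  ceph config get osd bluefs_buffered_io
  # Toggle it to compare behavior (true is the current default); it should be
  # picked up at runtime, but restarting one OSD to be sure doesn't hurt.
  ceph config set osd bluefs_buffered_io true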
I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
config`, and CPU usage immediately dropped to near zero. However, after
waiting a bit, the CPU thrashing eventually returns as memory usage
increases. Restarting an OSD has a similar effect. I believe that
something is wrong with the OSD bluestore cache allocation/flush policy,
and when the cache becomes full it starts thrashing reads instead of
evicting colder cached data (or perhaps some cache bucket is starving
another cache bucket of space).
In bluestore itself we have two primary caches: The onode cache and
the buffer cache. Onode cache stores the metadata about objects while
buffer cache stores object data. A cache miss in either of these will
result in data being read, but onode cache misses are far more
expensive because they require extra work to go fetch the data from
RocksDB (even if the data is stored in RocksDB's block cache!). It may
even result in extra reads of the SST files if the data can't easily
be found in early RocksDB levels.
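You can see how much memory each of these caches is actually holding on
a given OSD from the mempool stats; a rough sketch (osd.0 again as an
example, and the pool names shift around a bit between releases):

  # Per-pool bytes/items for the bluestore caches on one OSD (via its admin socket)
  ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool | with_entries(select(.key | startswith("bluestore_cache")))'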
To me, what you are seeing sounds like severe onode cache misses,
potentially combined with additional read throughput overhead due to
cache thrashing during compaction if snaptrimming is resulting in a
lot of writes or deletes (combined with listings).
I would appreciate some hints on how to debug this. Are there any cache
stats I should be looking at, or info on how the cache is partitioned?
The onode cache hit and miss rates in the OSD performance counters are
a good place to start. As Anthony mentioned, things tend not to work
very well once you drop below ~1.5-2GB of memory per OSD. There are
some tricks you can play that may help (reducing various buffer sizes,
pglog lengths, RocksDB WAL buffer sizes and counts, etc). The honest
truth, though, is that Ceph is incredibly sensitive to onode cache
misses, especially if the backend storage is not very fast.
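Concretely, something like the following is what I would look at first;
the counter names have moved around between releases
(bluestore_onode_hits vs. onode_hits), and the values below are only
illustrative starting points, not recommendations:

  # Onode cache hit/miss counters for one OSD (run on the OSD's host)
  ceph daemon osd.0 perf dump | jq '.bluestore | with_entries(select(.key | test("onode")))'

  # Examples of the kinds of knobs mentioned above -- values are illustrative only
  ceph config set osd osd_memory_target 2684354560    # ~2.5 GiB per OSD
  ceph config set osd osd_memory_cache_min 805306368  # keep at least ~768 MiB of cache
  ceph config set osd osd_min_pg_log_entries 500      # shorter pglogs
  ceph config set osd osd_max_pg_log_entries 500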
Here is a perf call trace of the thrashing:
- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
- 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
- OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
- 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
- PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
- ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
- 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
- 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
- BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
- 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
- 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
- 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
- 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
- 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
- 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
- rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
- 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
- 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
- rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
- 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
- rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
- 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
- rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
- rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
- rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
- 88.25% rocksdb::BlockFetcher::ReadBlockContents()
[ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
- 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
[ note: at this level the call trace splits into 4, but it all leads to the same place ]
- 35.99% 0xaaaac0992e40 BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
- KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
- 35.98% __libc_pread
- el0_svc
- invoke_syscall
- 35.97% __arm64_sys_pread64
- 35.96% vfs_read
- blkdev_read_iter
- 35.93% filemap_read
- 35.28% copy_page_to_iter
35.01% __arch_copy_to_user
Note how much of the perf call trace is spent in RocksDB fetching
blocks as a result of trying to load BlueStore shared_blobs via onodes.
Presumably your onode hit ratio is very low and bluestore is trying to
fetch a significant fraction of the onodes from RocksDB, which itself
likely has very little block cache. Each onode read then goes to
RocksDB, which likely has to fetch a block from disk and put it in the
block cache, only to evict it almost immediately because there is very
little memory left for the next block.
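If you want to experiment with how that limited memory is split, the
autotuner and the cache ratio options are the levers; treat the values
below purely as things to try, not as recommendations:

  # With autotuning off, the cache split is controlled by explicit ratios;
  # biasing toward metadata and the RocksDB block cache leaves less for object data.
  ceph config set osd bluestore_cache_autotune false
  ceph config set osd bluestore_cache_meta_ratio 0.6   # onode/metadata cache
  ceph config set osd bluestore_cache_kv_ratio 0.3     # RocksDB block cache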
One thing that's missing here is a description of the workload that is
happening between snapshots. Are you doing a lot of IO that could
result in severe fragmentation between snapshots?
Thanks,
Mark
- Hector
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nel...@clyso.com
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io