On 2025/06/22 22:37, Eugen Block wrote:
> The default OSD memory cache size is 4 GB, it’s not recommended to
> reduce it to such low values, especially if there’s real load on the
> cluster. I am not a developer, so I can’t really comment on the code.
I don't have enough RAM for 4GB per OSD. Nonetheless, this behavior is
clearly broken. I also confirmed that `ceph tell osd.X cache drop` fixes
the CPU usage (until the cache fills up again). If something assumes the
cache is at least 4GB when sizing its internal partitions (or something
along those lines), then I'd like to know how to tune that, and the
default should probably be fixed to remove the assumption.

>
> Zitat von Hector Martin <mar...@marcan.st>:
>
>> Hi all,
>>
>> I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph
>> version 19.2.2) and noticed that during snaptrim ops (?), the OSDs max
>> out CPU usage and cluster performance plummets. Most of the CPU usage
>> was accounted as "system", while actual disk I/O usage was low, so
>> that didn't sound right.
>>
>> I perf traced the system, and found that most of the usage is in
>> __arch_copy_to_user in the kernel, in read() syscalls. That looks like
>> the OSD is thrashing buffered block device reads that the kernel
>> satisfies from the page cache (hence no real I/O load; most of the CPU
>> usage is data copies), so it's repeatedly reading the same disk
>> blocks, which doesn't sound right.
>>
>> I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
>> config`, and CPU usage immediately dropped to near zero. However,
>> after a while the CPU thrashing returns once memory usage grows back
>> up. Restarting an OSD has a similar effect. I believe something is
>> wrong with the OSD bluestore cache allocation/flush policy: when the
>> cache becomes full, it starts thrashing reads instead of evicting
>> colder cached data (or perhaps one cache bucket is starving another of
>> space).
>>
>> I would appreciate some hints on how to debug this. Are there any
>> cache stats I should be looking at, or info on how the cache is
>> partitioned?
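[Inline note, partly answering my own question above: these are the
commands I've been using to poke at the memory/cache state. osd.3 is
just an example id, and I'm not certain the perf counters are the
authoritative place to look, so corrections are welcome:

    # current memory target for one OSD, in bytes
    ceph config get osd.3 osd_memory_target

    # raise it live, per OSD or for all OSDs (~2.4 GiB in this example)
    ceph config set osd.3 osd_memory_target 2576980378
    ceph config set osd osd_memory_target 2576980378

    # bluestore/rocksdb memory pool usage, via the admin socket
    ceph daemon osd.3 dump_mempools

    # perf counters, including the bluestore cache sections
    ceph daemon osd.3 perf dump

    # drop the caches (works around the thrashing until they fill again)
    ceph tell osd.3 cache drop

My understanding is that with bluestore_cache_autotune (the default) the
cache sizes are derived from osd_memory_target and osd_memory_cache_min,
but I'd appreciate confirmation of how the partitioning between the
onode/data/rocksdb caches actually works.]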
>>
>> Here is a perf call trace of the thrashing:
>>
>>> - OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>> - 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>> - OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>> - 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>> - PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>> - ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>> - 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
>>> - 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>> - BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
>>> - 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
>>> - 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>>> - BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>>> - 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
>>> - 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
>>> - 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
>>> - 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
>>> - 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
>>> - rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
>>> - 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
>>> - 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
>>> - rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
>>> - 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
>>> - rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
>>> - 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
>>> - rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
>>> - rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
>>> - rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
>>> - 88.25% rocksdb::BlockFetcher::ReadBlockContents()
>>
>> [ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum ]
>>
>>> - 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
>>
>> [ note: at this level the call trace splits into 4, but it all leads
>> to the same place ]
>>
>>> - 35.99% 0xaaaac0992e40
>>>   BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
>>>   BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
>>> - KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
>>> - 35.98% __libc_pread
>>> - el0_svc
>>> - invoke_syscall
>>> - 35.97% __arm64_sys_pread64
>>> - 35.96% vfs_read
>>> - blkdev_read_iter
>>> - 35.93% filemap_read
>>> - 35.28% copy_page_to_iter
>>>   35.01% __arch_copy_to_user
>>
>> - Hector

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io