The default osd_memory_target is 4 GB; it's not recommended to
reduce it to such low values, especially if there's real load on the
cluster. I am not a developer, so I can't really comment on the code.
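For reference, checking the current value and restoring the default
would look something like this (a sketch; `osd` applies the value to
all OSDs, and values are in bytes):

    ceph config get osd osd_memory_target              # default: 4294967296 (4 GiB)
    ceph config set osd osd_memory_target 4294967296   # restore the default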
Quoting Hector Martin <mar...@marcan.st>:
Hi all,
I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
usage and cluster performance plummets. Most of the CPU usage was
accounted as "system", while actual disk I/O usage was low, so that
didn't sound right.
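(For the record, the user/system split per OSD can be watched with
something like the following; pidstat comes from the sysstat package:

    pidstat -u -p $(pgrep -d, ceph-osd) 5   # %usr vs. %system for every ceph-osd, sampled every 5 s
)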
I perf-traced the system and found that most of the usage is in
__arch_copy_to_user in the kernel, in read() syscalls. That looks like
the OSD is thrashing buffered block-device reads that the kernel
satisfies from the page cache (hence no real I/O load; most of the CPU
usage is data copies), so it's repeatedly reading the same disk
blocks, which doesn't sound right.
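Roughly how such a trace can be captured (a sketch; the 30 s sampling
window is arbitrary):

    perf record -g -p $(pgrep -d, ceph-osd) -- sleep 30   # sample all ceph-osd processes with call graphs
    perf report --stdio                                   # render the call tree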
I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
config`, and CPU usage immediately dropped to near zero. After a
while, though, the CPU thrashing returns once memory usage grows
again. Restarting an OSD has a similar effect. I believe something is
wrong with the OSD BlueStore cache allocation/eviction policy: when
the cache becomes full, it starts thrashing reads instead of evicting
colder cached data (or perhaps one cache bucket is starving another
cache bucket of space).
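The live change was along these lines (a sketch; the value is in bytes
and is picked up at runtime, no restart needed):

    ceph config set osd osd_memory_target 2576980378   # ~2.4 GiB, up from ~1.2 GiB (1288490189)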
I would appreciate some hints on how to debug this. Are there any cache
stats I should be looking at, or info on how the cache is partitioned?
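For context, the obvious places to look seem to be roughly these (a
sketch; osd.0 is a placeholder and needs a reachable admin socket):

    ceph daemon osd.0 dump_mempools                  # bluestore_cache_data/_onode/_other pool sizes
    ceph daemon osd.0 perf dump bluestore            # BlueStore counters, including cache activity
    ceph config get osd bluestore_cache_autotune     # cache partitioning is autotuned by default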
Here is a perf call trace of the thrashing:
- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
 - 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
  - OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
   - 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
    - PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
     - ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
      - 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
       - 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
        - BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
         - 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
          - 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
           - BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
            - 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
             - 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
              - 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
               - 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
                - 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
                 - rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
                  - 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
                   - 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
                    - rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
                     - 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
                      - rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
                       - 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
                        - rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
                         - rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
                          - rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
                           - 88.25% rocksdb::BlockFetcher::ReadBlockContents()
                             [ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
                            - 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
                              [ note: at this level the call trace splits into 4, but it all leads to the same place ]
                             - 35.99% 0xaaaac0992e40 BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
                               BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
                              - KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
                               - 35.98% __libc_pread
                                - el0_svc
                                 - invoke_syscall
                                  - 35.97% __arm64_sys_pread64
                                   - 35.96% vfs_read
                                    - blkdev_read_iter
                                     - 35.93% filemap_read
                                      - 35.28% copy_page_to_iter
                                        35.01% __arch_copy_to_user
- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io