Hi all,

I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version 19.2.2) and noticed that during snaptrim ops (?) the OSDs max out CPU usage and cluster performance plummets. Most of the CPU usage was accounted as "system" time, while actual disk I/O was low, which didn't seem right.
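For reference, this is roughly how I was observing it (the OSD pid below is a placeholder):

    # watch the user/system CPU split while the thrashing is happening
    top
    # confirm the disks themselves are mostly idle
    iostat -x 5
    # capture a 30s call graph from one busy OSD for perf report
    perf record -g -p <ceph-osd-pid> -- sleep 30
    perf report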
I perf traced the system and found that most of the usage is in __arch_copy_to_user in the kernel, inside read() syscalls. That looks like the OSD is thrashing buffered block-device reads which the kernel satisfies from the page cache (hence no real I/O load; most of the CPU time goes to data copies), i.e. it is repeatedly reading the same disk blocks, which doesn't sound right.

I increased osd_memory_target from 1.2G to 2.4G live with `ceph config` (exact command below), and CPU usage immediately dropped to near zero. However, after a while the CPU thrashing returns, once memory usage has grown back up. Restarting an OSD has a similar effect.

I believe something is wrong with the OSD bluestore cache allocation/flush policy: when the cache becomes full it starts thrashing reads instead of evicting colder cached data (or perhaps one cache bucket is starving another of space). I would appreciate some hints on how to debug this. Are there any cache stats I should be looking at, or info on how the cache is partitioned?
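For completeness, this is how I changed the memory target at runtime (value is in bytes; osd.0 is just an example):

    # raise the per-OSD memory target from ~1.2G to ~2.4G
    ceph config set osd osd_memory_target 2400000000
    # verify what an individual daemon reports as its target
    ceph config get osd.0 osd_memory_target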
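So far the only cache introspection I know of is the mempool accounting and the daemon perf counters (again with osd.0 as an example); I can post output from these if they are the right place to look:

    # per-pool memory accounting (bluestore caches, rocksdb, etc.)
    ceph daemon osd.0 dump_mempools
    # daemon perf counters, including bluestore/bluefs cache stats
    ceph daemon osd.0 perf dump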
Here is a perf call trace of the thrashing:

> - OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>  - 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>   - OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>    - 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>     - PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>      - ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>       - 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
>        - 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>         - BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
>          - 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
>           - 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>            - BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>             - 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
>              - 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
>               - 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
>                - 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
>                 - 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
>                  - rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
>                   - 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
>                    - 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
>                     - rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
>                      - 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
>                       - rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
>                        - 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
>                         - rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
>                          - rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
>                           - rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
>                            - 88.25% rocksdb::BlockFetcher::ReadBlockContents()  [note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
>                             - 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli  [note: at this level the call trace splits into 4, but it all leads to the same place]
>                              - 35.99% 0xaaaac0992e40
>                                BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
>                                BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
>                               - KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
>                                - 35.98% __libc_pread
>                                 - el0_svc
>                                  - invoke_syscall
>                                   - 35.97% __arm64_sys_pread64
>                                    - 35.96% vfs_read
>                                     - blkdev_read_iter
>                                      - 35.93% filemap_read
>                                       - 35.28% copy_page_to_iter
>                                         35.01% __arch_copy_to_user

- Hector