I did try running sudo ceph-bluestore-tool --out-dir /mnt/ceph bluefs-export, but it died after writing out 93 GB and filling up my root partition.
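For what it's worth, a rough sketch of one way to retry that export: check how much space BlueFS is actually using first (the bluefs-bdev-sizes command Igor mentions further down in this thread reports that), confirm the target filesystem has at least that much free, and point --out-dir at it. The export directory below is a placeholder and osd.80 is just the example OSD from earlier in the thread; run it with the OSD down:

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 bluefs-bdev-sizes
    df -h /srv/bluefs-export          # confirm more free space than BlueFS is using
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 --out-dir /srv/bluefs-export bluefs-export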
On Thu, Jul 11, 2019 at 3:32 PM Brett Chancellor <bchancel...@salesforce.com> wrote:
> We moved the .rgw.meta data pool over to SSD to try to improve
> performance; during the backfill the SSDs began dying en masse. Log
> attached to this case:
> https://tracker.ceph.com/issues/40741
>
> Right now the SSDs won't come up with either allocator and the cluster is
> pretty much dead.
>
> What are the consequences of deleting the .rgw.meta pool? Can it be
> recreated?
>
> On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de <ifedo...@suse.de> wrote:
>
>> You might want to try manual rocksdb compaction using ceph-kvstore-tool..
>>
>> Sent from my Huawei tablet
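For reference, the manual compaction being suggested would look roughly like the following. This is only a sketch: osd.69 is just the OSD id from the crash log below, the path may differ on your hosts, and the OSD has to be stopped while the tool runs:

    systemctl stop ceph-osd@69
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-69 compact
    systemctl start ceph-osd@69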
>>
>> -------- Original Message --------
>> Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
>> From: Brett Chancellor
>> To: Igor Fedotov
>> CC: Ceph Users
>>
>> Once backfilling finished, the cluster was super slow, and most OSDs were
>> filled with heartbeat_map errors. When an OSD restarts, it causes a cascade
>> of other OSDs to follow suit and restart, with logs like:
>>
>> -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded) ack+ondisk+write+known_if_redirected e1348522)
>> -2> 2019-07-10 18:34:50.967 7f34acf5d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
>> -1> 2019-07-10 18:34:50.967 7f34acf5d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
>> 0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal (Aborted) **
>> in thread 7f3493f2b700 thread_name:tp_osd_tp
>>
>> ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
>> 1: (()+0xf5d0) [0x7f34b57c25d0]
>> 2: (pread64()+0x33) [0x7f34b57c1f63]
>> 3: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x238) [0x55bfdae5a448]
>> 4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0xca) [0x55bfdae1271a]
>> 5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
>> 6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
>> 7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
>> 8: (()+0x11146a4) [0x55bfdb40d6a4]
>> 9: (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
>> 10: (rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*, rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
>> 11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
>> 12: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
>> 13: (()+0x10adde9) [0x55bfdb3a6de9]
>> 14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
>> 15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762) [0x55bfdb32a092]
>> 16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
>> 17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d) [0x55bfdad9fa8d]
>> 18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
>> 19: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
>> 20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
>> 21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0x55bfda985598]
>> 22: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55bfda9c45ca]
>> 23: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x55bfda9a20ca]
>> 24: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0x55bfda991389]
>> 25: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55bfda8cb3c4]
>> 26: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0x55bfda8cb804]
>> 27: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55bfda8bfb44]
>> 28: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55bfdaeb9e93]
>> 29: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55bfdaebcf30]
>> 30: (()+0x7dd5) [0x7f34b57badd5]
>> 31: (clone()+0x6d) [0x7f34b4680ead]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- logging levels ---
>> 0/ 5 none
>> 0/ 1 lockdep
>> 0/ 1 context
>> 1/ 1 crush
>> 1/ 5 mds
>> 1/ 5 mds_balancer
>> 1/ 5 mds_locker
>> 1/ 5 mds_log
>> 1/ 5 mds_log_expire
>> 1/ 5 mds_migrator
>> 0/ 1 buffer
>> 0/ 1 timer
>> 0/ 1 filer
>> 0/ 1 striper
>> 0/ 1 objecter
>> 0/ 5 rados
>> 0/ 5 rbd
>> 0/ 5 rbd_mirror
>> 0/ 5 rbd_replay
>> 0/ 5 journaler
>> 0/ 5 objectcacher
>> 0/ 5 client
>> 1/ 5 osd
>> 0/ 5 optracker
>> 0/ 5 objclass
>> 1/ 3 filestore
>> 1/ 3 journal
>> 0/ 0 ms
>> 1/ 5 mon
>> 0/10 monc
>> 1/ 5 paxos
>> 0/ 5 tp
>> 1/ 5 auth
>> 1/ 5 crypto
>> 1/ 1 finisher
>> 1/ 1 reserver
>> 1/ 5 heartbeatmap
>> 1/ 5 perfcounter
>> 1/ 5 rgw
>> 1/ 5 rgw_sync
>> 1/10 civetweb
>> 1/ 5 javaclient
>> 1/ 5 asok
>> 1/ 1 throttle
>> 0/ 0 refs
>> 1/ 5 xio
>> 1/ 5 compressor
>> 1/ 5 bluestore
>> 1/ 5 bluefs
>> 1/ 3 bdev
>> 1/ 5 kstore
>> 4/ 5 rocksdb
>> 4/ 5 leveldb
>> 4/ 5 memdb
>> 1/ 5 kinetic
>> 1/ 5 fuse
>> 1/ 5 mgr
>> 1/ 5 mgrc
>> 1/ 5 dpdk
>> 1/ 5 eventtrace
>> -2/-2 (syslog threshold)
>> -1/-1 (stderr threshold)
>> max_recent 10000
>> max_new 1000
>> log_file /var/log/ceph/ceph-osd.69.log
>> --- end dump of recent events ---
>>
>> On Tue, Jul 9, 2019 at 1:38 PM Igor Fedotov <ifedo...@suse.de> wrote:
>>
>> This will cap single bluefs space allocation. Currently it attempts to
>> allocate 70Gb, which seems to overflow some 32-bit length fields. With the
>> adjustment such an allocation should be capped at ~700MB.
>>
>> I doubt there is any relation between this specific failure and the pool,
>> at least at the moment.
>>
>> In short, the history is: the starting OSD tries to flush bluefs data to
>> disk, detects a lack of space and asks the main device for more; the
>> allocation succeeds but the returned extent has its length field set to 0.
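A quick back-of-the-envelope check of those numbers, assuming the default bluestore_bluefs_gift_ratio is 0.02 (that default is an assumption here, it is not stated in the thread): the gift appears to scale linearly with the ratio, so dropping it from 0.02 to 0.0002 cuts the allocation by a factor of 100, and 70 GB / 100 = 0.7 GB, which matches the ~700 MB cap mentioned above.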
>> On 7/9/2019 8:33 PM, Brett Chancellor wrote:
>>
>> What does bluestore_bluefs_gift_ratio do? I can't find any documentation
>> on it. Also, do you think this could be related to the .rgw.meta pool
>> having too many objects per PG? The disks that die always seem to be
>> backfilling a PG from that pool, and they have ~550k objects per PG.
>>
>> -Brett
>>
>> On Tue, Jul 9, 2019 at 1:03 PM Igor Fedotov <ifedo...@suse.de> wrote:
>>
>> Please try to set bluestore_bluefs_gift_ratio to 0.0002.
>>
>> On 7/9/2019 7:39 PM, Brett Chancellor wrote:
>>
>> Too large for pastebin. The problem is continually crashing new OSDs.
>> Here is the latest one.
>>
>> On Tue, Jul 9, 2019 at 11:46 AM Igor Fedotov <ifedo...@suse.de> wrote:
>>
>> Could you please set debug bluestore to 20 and collect a startup log for
>> this specific OSD once again?
>>
>> On 7/9/2019 6:29 PM, Brett Chancellor wrote:
>>
>> I restarted most of the OSDs with the stupid allocator (6 of them
>> wouldn't start unless the bitmap allocator was set), but I'm still seeing
>> issues with OSDs crashing. Interestingly, it seems that the dying OSDs are
>> always working on a PG from the .rgw.meta pool when they crash.
>>
>> Log: https://pastebin.com/yuJKcPvX
>>
>> On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov <ifedo...@suse.de> wrote:
>>
>> Hi Brett,
>>
>> in Nautilus you can do that via
>>
>> ceph config set osd.N bluestore_allocator stupid
>> ceph config set osd.N bluefs_allocator stupid
>>
>> See
>> https://ceph.com/community/new-mimic-centralized-configuration-management/
>> for more details on the new way of setting configuration options.
>>
>> A known issue with the Stupid allocator is a gradual write request latency
>> increase (occurring within several days after an OSD restart). It is seldom
>> observed, though. There were some posts about that behavior on the mailing
>> list this year.
>>
>> Thanks,
>> Igor.
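A small aside on those two commands, as a sketch rather than anything confirmed in this thread: the allocator options are only read at OSD startup, so each OSD has to be restarted for the change to take effect, and if you want to switch every OSD at once you should be able to set them on the whole osd section instead of per osd.N, e.g.:

    ceph config set osd bluestore_allocator stupid
    ceph config set osd bluefs_allocator stupid
    ceph config get osd.69 bluestore_allocator    # spot-check one OSD; osd.69 is just an example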
>> On 7/8/2019 8:33 PM, Brett Chancellor wrote:
>>
>> I'll give that a try. Is it something like...
>> ceph tell 'osd.*' bluestore_allocator stupid
>> ceph tell 'osd.*' bluefs_allocator stupid
>>
>> And should I expect any issues doing this?
>>
>> On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov <ifedo...@suse.de> wrote:
>>
>> I should have read the call stack more carefully... It's not about lacking
>> free space - this is rather the bug from this ticket:
>>
>> http://tracker.ceph.com/issues/40080
>>
>> You should upgrade to v14.2.2 (once it's available) or temporarily switch
>> to the stupid allocator as a workaround.
>>
>> Thanks,
>> Igor
>>
>> On 7/8/2019 8:00 PM, Igor Fedotov wrote:
>>
>> Hi Brett,
>>
>> it looks like BlueStore is unable to allocate additional space for BlueFS
>> on the main device. It's either lacking free space or it's too fragmented...
>>
>> Would you share the osd log, please?
>>
>> Also please run "ceph-bluestore-tool --path <substitute with
>> path-to-osd!!!> bluefs-bdev-sizes" and share the output.
>>
>> Thanks,
>> Igor
>>
>> On 7/3/2019 9:59 PM, Brett Chancellor wrote:
>>
>> Hi All!
>> Today I've had 3 OSDs stop themselves, and they are unable to restart,
>> all with the same error. These OSDs are all on different hosts. All are
>> running 14.2.1.
>>
>> I did try the following two commands:
>> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>> ## This failed with the same error below
>> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>> ## After a couple of hours returned...
>> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1 bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record found, suggest to run store repair to get consistent statistic reports
>> fsck success
>>
>> ## Error when trying to start one of the OSDs
>> -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal (Aborted) **
>> in thread 7f5e42366700 thread_name:rocksdb:low0
>>
>> ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
>> 1: (()+0xf5d0) [0x7f5e50bd75d0]
>> 2: (gsignal()+0x37) [0x7f5e4f9ce207]
>> 3: (abort()+0x148) [0x7f5e4f9cf8f8]
>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55a7aaee96ab]
>> 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55a7aaee982a]
>> 6: (interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
>> 7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)+0x74e) [0x55a7ab48253e]
>> 8: (BlueFS::_expand_slow_device(unsigned long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >&)+0x111) [0x55a7ab59e921]
>> 9: (BlueFS::_allocate(unsigned char, unsigned long, bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>> 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xe5) [0x55a7ab59fce5]
>> 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>> 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>> 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>> 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>> 15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
>> 16: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0) [0x55a7abc3f150]
>> 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
>> 18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
>> 19: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
>> 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
>> 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x55a7abc8d9c4]
>> 22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f) [0x55a7abc8db4f]
>> 23: (()+0x129dfff) [0x55a7abd1afff]
>> 24: (()+0x7dd5) [0x7f5e50bcfdd5]
>> 25: (clone()+0x6d) [0x7f5e4fa95ead]
>> NOTE: a copy of the executable, or
>> `objdump -rdS <executable>` is needed to interpret this.
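One aside on the fsck output above, offered as a sketch rather than something from the thread: the "legacy statfs record found" warning refers to the leftover pre-Nautilus statfs format and is separate from the crash itself. The store repair the tool suggests would be along the lines of

    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 repair

run with the OSD stopped; it can take a while on a large OSD.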
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com