We moved the .rgw.meta data pool over to SSD to try to improve
performance; during the backfill, the SSDs began dying en masse. Log
attached to this case:
https://tracker.ceph.com/issues/40741
Right now the SSDs won't come up with either allocator, and the cluster
is pretty much dead.
What are the consequences of deleting the .rgw.meta pool? Can it be
recreated?
On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de wrote:
You might want to try manual RocksDB compaction using
ceph-kvstore-tool.
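A minimal sketch of such a compaction run, assuming the OSD is stopped and mounted at the usual path (the OSD id and path here are illustrative, not from the thread):

```shell
# Stop the OSD first -- ceph-kvstore-tool needs exclusive access to the store.
systemctl stop ceph-osd@80

# Compact the RocksDB embedded in the BlueStore OSD (the bluestore-kv
# backend opens the OSD's internal RocksDB); this can take a while.
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 compact

# Bring the OSD back up afterwards.
systemctl start ceph-osd@80
```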
-------- Original Message --------
Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
From: Brett Chancellor
To: Igor Fedotov
CC: Ceph Users
Once backfilling finished, the cluster was super slow;
most OSDs were filled with heartbeat_map errors. When an
OSD restarts, it causes a cascade of other OSDs to follow
suit and restart. Logs look like:
-3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581 get_health_metrics reporting 21 slow ops, oldest is osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded) ack+ondisk+write+known_if_redirected e1348522)
-2> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
-1> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
 0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal (Aborted) **
in thread 7f3493f2b700 thread_name:tp_osd_tp
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
1: (()+0xf5d0) [0x7f34b57c25d0]
2: (pread64()+0x33) [0x7f34b57c1f63]
3: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x238) [0x55bfdae5a448]
4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0xca) [0x55bfdae1271a]
5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
8: (()+0x11146a4) [0x55bfdb40d6a4]
9: (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*, rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::Slice, rocksdb::BlockBasedTable::CachableEntry<rocksdb::Block>*, bool, rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
10: (rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*, rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
11: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
12: (rocksdb::BlockBasedTableIterator<rocksdb::DataBlockIter, rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
13: (()+0x10adde9) [0x55bfdb3a6de9]
14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762) [0x55bfdb32a092]
16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d) [0x55bfdad9fa8d]
18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
19: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38) [0x55bfda985598]
22: (boost::statechart::simple_state<PG::RecoveryState::Deleting, PG::RecoveryState::ToDelete, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55bfda9c45ca]
23: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5a) [0x55bfda9a20ca]
24: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x119) [0x55bfda991389]
25: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1b4) [0x55bfda8cb3c4]
26: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0x234) [0x55bfda8cb804]
27: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55bfda8bfb44]
28: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55bfdaeb9e93]
29: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55bfdaebcf30]
30: (()+0x7dd5) [0x7f34b57badd5]
31: (clone()+0x6d) [0x7f34b4680ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.69.log
--- end dump of recent events ---
On Tue, Jul 9, 2019 at 1:38 PM Igor Fedotov <ifedo...@suse.de> wrote:
This will cap a single BlueFS space allocation.
Currently it attempts to allocate 70 GB, which seems to
overflow some 32-bit length fields. With the
adjustment, such allocations should be capped at ~700 MB.
I doubt there is any relation between this specific
failure and the pool, at least at the moment.
In short, the history is: a starting OSD tries to flush
BlueFS data to disk, detects a lack of space, and asks
for more from the main device; the allocation succeeds, but
the returned extent has its length field set to 0.
On 7/9/2019 8:33 PM, Brett Chancellor wrote:
What does bluestore_bluefs_gift_ratio do? I can't
find any documentation on it. Also, do you think
this could be related to the .rgw.meta pool having
too many objects per PG? The disks that die always
seem to be backfilling a PG from that pool, and
they have ~550k objects per PG.
-Brett
On Tue, Jul 9, 2019 at 1:03 PM Igor Fedotov <ifedo...@suse.de> wrote:
Please try to set bluestore_bluefs_gift_ratio to 0.0002
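Assuming the Nautilus centralized config store, that would look something like this (the OSD id is illustrative):

```shell
# Cap how much space BlueFS is gifted from the main device per allocation.
ceph config set osd.80 bluestore_bluefs_gift_ratio 0.0002

# Verify the value took effect.
ceph config get osd.80 bluestore_bluefs_gift_ratio
```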
On 7/9/2019 7:39 PM, Brett Chancellor wrote:
Too large for pastebin.. The problem is
continually crashing new OSDs. Here is the
latest one.
On Tue, Jul 9, 2019 at 11:46 AM Igor Fedotov <ifedo...@suse.de> wrote:
Could you please set debug bluestore
to 20 and collect a startup log for this
specific OSD once again?
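One way to do that, sketched under the assumption that the OSD is managed by systemd and uses the central config store (the OSD id is illustrative):

```shell
# Raise BlueStore debug logging for the failing OSD.
ceph config set osd.80 debug_bluestore 20/20

# Restart so the startup path is captured at the higher level,
# then collect the resulting log.
systemctl restart ceph-osd@80
less /var/log/ceph/ceph-osd.80.log
```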
On 7/9/2019 6:29 PM, Brett Chancellor wrote:
I restarted most of the OSDs with the stupid
allocator (6 of them wouldn't start unless the
bitmap allocator was set), but I'm still
seeing issues with OSDs crashing.
Interestingly, the dying OSDs always seem
to be working on a PG from the .rgw.meta pool
when they crash.
Log: https://pastebin.com/yuJKcPvX
On Tue, Jul 9, 2019 at 5:14 AM Igor Fedotov <ifedo...@suse.de> wrote:
Hi Brett,
in Nautilus you can do that via
ceph config set osd.N bluestore_allocator stupid
ceph config set osd.N bluefs_allocator stupid
See
https://ceph.com/community/new-mimic-centralized-configuration-management/
for more details on the new way of setting configuration options.
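Putting those two commands together, a sketch for one affected OSD (the id is illustrative; the allocator options are read at startup, so a restart is needed for them to apply):

```shell
# Switch both the BlueStore and BlueFS allocators to "stupid".
ceph config set osd.80 bluestore_allocator stupid
ceph config set osd.80 bluefs_allocator stupid

# Restart the daemon so the new allocator is actually used.
systemctl restart ceph-osd@80
```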
A known issue with the stupid allocator is a
gradual write request latency increase
(occurring within several days after an OSD
restart). It is seldom observed, though. There
were some posts about that behavior on the
mailing list this year.
Thanks,
Igor.
On 7/8/2019 8:33 PM, Brett Chancellor wrote:
I'll give that a try. Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid
And should I expect any issues doing this?
On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov <ifedo...@suse.de> wrote:
I should have read the call stack more
carefully... It's not about lacking free
space - this is rather the bug from this ticket:
http://tracker.ceph.com/issues/40080
You should upgrade to v14.2.2 (once it's
available) or temporarily switch to the
stupid allocator as a workaround.
Thanks,
Igor
On 7/8/2019 8:00 PM, Igor Fedotov wrote:
Hi Brett,
looks like BlueStore is unable to allocate additional space
for BlueFS at the main device. It's either lacking free
space or it's too fragmented...
Would you share the OSD log, please?
Also please run
"ceph-bluestore-tool --path <substitute with path-to-osd!!!> bluefs-bdev-sizes"
and share the output.
Thanks,
Igor
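For example, using the OSD path that appears elsewhere in this thread (substitute the path of the failing OSD):

```shell
# Report each BlueFS device's size and how much of it is in use,
# which shows whether the main device is out of space for BlueFS.
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 bluefs-bdev-sizes
```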
On 7/3/2019 9:59 PM, Brett Chancellor wrote:
Hi All! Today I've had 3 OSDs stop themselves and are unable
to restart, all with the same error. These OSDs are all on
different hosts. All are running 14.2.1.
I did try the following two commands:
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
  ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1 bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record found, suggest to run store repair to get consistent statistic reports
fsck success
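The fsck warning above suggests a store repair; a sketch of that step, assuming the OSD is stopped while the tool runs:

```shell
# Stop the OSD, then rewrite the legacy statfs records that fsck flagged.
systemctl stop ceph-osd@80
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 repair
```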
## Error when trying to start one of the OSDs
-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal (Aborted) **
in thread 7f5e42366700 thread_name:rocksdb:low0
ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus (stable)
1: (()+0xf5d0) [0x7f5e50bd75d0]
2: (gsignal()+0x37) [0x7f5e4f9ce207]
3: (abort()+0x148) [0x7f5e4f9cf8f8]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55a7aaee96ab]
5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55a7aaee982a]
6: (interval_set<unsigned long, std::map<unsigned long, unsigned long, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*)+0x74e) [0x55a7ab48253e]
8: (BlueFS::_expand_slow_device(unsigned long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >&)+0x111) [0x55a7ab59e921]
9: (BlueFS::_allocate(unsigned char, unsigned long, bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0xe5) [0x55a7ab59fce5]
11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
16: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0) [0x55a7abc3f150]
17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
19: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) [0x55a7abc8d9c4]
22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f) [0x55a7abc8db4f]
23: (()+0x129dfff) [0x55a7abd1afff]
24: (()+0x7dd5) [0x7f5e50bcfdd5]
25: (clone()+0x6d) [0x7f5e4fa95ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com