Hi Wido,
wip-prune-past-intervals is the patch developed on top of Jewel for faster
OSD map updates and might be available in the Luminous release.

Regards,
Muthu


On Monday, 27 March 2017, Wido den Hollander <w...@42on.com> wrote:

>
> > On 27 March 2017 at 8:41, Muthusamy Muthiah <
> muthiah.muthus...@gmail.com> wrote:
> >
> >
> > Hi Wido,
> >
> > Yes, slow map updates were happening and the CPU was hitting 100%.
>
> So it indeed seems you are CPU bound at that moment. That's indeed a
> problem when you have a lot of map changes to work through on the OSDs.
>
> It's recommended to have 1 CPU core per OSD, as the OSDs badly need that
> power during recovery/boot.
>
> > We also tried setting the noup flag so that the cluster osdmap remained
> > at the same version. This let each OSD slowly catch up to the current
> > map. At one point we lost patience due to critical timelines and
> > re-installed the cluster. However, we plan to do this recovery again and
> > find an optimal procedure for it.
>
> The noup flag can indeed 'help' here to prevent new maps from being
> produced.
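>
> A minimal sketch of that, assuming the standard flag commands:
>
>   ceph osd set noup      # booting OSDs are no longer marked up, so fewer new maps
>   # ... wait for the lagging OSDs to catch up on their maps ...
>   ceph osd unset noup    # let them come up again once caught up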
>
> > Sage was commenting that there is another solution available in Luminous
> > which would recover the OSDs at a much faster rate than the current one
> > by skipping some maps instead of going through them sequentially.
>
> I am not aware of those improvements. Sage (or another dev) would need to
> comment on that.
>
> Wido
>
> >
> > Thanks,
> > Muthu
> >
> > On 20 March 2017 at 22:13, Wido den Hollander <w...@42on.com> wrote:
> >
> > >
> > > > On 18 March 2017 at 10:39, Muthusamy Muthiah <
> > > muthiah.muthus...@gmail.com> wrote:
> > > >
> > > >
> > > > Hi,
> > > >
> > > > We had a similar issue on one of the 5-node clusters again during
> > > > recovery (200/335 OSDs are to be recovered). We see a large
> > > > difference in the OSDmap epochs between an OSD which is booting and
> > > > the current one; the same is below:
> > > >
> > > > -          In the current situation the OSDs are trying to register
> > > > with an old OSDMAP version 7620, but the current version in the
> > > > cluster is higher, 13102 - as a result it takes longer for an OSD to
> > > > update to this version.
> > > >
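> > > > As a rough illustration (osd.0 below is just a placeholder id), the
> > > > gap can be seen by comparing the epoch an OSD has caught up to with
> > > > the cluster's current epoch, e.g.:
> > > >
> > > >   ceph daemon osd.0 status    # run on the OSD's host; should report oldest_map / newest_map
> > > >   ceph osd dump | head -1     # first line should show the current cluster epoch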
> > >
> > > Do you see these OSDs eating 100% CPU at that moment? E.g., could it
> > > be that the CPUs are not fast enough to process all the map updates
> > > quickly enough?
> > >
> > > IIRC map updates are not processed multi-threaded.
> > >
> > > Wido
> > >
> > > >
> > > > On many OSDs which are recovering we also see messages like:
> > > >
> > > > 2017-03-18 09:19:04.628206 7f2056735700 0 --
> > > > 10.139.4.69:6836/777372 >> - conn(0x7f20c1bfa800 :6836
> > > > s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing
> > > > to send and in the half accept state just closed
> > > >
> > > > Suggestions would be helpful.
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Muthu
> > > >
> > > > On 13 February 2017 at 18:14, Wido den Hollander <w...@42on.com> wrote:
> > > >
> > > > >
> > > > > > On 13 February 2017 at 12:57, Muthusamy Muthiah <
> > > > > muthiah.muthus...@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We also have the same issue on one of our platforms, which was
> > > > > > upgraded from 11.0.2 to 11.2.0. The issue occurs on one node
> > > > > > alone, where the CPU hits 100% and the OSDs of that node are
> > > > > > marked down. The issue is not seen on a cluster which was
> > > > > > installed from scratch with 11.2.0.
> > > > > >
> > > > >
> > > > > How many maps is this OSD behind?
> > > > >
> > > > > Does it help if you set the nodown flag for a moment to let it
> > > > > catch up?
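> > > > >
> > > > > Something along these lines (untested here, just the standard flag
> > > > > commands):
> > > > >
> > > > >   ceph osd set nodown      # OSDs are no longer marked down while they catch up
> > > > >   # ... let them work through the pending osdmap epochs ...
> > > > >   ceph osd unset nodown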
> > > > >
> > > > > Wido
> > > > >
> > > > > >
> > > > > > [r...@cn3.c7.vna ~] # systemctl start ceph-osd@315.service
> > > > > > [r...@cn3.c7.vna ~] # cd /var/log/ceph/
> > > > > > [r...@cn3.c7.vna ceph] # tail -f *osd*315.log
> > > > > > 2017-02-13 11:29:46.752897 7f995c79b940  0 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/cls/hello/cls_hello.cc:296: loading cls_hello
> > > > > > 2017-02-13 11:29:46.753065 7f995c79b940  0 _get_class not permitted to load kvs
> > > > > > 2017-02-13 11:29:46.757571 7f995c79b940  0 _get_class not permitted to load lua
> > > > > > 2017-02-13 11:29:47.058720 7f995c79b940  0 osd.315 44703 crush map has features 288514119978713088, adjusting msgr requires for clients
> > > > > > 2017-02-13 11:29:47.058728 7f995c79b940  0 osd.315 44703 crush map has features 288514394856620032 was 8705, adjusting msgr requires for mons
> > > > > > 2017-02-13 11:29:47.058732 7f995c79b940  0 osd.315 44703 crush map has features 288531987042664448, adjusting msgr requires for osds
> > > > > > 2017-02-13 11:29:48.343979 7f995c79b940  0 osd.315 44703 load_pgs
> > > > > > 2017-02-13 11:29:55.913550 7f995c79b940  0 osd.315 44703 load_pgs opened 130 pgs
> > > > > > 2017-02-13 11:29:55.913604 7f995c79b940  0 osd.315 44703 using 1 op queue with priority op cut off at 64.
> > > > > > 2017-02-13 11:29:55.914102 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true}
> > > > > > 2017-02-13 11:30:19.384897 7f9939bbb700  1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:31.073336 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:31.073343 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:31.073344 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:31.073345 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:31.073347 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:31.073348 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > 2017-02-13 11:30:54.772516 7f995c79b940  0 osd.315 44703 done with init, starting boot process
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Muthu
> > > > > >
> > > > > > On 13 February 2017 at 10:50, Andreas Gerstmayr <
> > > > > andreas.gerstm...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0 our
> > > > > > > test cluster has been unhealthy for about two weeks and can't
> > > > > > > recover by itself anymore (unfortunately I skipped the upgrade
> > > > > > > to 10.2.5 because I missed the ".z" in "All clusters must first
> > > > > > > be upgraded to Jewel 10.2.z").
> > > > > > >
> > > > > > > Immediately after the upgrade I saw the following in the OSD
> > > > > > > logs:
> > > > > > > s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with
> > > > > > > nothing to send and in the half accept state just closed
> > > > > > >
> > > > > > > There are also missed heartbeats in the OSD logs, and the OSDs
> > > > > > > which don't send heartbeats have the following in their logs:
> > > > > > > 2017-02-08 19:44:51.367828 7f9be8c37700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > > > > > 2017-02-08 19:44:54.271010 7f9bc4e96700  1 heartbeat_map reset_timeout 'tp_osd thread tp_osd' had timed out after 15
> > > > > > >
> > > > > > > While investigating we found out that some OSDs were lagging
> > > > > > > about 100-20000 OSD map epochs behind. The monitor publishes
> > > > > > > new epochs every few seconds, but the OSD daemons are pretty
> > > > > > > slow in applying them (up to a few minutes for 100 epochs).
> > > > > > > During recovery of the 24 OSDs of a storage node the CPU is
> > > > > > > running at almost 100% (the nodes have 16 real cores, or 32
> > > > > > > with Hyper-Threading).
> > > > > > >
> > > > > > > At times we had servers where all 24 OSDs were up to date with
> > > > > > > the latest OSD map, but somehow they fell behind and were
> > > > > > > lagging again. During recovery some OSDs used up to 25 GB of
> > > > > > > RAM, which led to out-of-memory conditions and further lagging
> > > > > > > of the OSDs on the affected server.
> > > > > > >
> > > > > > > We already set the nodown, noout, norebalance, nobackfill,
> > > > > > > norecover, noscrub and nodeep-scrub flags to prevent OSD
> > > > > > > flapping and even more new OSD epochs.
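> > > > > > >
> > > > > > > For reference, those flags can be set with the standard CLI,
> > > > > > > e.g. something like:
> > > > > > >
> > > > > > >   for f in nodown noout norebalance nobackfill norecover noscrub nodeep-scrub; do
> > > > > > >       ceph osd set "$f"
> > > > > > >   done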
> > > > > > >
> > > > > > > Is there anything we can do to let the OSDs recover? It seems
> > > > > > > that the servers don't have enough CPU resources for recovery.
> > > > > > > I already played around with the osd map message max setting
> > > > > > > (when I increased it to 1000 to speed up recovery, the OSDs
> > > > > > > didn't get any updates at all?), and the osd heartbeat grace
> > > > > > > and osd thread timeout settings (to give the overloaded server
> > > > > > > more time), but without success so far. I've seen errors
> > > > > > > related to the AsyncMessenger in the logs, so I reverted back
> > > > > > > to the SimpleMessenger (which was working successfully with
> > > > > > > Jewel).
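> > > > > > >
> > > > > > > For anyone trying the same knobs, this is roughly where they
> > > > > > > live in ceph.conf (the values below are placeholders for
> > > > > > > illustration only):
> > > > > > >
> > > > > > > [osd]
> > > > > > >     # fall back to the messenger that worked under Jewel
> > > > > > >     ms type = simple
> > > > > > >     # more slack before heartbeats / the 'tp_osd' op thread are considered hung
> > > > > > >     osd heartbeat grace = 60
> > > > > > >     osd op thread timeout = 60
> > > > > > >     # cap on how many osdmaps are sent per map message
> > > > > > >     osd map message max = 100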
> > > > > > >
> > > > > > >
> > > > > > > Cluster details:
> > > > > > > 6 storage nodes with 2x Intel Xeon E5-2630 v3 8x2.40GHz
> > > > > > > 256GB RAM
> > > > > > > Each storage node has 24 HDDs attached, one OSD per disk,
> > > > > > > journal on the same disk
> > > > > > > 3 monitors in total, co-located with the storage nodes
> > > > > > > separate front and back network (10 Gbit)
> > > > > > > OS: CentOS 7.2.1511
> > > > > > > Kernel: 4.9.8-1.el7.elrepo.x86_64 from elrepo.org
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Andreas
> > > > >
> > >
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
