Good job, thank you for sharing, Wido! It's very useful.

2016-07-14 14:33 GMT+08:00 Wido den Hollander <w...@42on.com>:

> To add, the RGWs upgraded just fine as well.
>
> No regions in use here (yet!), so that upgraded as it should.
>
> Wido
>
> > On 13 July 2016 at 16:56, Wido den Hollander <w...@42on.com> wrote:
> >
> >
> > Hello,
> >
> > For the last 3 days I worked at a customer's site on an 1800 OSD cluster
> > which had to be upgraded from Hammer 0.94.5 to Jewel 10.2.2.
> >
> > The cluster in this case is 99% RGW, but it also serves some RBD.
> >
> > I wanted to share some of the things we encountered during this upgrade.
> >
> > All 180 nodes are running CentOS 7.1 on an IPv6-only network.
> >
> > ** Hammer Upgrade **
> > At first we upgraded from 0.94.5 to 0.94.7. This went well, except that
> > the monitors got spammed with messages like:
> >
> >   "Failed to encode map eXXX with expected crc"
> >
> > Some searching on the list brought me to:
> >
> >   ceph tell osd.* injectargs -- --clog_to_monitors=false
> >
> > This reduced the load on the 5 monitors and made recovery succeed
> > smoothly.
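> >
> > Once everything had settled this could be reverted the same way (a small
> > sketch; clog_to_monitors defaults to true):
> >
> >   ceph tell osd.* injectargs -- --clog_to_monitors=true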
> >
> > ** Monitors to Jewel **
> > The next step was to upgrade the monitors from Hammer to Jewel.
> >
> > Using Salt we upgraded the packages, and afterwards it was simple:
> >
> >    killall ceph-mon
> >    chown -R ceph:ceph /var/lib/ceph
> >    chown -R ceph:ceph /var/log/ceph
> >
> > Now, a systemd quirk: 'systemctl start ceph.target' does not work. I
> > had to manually enable and start the monitor:
> >
> >   systemctl enable ceph-mon@srv-zmb04-05.service
> >   systemctl start ceph-mon@srv-zmb04-05.service
> >
> > Afterwards the monitors were running just fine.
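> >
> > If you want to verify that each monitor rejoined quorum before moving on
> > to the next one, something like this works (a minimal sketch; assumes jq
> > is installed, and the mon name is ours, substitute your own):
> >
> >   # block until this mon shows up in quorum_names again
> >   until ceph quorum_status | jq -er \
> >       '.quorum_names | index("srv-zmb04-05")' >/dev/null; do
> >       sleep 2
> >   done
> >   ceph mon stat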
> >
> > ** OSDs to Jewel **
> > To upgrade the OSDs to Jewel we first used Salt to update the packages
> > on all systems to 10.2.2; we then ran a shell script on one node at a
> > time.
> >
> > The failure domain here is 'rack', so we executed this one rack at a
> > time.
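> >
> > Between racks you will want the cluster to be healthy again before
> > continuing; a simple guard for that could be (a sketch, not part of our
> > script):
> >
> >   # wait until the cluster reports HEALTH_OK before doing the next rack
> >   while ! ceph health | grep -q HEALTH_OK; do sleep 30; done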
> >
> > The script can be found on GitHub:
> > https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
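> >
> > In essence the per-node procedure boils down to something like this (a
> > simplified sketch, not the exact script from the gist):
> >
> >   #!/bin/sh
> >   # stop all OSDs on this node, fix ownership for Jewel, start them again
> >   killall ceph-osd
> >   # wait for all OSD processes to exit cleanly
> >   while pgrep ceph-osd >/dev/null; do sleep 2; done
> >   chown -R ceph:ceph /var/lib/ceph /var/log/ceph
> >   for osd in /var/lib/ceph/osd/ceph-*; do
> >       id=$(basename "$osd" | cut -d- -f2)
> >       systemctl enable ceph-osd@$id.service
> >       systemctl start ceph-osd@$id.service
> >   done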
> >
> > Be aware that the chown can take a very, very long time!
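> >
> > One possible way to speed this up is to run the chown per OSD directory
> > in parallel, e.g. (an untested sketch):
> >
> >   # run up to 8 chowns at once, one per OSD data directory
> >   ls -d /var/lib/ceph/osd/ceph-* | xargs -P 8 -I{} chown -R ceph:ceph {}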
> >
> > We ran into an issue where some OSDs crashed after starting, but they
> > would come up after trying again. The crash was in:
> >
> >   "void FileStore::init_temp_collections()"
> >
> > I reported this in the tracker as I'm not sure what is happening here:
> > http://tracker.ceph.com/issues/16672
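> >
> > Until the root cause is known, a simple retry loop works around it (a
> > sketch; $id stands for the OSD id of the daemon that crashed):
> >
> >   for i in 1 2 3; do
> >       systemctl start ceph-osd@$id.service
> >       sleep 10
> >       # stop retrying once the daemon stays up
> >       systemctl is-active --quiet ceph-osd@$id.service && break
> >   done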
> >
> > ** New OSDs with Jewel **
> > We also had some new nodes which we wanted to add to the Jewel cluster.
> >
> > Using Salt and ceph-disk we ran into a partprobe issue in ceph-disk.
> > There was already a pull request with the fix, but it was not included
> > in Jewel 10.2.2.
> >
> > We manually applied the PR and it fixed our issues:
> > https://github.com/ceph/ceph/pull/9330
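> >
> > Applying it came down to patching the installed ceph-disk by hand;
> > roughly like this (a sketch; the install path and strip level depend on
> > the distro and on which files the PR touches, so check the patch first):
> >
> >   # GitHub serves any PR as a plain patch with a .patch suffix
> >   curl -sL https://github.com/ceph/ceph/pull/9330.patch -o 9330.patch
> >   # apply it to the installed ceph-disk code; verify paths before running
> >   patch -d /usr/lib/python2.7/site-packages -p3 < 9330.patch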
> >
> > Hope this helps other people with their upgrades to Jewel!
> >
> > Wido
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
