Hello,

For the last 3 days I have been working at a customer's site on an 1800 OSD 
cluster which had to be upgraded from Hammer 0.94.5 to Jewel 10.2.2.

The cluster in this case is 99% RGW, but it also serves some RBD.

I wanted to share some of the things we encountered during this upgrade.

All 180 nodes are running CentOS 7.1 on an IPv6-only network.

** Hammer Upgrade **
At first we upgraded from 0.94.5 to 0.94.7. This went well, except that the 
monitors got spammed with messages like:

  "Failed to encode map eXXX with expected crc"

Some searching on the list brought me to:

  ceph tell osd.* injectargs -- --clog_to_monitors=false
  
This reduced the load on the 5 monitors and made recovery succeed smoothly.
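
Once the cluster has settled you probably want the OSDs to log to the monitors 
again; the same injectargs call with 'true' should reverse it:

  ceph tell osd.* injectargs -- --clog_to_monitors=true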
 
** Monitors to Jewel **
The next step was to upgrade the monitors from Hammer to Jewel.

Using Salt we upgraded the packages; after that it was simple:

  killall ceph-mon
  chown -R ceph:ceph /var/lib/ceph
  chown -R ceph:ceph /var/log/ceph

Now, a systemd quirk: 'systemctl start ceph.target' did not work, so I had to 
manually enable the monitor and start it:

  systemctl enable ceph-mon@srv-zmb04-05.service
  systemctl start ceph-mon@srv-zmb04-05.service

Afterwards the monitors were running just fine.
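
Before moving on to the next monitor it is worth checking that it rejoined 
quorum and actually runs Jewel. A quick check, using our hostname as an example:

  # are all 5 monitors back in quorum?
  ceph quorum_status --format json-pretty

  # does the restarted monitor report the new version?
  ceph tell mon.srv-zmb04-05 version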

** OSDs to Jewel **
To upgrade the OSDs to Jewel we first used Salt to update the packages on all 
systems to 10.2.2. We then used a shell script which we ran on one node at a 
time.

The failure domain here is 'rack', so we ran the script rack by rack: one rack, 
then the next, and so on.
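
Which host sits in which rack can be verified beforehand by looking at the 
CRUSH tree:

  # shows the rack -> host -> osd hierarchy
  ceph osd tree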

The script can be found on GitHub: 
https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
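
For a rough idea, the per-node work comes down to something like the sketch 
below (the gist is the authoritative version; this assumes the OSD data 
directories live under /var/lib/ceph/osd/ceph-<id>):

  # stop the Hammer OSDs on this node
  killall ceph-osd
  sleep 10

  # Jewel runs as the 'ceph' user, so ownership has to change
  chown -R ceph:ceph /var/lib/ceph
  chown -R ceph:ceph /var/log/ceph

  # enable and start every OSD on this node under systemd
  for id in $(ls /var/lib/ceph/osd/ | sed 's/^ceph-//'); do
      systemctl enable ceph-osd@$id.service
      systemctl start ceph-osd@$id.service
  done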

Be aware that the chown can take a long, long, very long time!
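
If that becomes a problem, one option might be to chown the big OSD data 
directories in parallel instead of in one recursive run (an untested 
suggestion):

  # chown the large OSD data directories 8 at a time
  ls -d /var/lib/ceph/osd/ceph-* | xargs -n 1 -P 8 chown -R ceph:ceph
  # the rest of /var/lib/ceph and /var/log/ceph still needs a normal chown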

We ran into the issue that some OSDs crashed right after starting, but on a 
second attempt they would come up fine. The crash happened in:

  "void FileStore::init_temp_collections()"
  
I reported this in the tracker as I'm not sure what is happening here: 
http://tracker.ceph.com/issues/16672
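
Since a second start attempt was always enough, something like this rough 
sketch can pick up the OSDs that did not come up on a node and start them again:

  # restart any OSD on this node that is not active
  for id in $(ls /var/lib/ceph/osd/ | sed 's/^ceph-//'); do
      systemctl is-active --quiet ceph-osd@$id.service || \
          systemctl restart ceph-osd@$id.service
  done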

** New OSDs with Jewel **
We also had some new nodes which we wanted to add to the Jewel cluster.

While deploying them with Salt and ceph-disk we ran into a partprobe issue in 
ceph-disk. There was already a Pull Request with the fix, but it was not 
included in Jewel 10.2.2.

We manually applied the PR and it fixed our issues: 
https://github.com/ceph/ceph/pull/9330
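
For anyone in the same situation: GitHub serves every pull request as a plain 
patch file, which makes applying it by hand fairly easy (a sketch; whether you 
patch a source checkout or the installed ceph-disk is up to you):

  # fetch the pull request as a patch
  curl -L -o pr9330.patch https://github.com/ceph/ceph/pull/9330.patch

  # review it, then apply it to a checkout of the matching release
  git clone https://github.com/ceph/ceph.git
  cd ceph && git checkout v10.2.2
  git apply ../pr9330.patch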

Hope this helps other people with their upgrades to Jewel!

Wido