On Sat, Mar 11, 2017 at 12:21 PM, <cephmailingl...@mosibi.nl> wrote:
>
> The next and biggest problem we encountered had to do with the CRC errors on 
> the OSD map. On every map update, the OSDs that were not upgraded yet, got 
> that CRC error and asked the monitor for a full OSD map instead of just a 
> delta update. At first we did not understand what exactly happened, we ran 
> the upgrade per node using a script and in that script we watch the state of 
> the cluster and when the cluster is healthy again, we upgrade the next host. 
> Every time we started the script (skipping the already upgraded hosts) the 
> first host(s) upgraded without issues and then we got blocked I/O on the 
> cluster. The blocked I/O went away within a minute of 2 (not measured). After 
> investigation we found out that the blocked I/O happened when nodes where 
> asking the monitor for a (full) OSD map and that resulted shortly in a full 
> saturated network link on our monitor.


Thanks for the detailed upgrade report. I wanted to zoom in on this
CRC/fullmap issue because it could be quite disruptive for us when we
upgrade from hammer to jewel.

I've read various reports that the fool proof way to avoid the full
map DoS would be to upgrade all OSDs to jewel before the mon's.
Did anyone have success with that workaround? I'm cc'ing Bryan because
he knows this issue very well.

Cheers, Dan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to