Hello cephers, 

I need your help and suggestions on what is going on with my cluster. A few 
weeks ago I upgraded from Firefly to Giant. I've previously written about 
problems with Giant: within a two-week period the cluster's IO froze three 
times after ceph marked two osds down. I have just 17 osds in total across two 
osd servers, plus 3 mons. The cluster is running on Ubuntu 12.04 with the 
latest updates. 

I've got zabbix agents monitoring the osd servers and the cluster, so I get 
alerts about any issues, such as problems with PGs. Since upgrading to Giant, 
I frequently receive emails alerting me that the cluster has degraded PGs, 
around 10-15 per day. The number of degraded PGs varies from a couple to over 
a thousand. After several minutes the cluster repairs itself. The total number 
of PGs in the cluster is 4412 across all the pools. 
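
As a rough sanity check on PG density, the sketch below computes the average 
number of PG copies each osd carries, using the 4412 PGs and 17 osds mentioned 
above. The replica sizes in the loop are assumed values for illustration only, 
since the actual pool sizes are not stated here, and the calculation 
simplifies by treating every pool as having the same size. 

```python
# Rough PG-density check: average PG copies hosted per osd.
TOTAL_PGS = 4412  # total PGs across all pools (from the post)
NUM_OSDS = 17     # total osds across both servers (from the post)


def pg_copies_per_osd(total_pgs: int, num_osds: int, replica_size: int) -> float:
    """Average number of PG copies per osd, assuming one replica size for all pools."""
    return total_pgs * replica_size / num_osds


if __name__ == "__main__":
    # ASSUMPTION: replica sizes 2 and 3 are illustrative, not the real pool settings.
    for size in (2, 3):
        density = pg_copies_per_osd(TOTAL_PGS, NUM_OSDS, size)
        print(f"size={size}: about {density:.0f} PG copies per osd")
```

The result can be compared against the commonly cited guideline of roughly 
100 PGs per osd; a much higher figure means each osd has more peering and 
recovery work to do whenever an osd is marked down. 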

I am also seeing more alerts from vms reporting high IO wait, along with hung 
tasks. Some vms report over 50% IO wait. 

This did not happen on Firefly or on earlier releases of ceph. Not much has 
changed in the cluster since the upgrade to Giant: the networking and hardware 
are the same, and it is still running the same version of Ubuntu. The cluster 
load hasn't changed either. Thus, I think the issues above are related to the 
upgrade to Giant. 
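
One way to check that correlation would be to log the health summary over 
time and see whether each burst of degraded PGs lines up with osds being 
marked down. The sketch below only parses a summary line; the sample string 
and the exact wording it matches ("N pgs degraded", "X/Y in osds are down") 
are assumptions about the `ceph health` output format and should be checked 
against what the cluster really prints. 

```python
import re

# ASSUMPTION: these patterns mirror the wording of a `ceph health` summary line.
_DEGRADED_RE = re.compile(r"(\d+)\s+pgs\s+degraded")
_OSDS_DOWN_RE = re.compile(r"(\d+)/(\d+)\s+in\s+osds\s+are\s+down")


def parse_health(summary: str) -> dict:
    """Extract status, degraded PG count and down-osd count from one summary line."""
    tokens = summary.split()
    degraded = _DEGRADED_RE.search(summary)
    down = _OSDS_DOWN_RE.search(summary)
    return {
        "status": tokens[0] if tokens else "UNKNOWN",
        "degraded_pgs": int(degraded.group(1)) if degraded else 0,
        "osds_down": int(down.group(1)) if down else 0,
    }


if __name__ == "__main__":
    # Hypothetical sample line, not real output from this cluster.
    sample = "HEALTH_WARN 57 pgs degraded; 57 pgs stuck unclean; 2/17 in osds are down"
    print(parse_health(sample))
```

Recording one such dictionary per minute with a timestamp would show whether 
the degraded PG counts only rise when `osds_down` is non-zero. 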

Here is the ceph.conf that I use: 

[global] 
fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe 
mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib 
mon_host = 192.168.168.200,192.168.168.201,192.168.168.13 
auth_supported = cephx 
osd_journal_size = 10240 
filestore_xattr_use_omap = true 
public_network = 192.168.168.0/24 
rbd_default_format = 2 
osd_recovery_max_chunk = 8388608 
osd_recovery_op_priority = 1 
osd_max_backfills = 1 
osd_recovery_max_active = 1 
osd_recovery_threads = 1 
filestore_max_sync_interval = 15 
filestore_op_threads = 8 
filestore_merge_threshold = 40 
filestore_split_multiple = 8 
osd_disk_threads = 8 
osd_op_threads = 8 
osd_pool_default_pg_num = 1024 
osd_pool_default_pgp_num = 1024 
osd_crush_update_on_start = false 

[client] 
rbd_cache = true 
admin_socket = /var/run/ceph/$name.$pid.asok 
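
To make the units of the settings above easier to read, here is a minimal 
sketch that parses a subset of the [global] section with Python's 
configparser and converts two values. It assumes `osd_journal_size` is given 
in megabytes and `osd_recovery_max_chunk` in bytes; the `load_global` helper 
is a hypothetical name used only for this illustration. 

```python
import configparser

# Subset of the [global] section above, copied verbatim.
CEPH_CONF = """\
[global]
osd_journal_size = 10240
osd_recovery_max_chunk = 8388608
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_max_active = 1
filestore_max_sync_interval = 15
"""


def load_global(text: str) -> dict:
    """Return the [global] section as a plain dict of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["global"])


settings = load_global(CEPH_CONF)

# ASSUMPTION: osd_recovery_max_chunk is in bytes, osd_journal_size in megabytes.
chunk_mib = int(settings["osd_recovery_max_chunk"]) / 2**20
journal_gib = int(settings["osd_journal_size"]) / 1024

if __name__ == "__main__":
    print(f"recovery chunk: {chunk_mib:g} MiB, journal: {journal_gib:g} GiB per osd")
```

Note that this reads the file only; the values an osd is actually running 
with may differ if they were changed at runtime. 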


I would like to get to the bottom of these issues. I am not sure whether they 
could be fixed by changing some settings in ceph.conf or whether it would take 
a full downgrade back to Firefly. Is a downgrade even possible on a production 
cluster? 

Thanks for your help 

Andrei 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
