>You indicated that osd 12 and 16 were the ones marked down, but it 
>looks like only 0,1,2,3,7 were marked down in the ceph.log you sent. 
>The logs for 12 and 16 did indicate that they had been partitioned 
>from the other nodes. I'd bet that you are having intermittent 
>network trouble since the heartbeats are intermittently failing. 
>-Sam 

AM: I will check the logs further for osds 12 and 16. Perhaps I've missed
something, but the ceph osd tree output was showing 12 and 16 as down.
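
To cross-check this against the ceph.log Sam looked at, I was planning to grep
the cluster log for the failure/boot events of those two osds, roughly along
these lines (assuming the default log locations):

# grep -E 'osd\.(12|16) ' /var/log/ceph/ceph.log | grep -iE 'failed|marked|boot'
# ceph osd tree | grep -E 'osd\.(12|16)\b'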

Regarding the failing heartbeats, Wido has suggested that I investigate the
reason for their failure. The obvious thing to look at is the network, and
that is what I checked first. However, I do not see any signs of network
issues. There are no errors on the physical interface, and ifconfig shows only
a very small number of dropped TX packets (0.00006%) and 0 errors:


# ifconfig ib0 
ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00 
inet addr:192.168.168.200 Bcast:192.168.168.255 Mask:255.255.255.0 
inet6 addr: fe80::223:7dff:ff94:e2a5/64 Scope:Link 
UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1 
RX packets:1812895801 errors:0 dropped:52 overruns:0 frame:0 
TX packets:1835002992 errors:0 dropped:1037 overruns:0 carrier:0 
collisions:0 txqueuelen:2048 
RX bytes:6252740293262 (6.2 TB) TX bytes:11343307665152 (11.3 TB) 
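
The only other check I can think of is a sustained throughput and latency test
between the two osd servers over the IPoIB link, to try to catch any
intermittent drops. Something along these lines, assuming iperf is installed
on both nodes and that 192.168.168.201 is the second osd server:

# iperf -s                                    (on arh-ibstorage2-ib)
# iperf -c 192.168.168.201 -t 300 -i 5        (on arh-ibstorage1-ib)
# ping -I ib0 -i 0.2 -c 1000 192.168.168.201

Would that be a reasonable way to rule the network out, or is there something
better?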


How would I investigate what is happening with the heartbeats and the reason
for their failures? I suspect that getting to the bottom of this will also
resolve the frequent reports of degraded PGs on the cluster and the
intermittent high levels of IO wait on the vms.
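
Unless someone suggests a better approach, my rough plan is to bump the
heartbeat-related debug levels on the osds and then look through the osd and
cluster logs around the time the PGs go degraded, something like this (the
exact log messages may differ, so this is just a sketch):

# ceph tell osd.* injectargs '--debug_ms 1 --debug_osd 10'
# grep -i 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log
# grep -iE 'marked down|wrongly marked' /var/log/ceph/ceph.log

Does that sound sensible, or is there a better way to trace the heartbeat
failures?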

Also, as I've previously mentioned, the issues started after the upgrade to
Giant. I did not have these problems with the Firefly, Emperor or Dumpling
releases on the same hardware and with the same cluster load.

Thanks 

Andrei 



On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <and...@arhont.com> wrote: 
> Sam, 
> 
> Pastebin or similar will not take tens of megabytes worth of logs. If we are
> talking about the debug_ms 10 setting, I've got about 7GB worth of logs
> generated every half an hour or so. I'm not really sure what to do with that
> much data. Anything more constructive?
> 
> Thanks 
> ________________________________ 
> From: "Samuel Just" <sam.j...@inktank.com> 
> To: "Andrei Mikhailovsky" <and...@arhont.com> 
> Cc: ceph-users@lists.ceph.com 
> Sent: Tuesday, 18 November, 2014 8:53:47 PM 
> 
> Subject: Re: [ceph-users] Giant upgrade - stability issues 
> 
> pastebin or something, probably. 
> -Sam 
> 
> On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <and...@arhont.com> 
> wrote: 
>> Sam, the logs are rather large in size. Where should I post it to? 
>> 
>> Thanks 
>> ________________________________ 
>> From: "Samuel Just" <sam.j...@inktank.com> 
>> To: "Andrei Mikhailovsky" <and...@arhont.com> 
>> Cc: ceph-users@lists.ceph.com 
>> Sent: Tuesday, 18 November, 2014 7:54:56 PM 
>> Subject: Re: [ceph-users] Giant upgrade - stability issues 
>> 
>> 
>> Ok, why is ceph marking osds down? Post your ceph.log from one of the 
>> problematic periods. 
>> -Sam 
>> 
>> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <and...@arhont.com> 
>> wrote: 
>>> Hello cephers, 
>>> 
>>> I need your help and suggestions on what is going on with my cluster. A
>>> few weeks ago I upgraded from Firefly to Giant. I've previously written
>>> about having issues with Giant where, over a two-week period, the
>>> cluster's IO froze three times after ceph marked two osds down. I have in
>>> total just 17 osds between two osd servers, plus 3 mons. The cluster is
>>> running on Ubuntu 12.04 with the latest updates.
>>> 
>>> I've got zabbix agents monitoring the osd servers and the cluster, and I
>>> get alerts of any issues, such as problems with PGs, etc. Since upgrading
>>> to Giant, I am frequently seeing emails alerting me that the cluster has
>>> degraded PGs - around 10-15 such emails per day. The number of degraded
>>> PGs varies from a couple to over a thousand. After several minutes the
>>> cluster repairs itself. The total number of PGs in the cluster is 4412
>>> across all the pools.
>>> 
>>> I am also seeing more alerts from vms reporting high IO wait, as well as
>>> hung tasks. Some vms are reporting over 50% io wait.
>>> 
>>> This did not happen on Firefly or the previous releases of ceph. Not much
>>> has changed in the cluster since the upgrade to Giant: the networking and
>>> hardware are still the same, and it is still running the same version of
>>> Ubuntu. The cluster load hasn't changed either. Thus, I think the issues
>>> above are related to the upgrade of ceph to Giant.
>>> 
>>> Here is the ceph.conf that I use: 
>>> 
>>> [global] 
>>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe 
>>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, 
>>> arh-cloud13-ib 
>>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13 
>>> auth_supported = cephx 
>>> osd_journal_size = 10240 
>>> filestore_xattr_use_omap = true 
>>> public_network = 192.168.168.0/24 
>>> rbd_default_format = 2 
>>> osd_recovery_max_chunk = 8388608 
>>> osd_recovery_op_priority = 1 
>>> osd_max_backfills = 1 
>>> osd_recovery_max_active = 1 
>>> osd_recovery_threads = 1 
>>> filestore_max_sync_interval = 15 
>>> filestore_op_threads = 8 
>>> filestore_merge_threshold = 40 
>>> filestore_split_multiple = 8 
>>> osd_disk_threads = 8 
>>> osd_op_threads = 8 
>>> osd_pool_default_pg_num = 1024 
>>> osd_pool_default_pgp_num = 1024 
>>> osd_crush_update_on_start = false 
>>> 
>>> [client] 
>>> rbd_cache = true 
>>> admin_socket = /var/run/ceph/$name.$pid.asok 
>>> 
>>> 
>>> I would like to get to the bottom of these issues. I am not sure whether
>>> they could be fixed by changing some settings in ceph.conf or whether a
>>> full downgrade back to Firefly is needed. Is a downgrade even possible on
>>> a production cluster?
>>> 
>>> Thanks for your help 
>>> 
>>> Andrei 
>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 
>> 
> 
