Sam, I've done more network testing, this time over 2 days, and I believe I have enough evidence to conclude that the osd disconnects are not caused by the network. I ran about 140 million TCP connections against each osd and host server over the course of roughly two days, generating about 800-900 connections per second. I did not see a single error or packet drop, and both the latency and its standard deviation were minimal.
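For reference, the long runs used essentially the same nping invocation as the 12-hour test quoted further down, only with a larger connection count; roughly along these lines (the exact count and output file name here are approximate):

# per-target invocation for the ~2-day run; count and output path are approximate
nping --tcp-connect -p 22 --delay 1ms <hostname> -v2 -c 140000000 | gzip > /root/nping-hostname-2day-output.gz

With a 1ms inter-connection delay the theoretical ceiling is 1000 connections per second, so the observed 800-900 per second is consistent with that once handshake time is included.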
While the tests were running I did see a number of osds being marked as down by other osds. According to the logs it happened at least 3 times over the two days. However, this time the cluster IO remained available; the osds simply came back with the message that they had been wrongly marked down. I was not able to set full debug logging on the cluster as it would have consumed the disk space in less than 30 mins, so I am not really sure how to debug this particular problem (one idea for limiting the logging window is at the bottom of this message). What I have done is reboot both osd servers, and so far I've not seen any osd disconnects. The servers have been up for 3 days already. Perhaps the problem could be down to kernel stability, but if that were the case I would have seen similar issues on Firefly, which I did not. Not sure what to think now.

Andrei

----- Original Message -----
> From: "Andrei Mikhailovsky" <and...@arhont.com>
> To: sj...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, 20 November, 2014 4:50:21 PM
> Subject: Re: [ceph-users] Giant upgrade - stability issues
>
> Thanks, I will try that.
> Andrei
>
> ----- Original Message -----
> From: "Samuel Just" <sam.j...@inktank.com>
> To: "Andrei Mikhailovsky" <and...@arhont.com>
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, 20 November, 2014 4:26:00 PM
> Subject: Re: [ceph-users] Giant upgrade - stability issues
>
> You can try to capture logging at
> debug osd = 20
> debug ms = 20
> debug filestore = 20
> while an osd is misbehaving.
> -Sam
>
> On Thu, Nov 20, 2014 at 7:34 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > Sam,
> >
> > further to your email I have done the following:
> >
> > 1. Upgraded both osd servers with the latest updates and restarted each server in turn
> > 2. fired up nping utility to generate TCP connections (3 way handshake) from each of the servers as well as from the host servers. In total i've ran 5 tests. The nping utility was establishing connects on port 22 (as all servers have this port open) with the delay of 1ms. The command used to generate the traffic was as follows:
> >
> > nping --tcp-connect -p 22 --delay 1ms <hostname> -v2 -c 36000000 | gzip >/root/nping-hostname-output.gz
> >
> > The tests took just over 12 hours to complete. The results did not show any problems as far as I can see. Here is the tailed output of one of the findings:
> >
> > SENT (37825.7303s) Starting TCP Handshake > arh-ibstorage1-ib:22 (192.168.168.200:22)
> > RECV (37825.7303s) Handshake with arh-ibstorage1-ib:22 (192.168.168.200:22) completed
> >
> > Max rtt: 4.447ms | Min rtt: 0.008ms | Avg rtt: 0.008ms
> > TCP connection attempts: 36000000 | Successful connections: 36000000 | Failed: 0 (0.00%)
> > Tx time: 37825.72833s | Tx bytes/s: 76138.65 | Tx pkts/s: 951.73
> > Rx time: 37825.72939s | Rx bytes/s: 38069.33 | Rx pkts/s: 951.73
> > Nping done: 1 IP address pinged in 37844.55 seconds
> >
> > As you can see from the above, there are no failed connects at all from the 36 million established connections. The average delay is 0.008ms and it was sending on average almost 1000 packets per second. I've got the same results from other servers.
> >
> > Unless you have other tests in mind, I think there are no issues with the network.
> >
> > I fire up another test for 24 hours this time to see if it makes a difference.
> >
> > Thanks
> > Andrei
> >
> > ________________________________
> > From: "Samuel Just" <sam.j...@inktank.com>
> > To: "Andrei Mikhailovsky" <and...@arhont.com>
> > Cc: ceph-users@lists.ceph.com
> > Sent: Wednesday, 19 November, 2014 9:45:40 PM
> > Subject: Re: [ceph-users] Giant upgrade - stability issues
> >
> > Well, the heartbeats are failing due to networking errors preventing the heartbeats from arriving. That is causing osds to go down, and that is causing pgs to become degraded. You'll have to work out what is preventing the tcp connections from being stable.
> > -Sam
> >
> > On Wed, Nov 19, 2014 at 1:39 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> >>
> >>> You indicated that osd 12 and 16 were the ones marked down, but it looks like only 0,1,2,3,7 were marked down in the ceph.log you sent. The logs for 12 and 16 did indicate that they had been partitioned from the other nodes. I'd bet that you are having intermittent network trouble since the heartbeats are intermittently failing.
> >>> -Sam
> >>
> >> AM: I will check the logs further for the osds 12 and 16. Perhaps I've missed something, but the ceph osd tree output was showing 12 and 16 as down.
> >>
> >> Regarding the failure of heartbeats, Wido has suggested that I should investigate the reason for it's failure. The obvious thing to look at is the network and this is what I've initially done. However, I do not see any signs of the network issues. There are no errors on the physical interface and ifconfig is showing a very small number of TX dropped packets (0.00006%) and 0 errors:
> >>
> >> # ifconfig ib0
> >> ib0  Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
> >>      inet addr:192.168.168.200  Bcast:192.168.168.255  Mask:255.255.255.0
> >>      inet6 addr: fe80::223:7dff:ff94:e2a5/64 Scope:Link
> >>      UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >>      RX packets:1812895801 errors:0 dropped:52 overruns:0 frame:0
> >>      TX packets:1835002992 errors:0 dropped:1037 overruns:0 carrier:0
> >>      collisions:0 txqueuelen:2048
> >>      RX bytes:6252740293262 (6.2 TB)  TX bytes:11343307665152 (11.3 TB)
> >>
> >> How would I investigate what is happening with the hearbeats and the reason for their failures? I have a suspetion that this will solve the issues with frequent reporting of degraded PGs on the cluster and intermittent high levels of IO wait on vms.
> >>
> >> And also, as i've previously mentioned, the issues started to happen after the upgrade to Giant. I've not had these problems with Firefly, Emperor or Dumpling releases on the same hardware and same cluster loads.
> >>
> >> Thanks
> >>
> >> Andrei
> >>
> >> On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> >>> Sam,
> >>>
> >>> Pastebin or similar will not take tens of megabytes worth of logs. If we are talking about debug_ms 10 setting, I've got about 7gb worth of logs generated every half an hour or so. Not really sure what to do with that much data. Anything more constructive?
> >>>
> >>> Thanks
> >>> ________________________________
> >>> From: "Samuel Just" <sam.j...@inktank.com>
> >>> To: "Andrei Mikhailovsky" <and...@arhont.com>
> >>> Cc: ceph-users@lists.ceph.com
> >>> Sent: Tuesday, 18 November, 2014 8:53:47 PM
> >>> Subject: Re: [ceph-users] Giant upgrade - stability issues
> >>>
> >>> pastebin or something, probably.
> >>> -Sam
> >>>
> >>> On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> >>>> Sam, the logs are rather large in size. Where should I post it to?
> >>>>
> >>>> Thanks
> >>>> ________________________________
> >>>> From: "Samuel Just" <sam.j...@inktank.com>
> >>>> To: "Andrei Mikhailovsky" <and...@arhont.com>
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Sent: Tuesday, 18 November, 2014 7:54:56 PM
> >>>> Subject: Re: [ceph-users] Giant upgrade - stability issues
> >>>>
> >>>> Ok, why is ceph marking osds down? Post your ceph.log from one of the problematic periods.
> >>>> -Sam
> >>>>
> >>>> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> >>>>> Hello cephers,
> >>>>>
> >>>>> I need your help and suggestion on what is going on with my cluster. A few weeks ago i've upgraded from Firefly to Giant. I've previously written about having issues with Giant where in two weeks period the cluster's IO froze three times after ceph down-ed two osds. I have in total just 17 osds between two osd servers, 3 mons. The cluster is running on Ubuntu 12.04 with latest updates.
> >>>>>
> >>>>> I've got zabbix agents monitoring the osd servers and the cluster. I get alerts of any issues, such as problems with PGs, etc. Since upgrading to Giant, I am now frequently seeing emails alerting of the cluster having degraded PGs. I am getting around 10-15 such emails per day stating that the cluster has degraded PGs. The number of degraded PGs very between a couple of PGs to over a thousand. After several minutes the cluster repairs itself. The total number of PGs in the cluster is 4412 between all the pools.
> >>>>>
> >>>>> I am also seeing more alerts from vms stating that there is a high IO wait and also seeing hang tasks. Some vms reporting over 50% io wait.
> >>>>>
> >>>>> This has not happened on Firefly or the previous releases of ceph. Not much has changed in the cluster since the upgrade to Giant. Networking and hardware is still the same and it is still running the same version of Ubuntu OS. The cluster load hasn't changed as well. Thus, I think the issues above are related to the upgrade of ceph to Giant.
> >>>>>
> >>>>> Here is the ceph.conf that I use:
> >>>>>
> >>>>> [global]
> >>>>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
> >>>>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
> >>>>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
> >>>>> auth_supported = cephx
> >>>>> osd_journal_size = 10240
> >>>>> filestore_xattr_use_omap = true
> >>>>> public_network = 192.168.168.0/24
> >>>>> rbd_default_format = 2
> >>>>> osd_recovery_max_chunk = 8388608
> >>>>> osd_recovery_op_priority = 1
> >>>>> osd_max_backfills = 1
> >>>>> osd_recovery_max_active = 1
> >>>>> osd_recovery_threads = 1
> >>>>> filestore_max_sync_interval = 15
> >>>>> filestore_op_threads = 8
> >>>>> filestore_merge_threshold = 40
> >>>>> filestore_split_multiple = 8
> >>>>> osd_disk_threads = 8
> >>>>> osd_op_threads = 8
> >>>>> osd_pool_default_pg_num = 1024
> >>>>> osd_pool_default_pgp_num = 1024
> >>>>> osd_crush_update_on_start = false
> >>>>>
> >>>>> [client]
> >>>>> rbd_cache = true
> >>>>> admin_socket = /var/run/ceph/$name.$pid.asok
> >>>>>
> >>>>> I would like to get to the bottom of these issues. Not sure if the issues could be fixed with changing some settings in ceph.conf or a full downgrade back to the Firefly. Is the downgrade even possible on a production cluster?
> >>>>>
> >>>>> Thanks for your help
> >>>>>
> >>>>> Andrei
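Coming back to the logging problem I mentioned at the top of this message: one idea I have not tried yet is to raise the debug levels you suggested only on a suspect osd, and only while it is misbehaving, then drop them straight back down so the logs do not fill the disk. Something like this (untested on my side; the osd id is just an example and the reset values are from memory, so treat them as approximate):

# raise logging on a single suspect osd while the problem is happening
ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 20 --debug-filestore 20'
# ...wait for / reproduce the next "wrongly marked me down" event, then revert...
ceph tell osd.12 injectargs '--debug-osd 0/5 --debug-ms 0/5 --debug-filestore 1/3'

The heartbeat_check and "wrongly marked me down" lines in /var/log/ceph/ceph-osd.12.log around those times should then show which peers the failed heartbeats involved.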
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com