One more thing I forgot to mention: all of the failures I've seen happen while a deep scrubbing process is running.
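To confirm that correlation, one option would be to check whether any PGs are in a deep scrub state at the moment the osds get marked down, and then to switch deep scrubbing off for a day or two and see whether the disconnects stop as well. Roughly along these lines (a sketch only; the command and flag names are from memory, so worth double-checking against the Giant documentation before relying on them):

# see whether anything is deep scrubbing right now
ceph -s | grep -i scrub
ceph pg dump pgs_brief | grep -i 'scrubbing+deep'

# temporarily stop new deep scrubs cluster-wide
ceph osd set nodeep-scrub

# ... watch the cluster for a day or two, then re-enable deep scrubs ...
ceph osd unset nodeep-scrub

If the disconnects stop while the nodeep-scrub flag is set, that would point at deep scrub load on the osds rather than at the network.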
Andrei

----- Original Message -----
> From: "Andrei Mikhailovsky" <and...@arhont.com>
> To: sj...@redhat.com
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, 27 November, 2014 1:05:30 PM
> Subject: Re: [ceph-users] Giant upgrade - stability issues
>
> Sam,
>
> I've done more network testing, this time over two days, and I believe I have enough evidence to conclude that the osd disconnects are not caused by the network. I have run about 140 million TCP connects against each osd and host server over the course of about two days, generating about 800-900 connections per second. I've not had a single error or packet drop, and the latency / standard deviation was very small.
>
> While the tests were running I did see a number of osds being marked down by other osds. According to the logs it happened at least three times over the two days. However, this time the cluster IO remained available; the osds simply reconnected with the message that they had been wrongly marked down.
>
> I was not able to set full debug logging on the cluster, as it would have consumed the disk space in less than 30 minutes, so I am not really sure how to debug this particular problem.
>
> What I have done is reboot both osd servers, and so far I've not seen any osd disconnects; the servers have been up for three days already. Perhaps the problem comes down to kernel stability, but if that were the case I would have seen similar issues on Firefly, which I did not. Not sure what to think now.
>
> Andrei
>
> ----- Original Message -----
> > From: "Andrei Mikhailovsky" <and...@arhont.com>
> > To: sj...@redhat.com
> > Cc: ceph-users@lists.ceph.com
> > Sent: Thursday, 20 November, 2014 4:50:21 PM
> > Subject: Re: [ceph-users] Giant upgrade - stability issues
> >
> > Thanks, I will try that.
> >
> > Andrei
> >
> > ----- Original Message -----
> > > From: "Samuel Just" <sam.j...@inktank.com>
> > > To: "Andrei Mikhailovsky" <and...@arhont.com>
> > > Cc: ceph-users@lists.ceph.com
> > > Sent: Thursday, 20 November, 2014 4:26:00 PM
> > > Subject: Re: [ceph-users] Giant upgrade - stability issues
> > >
> > > You can try to capture logging at
> > >
> > > debug osd = 20
> > > debug ms = 20
> > > debug filestore = 20
> > >
> > > while an osd is misbehaving.
> > > -Sam
> > >
> > > On Thu, Nov 20, 2014 at 7:34 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > > > Sam,
> > > >
> > > > further to your email I have done the following:
> > > >
> > > > 1. Upgraded both osd servers with the latest updates and restarted each server in turn.
> > > > 2. Fired up the nping utility to generate TCP connections (3-way handshakes) from each of the osd servers as well as from the host servers. In total I've run 5 tests. The nping utility was establishing connections on port 22 (as all servers have this port open) with a delay of 1ms. The command used to generate the traffic was as follows:
> > > >
> > > > nping --tcp-connect -p 22 --delay 1ms <hostname> -v2 -c 36000000 | gzip >>/root/nping-hostname-output.gz
> > > >
> > > > The tests took just over 12 hours to complete. The results did not show any problems as far as I can see.
> > > > Here is the tail end of the output from one of the tests:
> > > >
> > > > SENT (37825.7303s) Starting TCP Handshake > arh-ibstorage1-ib:22 (192.168.168.200:22)
> > > > RECV (37825.7303s) Handshake with arh-ibstorage1-ib:22 (192.168.168.200:22) completed
> > > >
> > > > Max rtt: 4.447ms | Min rtt: 0.008ms | Avg rtt: 0.008ms
> > > > TCP connection attempts: 36000000 | Successful connections: 36000000 | Failed: 0 (0.00%)
> > > > Tx time: 37825.72833s | Tx bytes/s: 76138.65 | Tx pkts/s: 951.73
> > > > Rx time: 37825.72939s | Rx bytes/s: 38069.33 | Rx pkts/s: 951.73
> > > > Nping done: 1 IP address pinged in 37844.55 seconds
> > > >
> > > > As you can see from the above, there are no failed connects at all out of the 36 million established connections. The average delay is 0.008ms and it was sending on average almost 1000 packets per second. I've got the same results from the other servers.
> > > >
> > > > Unless you have other tests in mind, I think there are no issues with the network.
> > > >
> > > > I will fire up another test, for 24 hours this time, to see if it makes a difference.
> > > >
> > > > Thanks
> > > >
> > > > Andrei
> > > >
> > > > ________________________________
> > > > From: "Samuel Just" <sam.j...@inktank.com>
> > > > To: "Andrei Mikhailovsky" <and...@arhont.com>
> > > > Cc: ceph-users@lists.ceph.com
> > > > Sent: Wednesday, 19 November, 2014 9:45:40 PM
> > > > Subject: Re: [ceph-users] Giant upgrade - stability issues
> > > >
> > > > Well, the heartbeats are failing due to networking errors preventing the heartbeats from arriving. That is causing osds to go down, and that is causing pgs to become degraded. You'll have to work out what is preventing the tcp connections from being stable.
> > > > -Sam
> > > >
> > > > On Wed, Nov 19, 2014 at 1:39 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > > > >
> > > > > > You indicated that osd 12 and 16 were the ones marked down, but it looks like only 0,1,2,3,7 were marked down in the ceph.log you sent. The logs for 12 and 16 did indicate that they had been partitioned from the other nodes. I'd bet that you are having intermittent network trouble since the heartbeats are intermittently failing.
> > > > > > -Sam
> > > > >
> > > > > AM: I will check the logs further for osds 12 and 16. Perhaps I've missed something, but the ceph osd tree output was showing 12 and 16 as down.
> > > > >
> > > > > Regarding the failure of the heartbeats, Wido has suggested that I should investigate the reason for their failure. The obvious thing to look at is the network, and this is what I've initially done. However, I do not see any signs of network issues.
> > > > > There are no errors on the physical interface, and ifconfig is showing a very small number of TX dropped packets (0.00006%) and 0 errors:
> > > > >
> > > > > # ifconfig ib0
> > > > > ib0  Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
> > > > >      inet addr:192.168.168.200  Bcast:192.168.168.255  Mask:255.255.255.0
> > > > >      inet6 addr: fe80::223:7dff:ff94:e2a5/64 Scope:Link
> > > > >      UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> > > > >      RX packets:1812895801 errors:0 dropped:52 overruns:0 frame:0
> > > > >      TX packets:1835002992 errors:0 dropped:1037 overruns:0 carrier:0
> > > > >      collisions:0 txqueuelen:2048
> > > > >      RX bytes:6252740293262 (6.2 TB)  TX bytes:11343307665152 (11.3 TB)
> > > > >
> > > > > How would I investigate what is happening with the heartbeats and the reason for their failures? I have a suspicion that this will solve the issues with the frequent reporting of degraded PGs on the cluster and the intermittent high levels of IO wait on the vms.
> > > > >
> > > > > Also, as I've previously mentioned, the issues started to happen after the upgrade to Giant. I've not had these problems with the Firefly, Emperor or Dumpling releases on the same hardware and with the same cluster loads.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Andrei
> > > > >
> > > > > On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > > > > > Sam,
> > > > > >
> > > > > > Pastebin or similar will not take tens of megabytes worth of logs. If we are talking about the debug_ms 10 setting, I've got about 7GB worth of logs generated every half an hour or so. Not really sure what to do with that much data. Anything more constructive?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > ________________________________
> > > > > > From: "Samuel Just" <sam.j...@inktank.com>
> > > > > > To: "Andrei Mikhailovsky" <and...@arhont.com>
> > > > > > Cc: ceph-users@lists.ceph.com
> > > > > > Sent: Tuesday, 18 November, 2014 8:53:47 PM
> > > > > > Subject: Re: [ceph-users] Giant upgrade - stability issues
> > > > > >
> > > > > > pastebin or something, probably.
> > > > > > -Sam
> > > > > >
> > > > > > On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > > > > > > Sam, the logs are rather large in size. Where should I post them?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: "Samuel Just" <sam.j...@inktank.com>
> > > > > > > To: "Andrei Mikhailovsky" <and...@arhont.com>
> > > > > > > Cc: ceph-users@lists.ceph.com
> > > > > > > Sent: Tuesday, 18 November, 2014 7:54:56 PM
> > > > > > > Subject: Re: [ceph-users] Giant upgrade - stability issues
> > > > > > >
> > > > > > > Ok, why is ceph marking osds down? Post your ceph.log from one of the problematic periods.
> > > > > > > -Sam
> > > > > > >
> > > > > > > On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > > > > > > > Hello cephers,
> > > > > > > >
> > > > > > > > I need your help and suggestions on what is going on with my cluster. A few weeks ago I upgraded from Firefly to Giant.
> > > > > > > > I've previously written about having issues with Giant where, in a two-week period, the cluster's IO froze three times after ceph marked two osds down. I have in total just 17 osds between two osd servers, and 3 mons. The cluster is running on Ubuntu 12.04 with the latest updates.
> > > > > > > >
> > > > > > > > I've got zabbix agents monitoring the osd servers and the cluster, and I get alerts about any issues, such as problems with PGs, etc. Since upgrading to Giant, I am now frequently seeing emails alerting me that the cluster has degraded PGs. I am getting around 10-15 such emails per day. The number of degraded PGs varies between a couple of PGs and over a thousand. After several minutes the cluster repairs itself. The total number of PGs in the cluster is 4412 across all the pools.
> > > > > > > >
> > > > > > > > I am also seeing more alerts from vms reporting high IO wait, as well as hung tasks. Some vms are reporting over 50% io wait.
> > > > > > > >
> > > > > > > > This did not happen on Firefly or the previous releases of ceph. Not much has changed in the cluster since the upgrade to Giant: the networking and hardware are still the same, it is still running the same version of Ubuntu, and the cluster load hasn't changed either. Thus, I think the issues above are related to the upgrade of ceph to Giant.
> > > > > > > >
> > > > > > > > Here is the ceph.conf that I use:
> > > > > > > >
> > > > > > > > [global]
> > > > > > > > fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
> > > > > > > > mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
> > > > > > > > mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
> > > > > > > > auth_supported = cephx
> > > > > > > > osd_journal_size = 10240
> > > > > > > > filestore_xattr_use_omap = true
> > > > > > > > public_network = 192.168.168.0/24
> > > > > > > > rbd_default_format = 2
> > > > > > > > osd_recovery_max_chunk = 8388608
> > > > > > > > osd_recovery_op_priority = 1
> > > > > > > > osd_max_backfills = 1
> > > > > > > > osd_recovery_max_active = 1
> > > > > > > > osd_recovery_threads = 1
> > > > > > > > filestore_max_sync_interval = 15
> > > > > > > > filestore_op_threads = 8
> > > > > > > > filestore_merge_threshold = 40
> > > > > > > > filestore_split_multiple = 8
> > > > > > > > osd_disk_threads = 8
> > > > > > > > osd_op_threads = 8
> > > > > > > > osd_pool_default_pg_num = 1024
> > > > > > > > osd_pool_default_pgp_num = 1024
> > > > > > > > osd_crush_update_on_start = false
> > > > > > > >
> > > > > > > > [client]
> > > > > > > > rbd_cache = true
> > > > > > > > admin_socket = /var/run/ceph/$name.$pid.asok
> > > > > > > >
> > > > > > > > I would like to get to the bottom of these issues. I am not sure whether they could be fixed by changing some settings in ceph.conf or by a full downgrade back to Firefly. Is a downgrade even possible on a production cluster?
> > > > > > > > Thanks for your help
> > > > > > > >
> > > > > > > > Andrei
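PS: on the point above about full debug logging eating the disk space in under 30 minutes, one workaround (again only a sketch; osd.12 is just an example, and the default debug levels may differ on your release) would be to raise the debug levels on a single suspect osd for a short window via injectargs, rather than cluster-wide in ceph.conf:

# bump logging on one osd while it is misbehaving
ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 20 --debug-filestore 20'

# ... capture 10-15 minutes of logs around a disconnect, then drop the levels back ...
ceph tell osd.12 injectargs '--debug-osd 0/5 --debug-ms 0/5 --debug-filestore 1/3'

That limits the log volume to one daemon for a few minutes instead of several gigabytes per half hour across the whole cluster.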
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com