Following Jake's recommendation I have updated my sysctl.conf file, and it seems to have fixed the problem of osds being marked down by their osd peers: the cluster has now been stable for three days. I am currently using the following settings in sysctl.conf:
# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 134217728
net.core.wmem_default = 134217728
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10
# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0
# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
# Mellanox recommended changes
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
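To load these at runtime rather than waiting for a reboot (assuming the file above is /etc/sysctl.conf), something like the following should work; the second command merely spot-checks two of the keys:

sysctl -p /etc/sysctl.conf
sysctl net.core.rmem_max net.ipv4.tcp_rmem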
Jake, thanks for your suggestions.

Andrei

----- Original Message -----
> From: "Jake Young" <jak3...@gmail.com>
> To: "Andrei Mikhailovsky" <and...@arhont.com>, ceph-users@lists.ceph.com
> Sent: Saturday, 6 December, 2014 5:02:15 PM
> Subject: Re: [ceph-users] Giant osd problems - loss of IO
>
> Forgot to copy the list.
>
> I basically cobbled together the settings from examples on the internet. I started from the sysctl.conf file on this page, with the author's suggestions for 10gb nics:
>
> http://www.nateware.com/linux-network-tuning-for-2013.html#.VIG_44eLTII
>
> I found these sites helpful as well:
>
> http://fasterdata.es.net/host-tuning/linux/
>
> This may be of interest to you; it has suggestions for your Mellanox hardware:
>
> https://fasterdata.es.net/host-tuning/nic-tuning/mellanox-connectx-3/
>
> Fermilab website, with a link to a university research paper:
>
> https://indico.fnal.gov/getFile.py/access?contribId=30&sessionId=19&resId=0&materialId=paper&confId=3377
>
> This has a great answer that explains different configurations for servers vs clients. It seems to me that osds are both servers and clients, so maybe some of the client tuning would benefit osds as well. This is where I got the somaxconn setting from:
>
> http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
>
> I forgot to mention, I'm also setting the txqueuelen for my ceph public nic and ceph private nic in the /etc/rc.local file:
>
> /sbin/ifconfig eth0 txqueuelen 10000
> /sbin/ifconfig eth1 txqueuelen 10000
>
> I push the same sysctl.conf and rc.local to all of my clients as well. The clients are iSCSI servers which serve vmware hosts. My ceph cluster is rbd only, and I currently only have the iSCSI proxy server clients. We'll be adding some KVM hypervisors soon; I'm interested to see how they perform vs my vmware --> iSCSI server --> Ceph setup.
>
> Regarding your sysctl.conf file:
>
> I've read on a few different sites that net.ipv4.tcp_mem should not be tuned, since the defaults are good. I have not set it, and I can't speak to the benefits/problems of setting it.
>
> You're configured to only use a 4MB TCP buffer, which is very small. It is actually smaller than the default for tcp_wmem, which is 6MB. The link above suggests up to a 128MB TCP buffer for the 40gb Mellanox and/or 10gb over a WAN (not sure how to read that). I'm using a 54MB buffer, but I may increase mine to 128MB to see if there is any benefit. That 4MB buffer may be your problem.
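For reference, TCP buffer sizing like this usually follows the bandwidth-delay product (BDP = bandwidth x round-trip time). A rough worked example, assuming a 10 Gbit/s link:

  LAN, ~0.5 ms RTT: 10e9 / 8 * 0.0005 = ~0.6 MB
  WAN, ~40 ms RTT:  10e9 / 8 * 0.040  = ~50 MB (close to the 54M = 56623104 figure)

which is why WAN-oriented guides recommend much larger maximums than a LAN strictly needs.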
> Your net.core.netdev_max_backlog is 5x bigger than mine. I think I'll increase my setting to 250000 as well.
>
> Our issue looks like http://tracker.ceph.com/issues/9844 and my crash looks like http://tracker.ceph.com/issues/9788
>
> On Fri, Dec 5, 2014 at 5:35 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> >
> > Jake,
> >
> > Very useful indeed.
> >
> > It looks like I had a similar problem regarding the heartbeat and, as you have mentioned, I've not seen such issues on Firefly. However, I've not seen any osd crashes.
> >
> > Could you please let me know where you got the sysctl.conf tunings from? Was it recommended by the network vendor?
> >
> > Also, did you make similar sysctl.conf changes to your host servers?
> >
> > A while ago I read the tuning guide for IP over InfiniBand, and Mellanox recommends setting something like this:
> >
> > net.ipv4.tcp_timestamps = 0
> > net.ipv4.tcp_sack = 1
> > net.core.netdev_max_backlog = 250000
> > net.core.rmem_max = 4194304
> > net.core.wmem_max = 4194304
> > net.core.rmem_default = 4194304
> > net.core.wmem_default = 4194304
> > net.core.optmem_max = 4194304
> > net.ipv4.tcp_rmem = 4096 87380 4194304
> > net.ipv4.tcp_wmem = 4096 65536 4194304
> > net.ipv4.tcp_mem = 4194304 4194304 4194304
> > net.ipv4.tcp_low_latency = 1
> >
> > which is what I have. Not sure if these are optimal.
> >
> > I can see that the values are pretty conservative compared to yours. I guess my values should be different, as I am running a 40gbit/s network with ipoib. The actual throughput on ipoib is about 20gbit/s according to iperf and the like.
> >
> > Andrei
> >
> > > From: "Jake Young" <jak3...@gmail.com>
> > > To: "Andrei Mikhailovsky" <and...@arhont.com>
> > > Cc: ceph-users@lists.ceph.com
> > > Sent: Thursday, 4 December, 2014 4:57:47 PM
> > > Subject: Re: [ceph-users] Giant osd problems - loss of IO
> > >
> > > On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
> > > >
> > > > Any other suggestions why several osds are going down on Giant and causing IO to stall? This was not happening on Firefly.
> > > >
> > > > Thanks
> > >
> > > I had a very similar problem to yours, which started after upgrading from Firefly to Giant and then adding two new osd nodes, with 7 osds on each.
> > >
> > > My cluster originally had 4 nodes, with 7 osds on each node, 28 osds total, running Giant. I did not have any problems at this time.
> > >
> > > My problems started after adding the two new nodes, so I had 6 nodes and 42 total osds. It would run fine on low load, but when the request load increased, osds started to fall over.
> > >
> > > I was able to set debug_ms to 10 and capture the logs from a failed OSD. There were a few different reasons the osds were going down. This example shows one terminating normally, for an unspecified reason, a minute after it notices it is marked down in the map.
> > >
> > > Osd 25 actually marks this osd (osd 35) down. For some reason many osds cannot communicate with each other.
> > >
> > > There are other examples where I see the "heartbeat_check: no reply from osd.blah" message for long periods of time (hours) and neither osd crashes nor terminates.
> > >
> > > 2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:46.772608)
> > > 2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:47.772759)
> > > 2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:48.772982)
> > > 2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01 16:26:49.559891)
> > > 2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01 16:26:49.773173)
> > > 2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01 16:26:50.773299)
> > > 2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01 16:26:51.261554)
> > > 2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:51.773504)
> > > 2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:52.773733)
> > > 2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:53.773876)
> > > 2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01 16:26:54.163366)
> > > 2014-12-01 16:27:14.507632 7f8b4fb7f700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.5:6802/2755 pipe(0x2af06940 sd=57 :51521 s=2 pgs=384 cs=1 l=0 c=0x2af094a0).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.511704 7f8b37af1700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.2:6812/34015988 pipe(0x2af06c00 sd=69 :41512 s=2 pgs=38842 cs=1 l=0 c=0x2af09600).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.511966 7f8b5030c700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.4:6802/40022302 pipe(0x30cbcdc0 sd=93 :6802 s=2 pgs=66722 cs=3 l=0 c=0x2af091e0).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.514744 7f8b548a5700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.2:6800/9016639 pipe(0x2af04dc0 sd=38 :60965 s=2 pgs=11747 cs=1 l=0 c=0x2af086e0).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.516712 7f8b349c7700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.2:6802/25277 pipe(0x2b04cc00 sd=166 :6802 s=2 pgs=62 cs=1 l=0 c=0x2b043080).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.516814 7f8b2bd3b700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.4:6804/16770 pipe(0x30cbd600 sd=79 :6802 s=2 pgs=607 cs=3 l=0 c=0x2af08c60).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.518439 7f8b2a422700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.5:6806/31172 pipe(0x30cbc840 sd=28 :6802 s=2 pgs=22 cs=1 l=0 c=0x3041f5a0).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.518883 7f8b589ba700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.1:6803/4031631 pipe(0x2af042c0 sd=32 :58296 s=2 pgs=35500 cs=3 l=0 c=0x2af08160).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.519271 7f8b5040d700 0 -- 172.1.2.6:6802/5210 >> 172.1.2.2:6816/32028847 pipe(0x2af05e40 sd=49 :54016 s=2 pgs=30500 cs=1 l=0 c=0x2af08f20).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:14.774081 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01 16:26:54.774073)
> > > 2014-12-01 16:27:15.774290 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01 16:26:55.774281)
> > > 2014-12-01 16:27:16.774480 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01 16:26:56.774471)
> > > 2014-12-01 16:27:17.774670 7f8b642d1700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01 16:26:57.774661)
> > > 2014-12-01 16:27:18.264884 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check: no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01 16:26:58.264882)
> > > 2014-12-01 16:27:18.268852 7f8b4c220700 0 log_channel(default) log [WRN] : map e79681 wrongly marked me down
> > > 2014-12-01 16:27:22.290362 7f8b37df4700 0 -- 172.1.2.6:6806/1005210 >> 172.1.2.4:6804/16770 pipe(0x2c2ef8c0 sd=75 :44216 s=2 pgs=632 cs=1 l=0 c=0x1bce6940).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:22.866677 7f8b56662700 0 -- 172.1.2.6:6806/1005210 >> 172.1.2.2:6808/9013936 pipe(0x2f5e4840 sd=55 :41925 s=2 pgs=15111 cs=1 l=0 c=0x30828580).fault with nothing to send, going to standby
> > > 2014-12-01 16:27:24.854642 7f8b2ed6b700 0 -- 172.1.2.6:6806/1005210 >> 172.1.2.6:6816/62000664 pipe(0x2c3c78c0 sd=206 :6806 s=0 pgs=0 cs=0 l=0 c=0x2ccb3e40).accept connect_seq 0 vs existing 0 state connecting
> > > 2014-12-01 16:27:25.265306 7f8b2af2d700 0 -- 172.1.1.6:6806/5210 >> 172.1.1.54:0/586219983 pipe(0x2d46bc80 sd=246 :6806 s=0 pgs=0 cs=0 l=0 c=0x1bdacdc0).accept peer addr is really 172.1.1.54:0/586219983 (socket is 172.1.1.54:36423/0)
> > > 2014-12-01 16:28:45.732468 7f8b368df700 -1 osd.35 79691 *** Got signal Terminated ***
> > > 2014-12-01 16:28:45.732591 7f8b368df700 0 osd.35 79691 prepare_to_stop telling mon we are shutting down
> > > 2014-12-01 16:28:46.586316 7f8b2236a700 0 -- 172.1.2.6:6806/1005210 >> 172.1.2.1:6807/91014386 pipe(0x1cabb700 sd=32 :53651 s=2 pgs=37459 cs=1 l=0 c=0x1bce5e40).fault with nothing to send, going to standby
> > > 2014-12-01 16:28:46.593615 7f8b4c220700 0 osd.35 79691 got_stop_ack starting shutdown
> > > 2014-12-01 16:28:46.593662 7f8b368df700 0 osd.35 79691 prepare_to_stop starting shutdown
> > > 2014-12-01 16:28:46.593682 7f8b368df700 -1 osd.35 79691 shutdown
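A note on the heartbeat failures above: how long an osd waits for a peer's heartbeat before reporting it down is controlled by the "osd heartbeat grace" option, which defaults to 20 seconds. As a sketch (the value 30 here is an assumption, not a recommendation), it can be raised in ceph.conf to reduce flapping while the root cause is investigated, though it only masks an underlying network problem:

[osd]
# give peers 30s instead of the default 20s before reporting them down
osd heartbeat grace = 30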
> > > Another example I found shows this same osd crashing with "hit suicide timeout".
> > >
> > > -4> 2014-12-01 15:50:52.333350 7fecda368700 10 -- 172.1.2.6:6805/1031451 >> 172.1.2.1:0/32541 pipe(0x2b70a680 sd=104 :6805 s=2 pgs=2692 cs=1 l=1 c=0x2d754680).writer: state = open policy.server=1
> > > -3> 2014-12-01 15:50:52.333348 7fecd9d62700 10 -- 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94 :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open policy.server=1
> > > -2> 2014-12-01 15:50:52.333369 7fecd9d62700 10 -- 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94 :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).write_ack 10
> > > -1> 2014-12-01 15:50:52.333386 7fecd9d62700 10 -- 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94 :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open policy.server=1
> > > 0> 2014-12-01 15:50:52.531714 7fed13221700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fed13221700 time 2014-12-01 15:50:52.508265
> > > common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> > >
> > > ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> > > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb8231b]
> > > 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xac0e19]
> > > 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
> > > 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
> > > 5: (CephContextServiceThread::entry()+0x154) [0xb96844]
> > > 6: (()+0x8182) [0x7fed175c7182]
> > > 7: (clone()+0x6d) [0x7fed15b31fbd]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > --- logging levels ---
> > > 0/ 5 none
> > > 0/ 0 lockdep
> > > 0/ 0 context
> > > 0/ 0 crush
> > > 1/ 5 mds
> > > 1/ 5 mds_balancer
> > > 1/ 5 mds_locker
> > > 1/ 5 mds_log
> > > 1/ 5 mds_log_expire
> > > 1/ 5 mds_migrator
> > > 0/ 0 buffer
> > > 0/ 0 timer
> > > 0/ 1 filer
> > > 0/ 1 striper
> > > 0/ 1 objecter
> > > 0/ 5 rados
> > > 0/ 5 rbd
> > > 0/ 5 rbd_replay
> > > 0/ 0 journaler
> > > 0/ 5 objectcacher
> > > 0/ 5 client
> > > 0/ 0 osd
> > > 0/ 0 optracker
> > > 0/ 0 objclass
> > > 0/ 0 filestore
> > > 1/ 3 keyvaluestore
> > > 0/ 0 journal
> > > 10/10 ms
> > > 1/ 5 mon
> > > 0/ 0 monc
> > > 1/ 5 paxos
> > > 0/ 0 tp
> > > 0/ 0 auth
> > > 1/ 5 crypto
> > > 0/ 0 finisher
> > > 0/ 0 heartbeatmap
> > > 0/ 0 perfcounter
> > > 1/ 5 rgw
> > > 1/10 civetweb
> > > 1/ 5 javaclient
> > > 0/ 0 asok
> > > 0/ 0 throttle
> > > 0/ 0 refs
> > > -2/-2 (syslog threshold)
> > > -1/-1 (stderr threshold)
> > > max_recent 10000
> > > max_new 1000
> > > log_file /var/log/ceph/ceph-osd.35.log
> > > --- end dump of recent events ---
> > >
> > > 2014-12-01 15:50:52.627789 7fed13221700 -1 *** Caught signal (Aborted) **
> > > in thread 7fed13221700
> > >
> > > ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> > > 1: /usr/bin/ceph-osd() [0xa9767a]
> > > 2: (()+0x10340) [0x7fed175cf340]
> > > 3: (gsignal()+0x39) [0x7fed15a6dbb9]
> > > 4: (abort()+0x148) [0x7fed15a70fc8]
> > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fed163796b5]
> > > 6: (()+0x5e836) [0x7fed16377836]
> > > 7: (()+0x5e863) [0x7fed16377863]
> > > 8: (()+0x5eaa2) [0x7fed16377aa2]
> > > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xb82508]
> > > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2a9) [0xac0e19]
> > > 11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
> > > 12: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
> > > 13: (CephContextServiceThread::entry()+0x154) [0xb96844]
> > > 14: (()+0x8182) [0x7fed175c7182]
> > > 15: (clone()+0x6d) [0x7fed15b31fbd]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > --- logging levels ---
> > > 0/ 5 none
> > > 0/ 0 lockdep
> > > 0/ 0 context
> > > 0/ 0 crush
> > > 1/ 5 mds
> > > 1/ 5 mds_balancer
> > > 1/ 5 mds_locker
> > > 1/ 5 mds_log
> > > 1/ 5 mds_log_expire
> > > 1/ 5 mds_migrator
> > > 0/ 0 buffer
> > > 0/ 0 timer
> > > 0/ 1 filer
> > > 0/ 1 striper
> > > 0/ 1 objecter
> > > 0/ 5 rados
> > > 0/ 5 rbd
> > > 0/ 5 rbd_replay
> > > 0/ 0 journaler
> > > 0/ 5 objectcacher
> > > 0/ 5 client
> > > 0/ 0 osd
> > > 0/ 0 optracker
> > > 0/ 0 objclass
> > > 0/ 0 filestore
> > > 1/ 3 keyvaluestore
> > > 0/ 0 journal
> > > 10/10 ms
> > > 1/ 5 mon
> > > 0/ 0 monc
> > > 1/ 5 paxos
> > > 0/ 0 tp
> > > 0/ 0 auth
> > > 1/ 5 crypto
> > > 0/ 0 finisher
> > > 0/ 0 heartbeatmap
> > > 0/ 0 perfcounter
> > > 1/ 5 rgw
> > > 1/10 civetweb
> > > 1/ 5 javaclient
> > > 0/ 0 asok
> > > 0/ 0 throttle
> > > 0/ 0 refs
> > > -2/-2 (syslog threshold)
> > > -1/-1 (stderr threshold)
> > > max_recent 10000
> > > max_new 1000
> > > log_file /var/log/ceph/ceph-osd.35.log
> > > --- end dump of recent events ---
> > >
> > > The recurring theme here is that there is a communication issue between the osds.
> > >
> > > I looked carefully at my network hardware configuration (UCS C240s with 40Gbps Cisco VICs connected to a pair of Nexus 5672s using A-FEX Port Profile configuration) and couldn't find any dropped packets or errors.
> > >
> > > I ran "ss -s" for the first time on my osds and was a bit surprised to see how many open TCP connections they all have.
> > >
> > > ceph@osd6:/var/log/ceph$ ss -s
> > > Total: 1492 (kernel 0)
> > > TCP: 1411 (estab 1334, closed 40, orphaned 0, synrecv 0, timewait 0/0), ports 0
> > >
> > > Transport Total IP   IPv6
> > > *         0     -    -
> > > RAW       0     0    0
> > > UDP       10    4    6
> > > TCP       1371  1369 2
> > > INET      1381  1373 8
> > > FRAG      0     0    0
> > >
> > > While researching whether additional kernel tuning would be required to handle so many connections, I eventually realized that I had forgotten to copy my customized /etc/sysctl.conf file to the two new nodes. I'm not sure if the large number of TCP connections is part of the performance enhancements between Giant and Firefly, or if Firefly uses a similar number of connections.
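For a per-node view of how many of those sockets belong to the osd daemons, something like this works (assuming iproute2's ss is available; "ceph-osd" is the usual daemon process name):

# count established TCP sockets owned by ceph-osd processes
ss -tnp | grep -c ceph-osd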
> > > I highlighted the tuning parameters that I suspect are required.
> > >
> > > ceph@osd6:/var/log/ceph$ cat /etc/sysctl.conf
> > > #
> > > # /etc/sysctl.conf - Configuration file for setting system variables
> > > # See /etc/sysctl.d/ for additional system variables
> > > # See sysctl.conf (5) for information.
> > > #
> > > # Increase Linux autotuning TCP buffer limits
> > > # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
> > > # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
> > > net.core.rmem_max = 56623104
> > > net.core.wmem_max = 56623104
> > > net.core.rmem_default = 56623104
> > > net.core.wmem_default = 56623104
> > > net.core.optmem_max = 40960
> > > net.ipv4.tcp_rmem = 4096 87380 56623104
> > > net.ipv4.tcp_wmem = 4096 65536 56623104
> > > # Make room for more TIME_WAIT sockets due to more clients,
> > > # and allow them to be reused if we run out of sockets
> > > # Also increase the max packet backlog
> > > net.core.somaxconn = 1024
> > > net.core.netdev_max_backlog = 50000
> > > net.ipv4.tcp_max_syn_backlog = 30000
> > > net.ipv4.tcp_max_tw_buckets = 2000000
> > > net.ipv4.tcp_tw_reuse = 1
> > > net.ipv4.tcp_fin_timeout = 10
> > > # Disable TCP slow start on idle connections
> > > net.ipv4.tcp_slow_start_after_idle = 0
> > > # If your servers talk UDP, also up these limits
> > > net.ipv4.udp_rmem_min = 8192
> > > net.ipv4.udp_wmem_min = 8192
> > > # Disable source routing and redirects
> > > net.ipv4.conf.all.send_redirects = 0
> > > net.ipv4.conf.all.accept_redirects = 0
> > > net.ipv4.conf.all.accept_source_route = 0
> > >
> > > I added the net.core.somaxconn after this experience, since the default is 128. This represents the allowed socket backlog in the kernel, which should help when I reboot an osd node and 1300 connections need to be made quickly.
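The effective value can be confirmed at runtime; note that the kernel silently caps each listening socket's backlog at net.core.somaxconn, so a daemon asking listen() for more still gets at most this value:

sysctl net.core.somaxconn
cat /proc/sys/net/core/somaxconn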
> > > I found that I needed to restart my osds after applying the kernel tuning above for my cluster to stabilize.
> > >
> > > My system is now stable again and performs very well.
> > >
> > > I hope this helps someone out. It took me a few days of troubleshooting to get to the bottom of this.
> > >
> > > Jake