Jake, 

Very useful indeed. 

It looks like I had a similar problem regarding the heartbeat and, as you have 
mentioned, I've not seen such issues on Firefly. However, I've not seen any osd 
crashes. 

Could you please let me know where you got the sysctl.conf tunings from? Were 
they recommended by the network vendor? 

Also, did you make similar sysctl.conf changes to your host servers? 

A while ago I read the tuning guide for IP over InfiniBand, and Mellanox 
recommends setting something like this: 

net.ipv4.tcp_timestamps = 0 
net.ipv4.tcp_sack = 1 
net.core.netdev_max_backlog = 250000 
net.core.rmem_max = 4194304 
net.core.wmem_max = 4194304 
net.core.rmem_default = 4194304 
net.core.wmem_default = 4194304 
net.core.optmem_max = 4194304 
net.ipv4.tcp_rmem = 4096 87380 4194304 
net.ipv4.tcp_wmem = 4096 65536 4194304 
net.ipv4.tcp_mem = 4194304 4194304 4194304 
net.ipv4.tcp_low_latency = 1 

which is what I have. Not sure if these are optimal. 

I can see that the values are pretty conservative compared to yours. I guess my 
values should be different, as I am running a 40Gbit/s network with IPoIB. The 
actual throughput over IPoIB is about 20Gbit/s according to iperf and the like. 
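As a rough sanity check on whether a 4 MB tcp_rmem/tcp_wmem cap is big enough, one can compute the bandwidth-delay product for the link. A minimal sketch in Python, assuming a hypothetical 0.5 ms RTT (measure your own between osd nodes):

```python
def bdp_bytes(bandwidth_gbit_s: float, rtt_ms: float) -> int:
    # Bandwidth-delay product: the number of bytes that must be in flight
    # to keep a link of this bandwidth busy for one round trip.
    return int(bandwidth_gbit_s * 1e9 / 8 * rtt_ms / 1e3)

# ~20 Gbit/s effective IPoIB throughput at an assumed 0.5 ms RTT
print(bdp_bytes(20, 0.5))  # 1250000 bytes, well under the 4194304 cap
```

If the computed BDP approaches the configured maximum, a single stream can stall on buffer space; with many parallel osd connections, per-socket needs are usually smaller than the single-stream figure.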

Andrei 

----- Original Message -----

> From: "Jake Young" <jak3...@gmail.com>
> To: "Andrei Mikhailovsky" <and...@arhont.com>
> Cc: ceph-users@lists.ceph.com
> Sent: Thursday, 4 December, 2014 4:57:47 PM
> Subject: Re: [ceph-users] Giant osd problems - loss of IO

> On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky <
> and...@arhont.com > wrote:
> >
> > Any other suggestions why several osds are going down on Giant and
> > causing IO to stall? This was not happening on Firefly.
> >
> > Thanks
> >
> >

> I had a very similar problem to yours, which started after upgrading
> from Firefly to Giant; later I added two new osd nodes, with
> 7 osds on each.

> My cluster originally had 4 nodes, with 7 osds on each node, 28 osds
> total, running Giant. I did not have any problems at this time.

> My problems started after adding two new nodes, so I had 6 nodes and
> 42 total osds. It would run fine under low load, but when the request
> load increased, osds started to fall over.

> I was able to set the debug_ms to 10 and capture the logs from a
> failed OSD. There were a few different reasons the osds were going
> down. This example shows it terminating normally for an unspecified
> reason a minute after it notices it is marked down in the map.

> Osd 25 actually marks this osd (osd 35) down. For some reason many
> osds cannot communicate with each other.

> There are other examples where I see the "heartbeat_check: no reply
> from osd.blah" message for long periods of time (hours) and neither
> osd crashes nor terminates.

> 2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
> 16:26:46.772608)
> 2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
> 16:26:47.772759)
> 2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
> 16:26:48.772982)
> 2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff 2014-12-01
> 16:26:49.559891)
> 2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01
> 16:26:49.773173)
> 2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01
> 16:26:50.773299)
> 2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff 2014-12-01
> 16:26:51.261554)
> 2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
> 16:26:51.773504)
> 2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
> 16:26:52.773733)
> 2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
> 16:26:53.773876)
> 2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff 2014-12-01
> 16:26:54.163366)
> 2014-12-01 16:27:14.507632 7f8b4fb7f700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.5:6802/2755 pipe(0x2af06940 sd=57 :51521 s=2 pgs=384 cs=1
> l=0 c=0x2af094a0).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.511704 7f8b37af1700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.2:6812/34015988 pipe(0x2af06c00 sd=69 :41512 s=2 pgs=38842
> cs=1 l=0 c=0x2af09600).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.511966 7f8b5030c700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.4:6802/40022302 pipe(0x30cbcdc0 sd=93 :6802 s=2 pgs=66722
> cs=3 l=0 c=0x2af091e0).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.514744 7f8b548a5700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.2:6800/9016639 pipe(0x2af04dc0 sd=38 :60965 s=2 pgs=11747
> cs=1 l=0 c=0x2af086e0).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.516712 7f8b349c7700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.2:6802/25277 pipe(0x2b04cc00 sd=166 :6802 s=2 pgs=62 cs=1
> l=0 c=0x2b043080).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.516814 7f8b2bd3b700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.4:6804/16770 pipe(0x30cbd600 sd=79 :6802 s=2 pgs=607 cs=3
> l=0 c=0x2af08c60).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.518439 7f8b2a422700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.5:6806/31172 pipe(0x30cbc840 sd=28 :6802 s=2 pgs=22 cs=1 l=0
> c=0x3041f5a0).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.518883 7f8b589ba700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.1:6803/4031631 pipe(0x2af042c0 sd=32 :58296 s=2 pgs=35500
> cs=3 l=0 c=0x2af08160).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.519271 7f8b5040d700 0 -- 172.1.2.6:6802/5210 >>
> 172.1.2.2:6816/32028847 pipe(0x2af05e40 sd=49 :54016 s=2 pgs=30500
> cs=1 l=0 c=0x2af08f20).fault with nothing to send, going to standby
> 2014-12-01 16:27:14.774081 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01
> 16:26:54.774073)
> 2014-12-01 16:27:15.774290 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01
> 16:26:55.774281)
> 2014-12-01 16:27:16.774480 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01
> 16:26:56.774471)
> 2014-12-01 16:27:17.774670 7f8b642d1700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01
> 16:26:57.774661)
> 2014-12-01 16:27:18.264884 7f8b3b1fe700 -1 osd.35 79679
> heartbeat_check: no reply from osd.25 since back 2014-12-01
> 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff 2014-12-01
> 16:26:58.264882)
> 2014-12-01 16:27:18.268852 7f8b4c220700 0 log_channel(default) log
> [WRN] : map e79681 wrongly marked me down
> 2014-12-01 16:27:22.290362 7f8b37df4700 0 -- 172.1.2.6:6806/1005210
> >> 172.1.2.4:6804/16770 pipe(0x2c2ef8c0 sd=75 :44216 s=2 pgs=632
> cs=1 l=0 c=0x1bce6940).fault with nothing to send, going to standby
> 2014-12-01 16:27:22.866677 7f8b56662700 0 -- 172.1.2.6:6806/1005210
> >> 172.1.2.2:6808/9013936 pipe(0x2f5e4840 sd=55 :41925 s=2 pgs=15111
> cs=1 l=0 c=0x30828580).fault with nothing to send, going to standby
> 2014-12-01 16:27:24.854642 7f8b2ed6b700 0 -- 172.1.2.6:6806/1005210
> >> 172.1.2.6:6816/62000664 pipe(0x2c3c78c0 sd=206 :6806 s=0 pgs=0
> cs=0 l=0 c=0x2ccb3e40).accept connect_seq 0 vs existing 0 state connecting
> 2014-12-01 16:27:25.265306 7f8b2af2d700 0 -- 172.1.1.6:6806/5210 >>
> 172.1.1.54:0/586219983 pipe(0x2d46bc80 sd=246 :6806 s=0 pgs=0 cs=0
> l=0 c=0x1bdacdc0).accept peer addr is really 172.1.1.54:0/586219983
> (socket is 172.1.1.54:36423/0 )
> 2014-12-01 16:28:45.732468 7f8b368df700 -1 osd.35 79691 *** Got
> signal Terminated ***
> 2014-12-01 16:28:45.732591 7f8b368df700 0 osd.35 79691
> prepare_to_stop telling mon we are shutting down
> 2014-12-01 16:28:46.586316 7f8b2236a700 0 -- 172.1.2.6:6806/1005210
> >> 172.1.2.1:6807/91014386 pipe(0x1cabb700 sd=32 :53651 s=2
> pgs=37459 cs=1 l=0 c=0x1bce5e40).fault with nothing to send, going
> to standby
> 2014-12-01 16:28:46.593615 7f8b4c220700 0 osd.35 79691 got_stop_ack
> starting shutdown
> 2014-12-01 16:28:46.593662 7f8b368df700 0 osd.35 79691
> prepare_to_stop starting shutdown
> 2014-12-01 16:28:46.593682 7f8b368df700 -1 osd.35 79691 shutdown
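The heartbeat_check lines above follow a simple rule: the cutoff timestamp is always 20 s behind the log timestamp, matching Ceph's default osd_heartbeat_grace of 20 s. A minimal sketch of that check, using timestamps taken from the first log line above (the real code lives in the osd heartbeat path; this is only an illustration):

```python
from datetime import datetime, timedelta

GRACE = timedelta(seconds=20)  # default osd_heartbeat_grace

def heartbeat_failed(now, last_back, last_front, grace=GRACE):
    # An osd logs heartbeat_check when either the back or the front
    # reply from a peer is older than the cutoff (now - grace).
    cutoff = now - grace
    return last_back < cutoff or last_front < cutoff

now = datetime(2014, 12, 1, 16, 27, 6, 772616)
back = datetime(2014, 12, 1, 16, 25, 51, 310319)   # stale back reply
front = datetime(2014, 12, 1, 16, 27, 6, 56972)    # recent front reply
print(heartbeat_failed(now, back, front))  # True: the stale back channel alone trips it
```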

> Another example I found shows this same osd crashing with "hit
> suicide timeout".

> -4> 2014-12-01 15:50:52.333350 7fecda368700 10 --
> 172.1.2.6:6805/1031451 >> 172.1.2.1:0/32541 pipe(0x2b70a680 sd=104
> :6805 s=2 pgs=2692 cs=1 l=1 c=0x2d754680).writer: state = open
> policy.server=1
> -3> 2014-12-01 15:50:52.333348 7fecd9d62700 10 --
> 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94
> :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open
> policy.server=1
> -2> 2014-12-01 15:50:52.333369 7fecd9d62700 10 --
> 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94
> :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).write_ack 10
> -1> 2014-12-01 15:50:52.333386 7fecd9d62700 10 --
> 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94
> :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open
> policy.server=1
> 0> 2014-12-01 15:50:52.531714 7fed13221700 -1 common/HeartbeatMap.cc:
> In function 'bool
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
> time_t)' thread 7fed13221700 time 2014-12-01 15:50:52.508265
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

> ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xb8231b]
> 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x2a9) [0xac0e19]
> 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
> 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
> 5: (CephContextServiceThread::entry()+0x154) [0xb96844]
> 6: (()+0x8182) [0x7fed175c7182]
> 7: (clone()+0x6d) [0x7fed15b31fbd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.

> --- logging levels ---
> 0/ 5 none
> 0/ 0 lockdep
> 0/ 0 context
> 0/ 0 crush

> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 0 buffer
> 0/ 0 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 rbd_replay
> 0/ 0 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 0 osd
> 0/ 0 optracker
> 0/ 0 objclass
> 0/ 0 filestore
> 1/ 3 keyvaluestore
> 0/ 0 journal
> 10/10 ms
> 1/ 5 mon
> 0/ 0 monc
> 1/ 5 paxos
> 0/ 0 tp
> 0/ 0 auth
> 1/ 5 crypto
> 0/ 0 finisher
> 0/ 0 heartbeatmap
> 0/ 0 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 0/ 0 asok
> 0/ 0 throttle
> 0/ 0 refs
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.35.log
> --- end dump of recent events ---
> 2014-12-01 15:50:52.627789 7fed13221700 -1 *** Caught signal
> (Aborted) **
> in thread 7fed13221700

> ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> 1: /usr/bin/ceph-osd() [0xa9767a]
> 2: (()+0x10340) [0x7fed175cf340]
> 3: (gsignal()+0x39) [0x7fed15a6dbb9]
> 4: (abort()+0x148) [0x7fed15a70fc8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fed163796b5]
> 6: (()+0x5e836) [0x7fed16377836]
> 7: (()+0x5e863) [0x7fed16377863]
> 8: (()+0x5eaa2) [0x7fed16377aa2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x278) [0xb82508]
> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x2a9) [0xac0e19]
> 11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
> 12: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
> 13: (CephContextServiceThread::entry()+0x154) [0xb96844]
> 14: (()+0x8182) [0x7fed175c7182]
> 15: (clone()+0x6d) [0x7fed15b31fbd]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.

> --- logging levels ---
> 0/ 5 none
> 0/ 0 lockdep
> 0/ 0 context

> 0/ 0 crush
> 1/ 5 mds
> 1/ 5 mds_balancer
> 1/ 5 mds_locker
> 1/ 5 mds_log
> 1/ 5 mds_log_expire
> 1/ 5 mds_migrator
> 0/ 0 buffer
> 0/ 0 timer
> 0/ 1 filer
> 0/ 1 striper
> 0/ 1 objecter
> 0/ 5 rados
> 0/ 5 rbd
> 0/ 5 rbd_replay
> 0/ 0 journaler
> 0/ 5 objectcacher
> 0/ 5 client
> 0/ 0 osd
> 0/ 0 optracker
> 0/ 0 objclass
> 0/ 0 filestore
> 1/ 3 keyvaluestore

> 0/ 0 journal
> 10/10 ms
> 1/ 5 mon
> 0/ 0 monc
> 1/ 5 paxos
> 0/ 0 tp
> 0/ 0 auth
> 1/ 5 crypto
> 0/ 0 finisher
> 0/ 0 heartbeatmap
> 0/ 0 perfcounter
> 1/ 5 rgw
> 1/10 civetweb
> 1/ 5 javaclient
> 0/ 0 asok
> 0/ 0 throttle
> 0/ 0 refs
> -2/-2 (syslog threshold)
> -1/-1 (stderr threshold)
> max_recent 10000
> max_new 1000
> log_file /var/log/ceph/ceph-osd.35.log
> --- end dump of recent events ---

> The recurring theme here is that there is a communication issue
> between the osds.

> I looked carefully at my network hardware configuration (UCS C240s
> with 40Gbps Cisco VICs connected to a pair of Nexus 5672 using A-FEX
> Port Profile configuration) and couldn't find any dropped packets or
> errors.

> I ran "ss -s" for the first time on my osds and was a bit surprised to
> see how many open TCP connections they all have.

> ceph@osd6:/var/log/ceph$ ss -s
> Total: 1492 (kernel 0)
> TCP: 1411 (estab 1334, closed 40, orphaned 0, synrecv 0, timewait
> 0/0), ports 0

> Transport  Total  IP     IPv6
> *          0      -      -
> RAW        0      0      0
> UDP        10     4      6
> TCP        1371   1369   2
> INET       1381   1373   8
> FRAG       0      0      0

> While researching whether additional kernel tuning would be required to
> handle so many connections, I eventually realized that I had forgotten to
> copy my customized /etc/sysctl.conf file to the two new nodes. I'm
> not sure if the large number of TCP connections is part of the
> performance enhancements between Giant and Firefly, or if Firefly
> uses a similar number of connections.

> I highlighted the tuning parameters that I suspect are required.

> ceph@osd6:/var/log/ceph$ cat /etc/sysctl.conf
> #
> # /etc/sysctl.conf - Configuration file for setting system variables
> # See /etc/sysctl.d/ for additional system variables
> # See sysctl.conf (5) for information.
> #

> # Increase Linux autotuning TCP buffer limits
> # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
> # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
> net.core.rmem_max = 56623104
> net.core.wmem_max = 56623104
> net.core.rmem_default = 56623104
> net.core.wmem_default = 56623104
> net.core.optmem_max = 40960
> net.ipv4.tcp_rmem = 4096 87380 56623104
> net.ipv4.tcp_wmem = 4096 65536 56623104

> # Make room for more TIME_WAIT sockets due to more clients,
> # and allow them to be reused if we run out of sockets
> # Also increase the max packet backlog
> net.core.somaxconn = 1024
> net.core.netdev_max_backlog = 50000
> net.ipv4.tcp_max_syn_backlog = 30000
> net.ipv4.tcp_max_tw_buckets = 2000000
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.tcp_fin_timeout = 10

> # Disable TCP slow start on idle connections
> net.ipv4.tcp_slow_start_after_idle = 0

> # If your servers talk UDP, also up these limits
> net.ipv4.udp_rmem_min = 8192
> net.ipv4.udp_wmem_min = 8192

> # Disable source routing and redirects
> net.ipv4.conf.all.send_redirects = 0
> net.ipv4.conf.all.accept_redirects = 0
> net.ipv4.conf.all.accept_source_route = 0

> I added net.core.somaxconn after this experience, since the
> default is 128. This represents the allowed socket backlog in the
> kernel, which should help when I reboot an osd node and 1300
> connections need to be made quickly.
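The somaxconn point matters because the kernel silently clamps any listen() backlog to net.core.somaxconn, so a daemon asking for a large backlog still gets the small default. A sketch of the effective limit:

```python
def effective_backlog(requested: int, somaxconn: int) -> int:
    # listen(fd, requested) is silently clamped to net.core.somaxconn;
    # the daemon gets no error, just a smaller queue.
    return min(requested, somaxconn)

print(effective_backlog(1300, 128))   # 128 with the stock default
print(effective_backlog(1300, 1024))  # 1024 after raising somaxconn
```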

> I found that I needed to restart my osds after applying the kernel
> tuning above for my cluster to stabilize.

> My system is now stable again and performs very well.

> I hope this helps someone out. It took me a few days of
> troubleshooting to get to the bottom of this.

> Jake
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
