Following Jake's recommendation I have updated my sysctl.conf file, and it seems
to have helped with the problem of osds being marked down by their osd peers;
it has now been three days without a recurrence. I am currently using the
following settings in sysctl.conf:

# Increase Linux autotuning TCP buffer limits 
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE 
# Don't set tcp_mem itself! Let the kernel scale it based on RAM. 
net.core.rmem_max = 134217728 
net.core.wmem_max = 134217728 
net.core.rmem_default = 134217728 
net.core.wmem_default = 134217728 
net.core.optmem_max = 134217728 
net.ipv4.tcp_rmem = 4096 87380 67108864 
net.ipv4.tcp_wmem = 4096 65536 67108864 

# Make room for more TIME_WAIT sockets due to more clients, 
# and allow them to be reused if we run out of sockets 
# Also increase the max packet backlog 
net.core.somaxconn = 1024 
net.core.netdev_max_backlog = 250000 
net.ipv4.tcp_max_syn_backlog = 30000 
net.ipv4.tcp_max_tw_buckets = 2000000 
net.ipv4.tcp_tw_reuse = 1 
net.ipv4.tcp_fin_timeout = 10 

# Disable TCP slow start on idle connections 
net.ipv4.tcp_slow_start_after_idle = 0 

# If your servers talk UDP, also up these limits 
net.ipv4.udp_rmem_min = 8192 
net.ipv4.udp_wmem_min = 8192 

# Disable source routing and redirects 
net.ipv4.conf.all.send_redirects = 0 
net.ipv4.conf.all.accept_redirects = 0 
net.ipv4.conf.all.accept_source_route = 0 

# Mellanox recommended changes 
net.ipv4.tcp_timestamps = 0 
net.ipv4.tcp_sack = 1 
net.ipv4.tcp_low_latency = 1 
net.ipv4.tcp_adv_win_scale = 1 
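
To apply the changes without a reboot I just reload the file and then
spot-check a couple of the values (standard sysctl usage, nothing
ceph-specific):

sudo sysctl -p /etc/sysctl.conf
sysctl net.core.rmem_max net.ipv4.tcp_rmem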

Jake, thanks for your suggestions. 

Andrei 
----- Original Message -----

> From: "Jake Young" <jak3...@gmail.com>
> To: "Andrei Mikhailovsky" <and...@arhont.com>,
> ceph-users@lists.ceph.com
> Sent: Saturday, 6 December, 2014 5:02:15 PM
> Subject: Re: [ceph-users] Giant osd problems - loss of IO

> Forgot to copy the list.

> > I basically cobbled together the settings from examples on the
> > internet.
> 

> > I basically modified this sysctl.conf example, following its
> > suggestions for 10Gb NICs:
> 
> > http://www.nateware.com/linux-network-tuning-for-2013.html#.VIG_44eLTII
> 

> > I found these sites helpful as well:
> 

> > http://fasterdata.es.net/host-tuning/linux/
> 

> > This may be of interest to you, it has suggestions for your
> > Mellanox
> > hardware:
> > https://fasterdata.es.net/host-tuning/nic-tuning/mellanox-connectx-3/
> 

> > Fermilab website, with a link to a university research paper:
> 
> > https://indico.fnal.gov/getFile.py/access?contribId=30&sessionId=19&resId=0&materialId=paper&confId=3377
> 

> > This has a great answer that explains different configurations for
> > servers vs clients. It seems to me that osds are both servers and
> > clients, so maybe some of the client tuning would benefit osds as
> > well. This is where I got the somaxconn setting from.
> 
> > http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
> 

> > I forgot to mention, I'm also setting the txqueuelen for my ceph
> > public nic and ceph private nic in the /etc/rc.local file:
> 

> > /sbin/ifconfig eth0 txqueuelen 10000
> 
> > /sbin/ifconfig eth1 txqueuelen 10000
> 
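
A note from my side: on hosts without ifconfig, the iproute2
equivalent of the two lines above should be (assuming the same
interface names):

/sbin/ip link set dev eth0 txqueuelen 10000
/sbin/ip link set dev eth1 txqueuelen 10000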

> > I do push the same sysctl.conf and rc.local to all of my clients as
> > well. The clients are iSCSI servers which serve VMware hosts. My
> > ceph cluster is rbd only, and I currently only have the iSCSI proxy
> > server clients. We'll be adding some KVM hypervisors soon; I'm
> > interested to see how they perform vs my VMware --> iSCSI server -->
> > Ceph setup.
> 

> > Regarding your sysctl.conf file:
> 

> > I've read on a few different sites that net.ipv4.tcp_mem should not
> > be tuned, since the defaults are good. I have not set it, and I
> > can't speak to the benefit/problems with setting it.
> 

> > You're configured to use only a 4MB TCP buffer, which is very
> > small; it is actually smaller than the default for tcp_wmem, which
> > is 6MB. The link above suggests up to a 128MB TCP buffer for the
> > 40Gb Mellanox and/or 10Gb over a WAN (not sure how to read that).
> > I'm using a 54MB buffer, but I may increase mine to 128MB to see if
> > there is any benefit. That 4MB buffer may be your problem.
> 
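
For the byte values: these are binary megabytes, so for example

echo $((54*1024*1024)) $((128*1024*1024))
# prints: 56623104 134217728

which is where the 56623104 in Jake's file and the 134217728 in my
file above come from.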

> > Your net.core.netdev_max_backlog is 5x bigger than mine. I think
> > I'll increase my setting to 250000 as well.
> 

> > Our issue looks like http://tracker.ceph.com/issues/9844 and my
> > crash looks like http://tracker.ceph.com/issues/9788
> 

> > On Fri, Dec 5, 2014 at 5:35 AM, Andrei Mikhailovsky <
> > and...@arhont.com > wrote:
> 

> > > Jake,
> > 
> 

> > > very useful indeed.
> > 
> 

> > > It looks like I had a similar problem regarding the heartbeat
> > > and, as you have mentioned, I've not seen such issues on Firefly.
> > > However, I've not seen any osd crashes.
> > 
> 

> > > Could you please let me know where you got the sysctl.conf
> > > tunings from? Were they recommended by the network vendor?
> > 
> 

> > > Also, did you make similar sysctl.conf changes to your host
> > > servers?
> > 
> 

> > > A while ago I read the tuning guide for IP over InfiniBand, and
> > > Mellanox recommends setting something like this:
> > 
> 

> > > net.ipv4.tcp_timestamps = 0
> > 
> 
> > > net.ipv4.tcp_sack = 1
> > 
> 
> > > net.core.netdev_max_backlog = 250000
> > 
> 
> > > net.core.rmem_max = 4194304
> > 
> 
> > > net.core.wmem_max = 4194304
> > 
> 
> > > net.core.rmem_default = 4194304
> > 
> 
> > > net.core.wmem_default = 4194304
> > 
> 
> > > net.core.optmem_max = 4194304
> > 
> 
> > > net.ipv4.tcp_rmem = 4096 87380 4194304
> > 
> 
> > > net.ipv4.tcp_wmem = 4096 65536 4194304
> > 
> 
> > > net.ipv4.tcp_mem = 4194304 4194304 4194304
> > 
> 
> > > net.ipv4.tcp_low_latency=1
> > 
> 

> > > which is what I have. Not sure if these are optimal.
> > 
> 

> > > I can see that the values are pretty conservative compared to
> > > yours. I guess my values should be different as I am running a
> > > 40Gbit/s network with IPoIB. The actual throughput over IPoIB is
> > > about 20Gbit/s according to iperf and similar tools.
> > 
> 

> > > Andrei
> > 
> 

> > > > From: "Jake Young" < jak3...@gmail.com >
> > > 
> > 
> 
> > > > To: "Andrei Mikhailovsky" < and...@arhont.com >
> > > 
> > 
> 
> > > > Cc: ceph-users@lists.ceph.com
> > > 
> > 
> 
> > > > Sent: Thursday, 4 December, 2014 4:57:47 PM
> > > 
> > 
> 
> > > > Subject: Re: [ceph-users] Giant osd problems - loss of IO
> > > 
> > 
> 

> > > > On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky <
> > > > and...@arhont.com > wrote:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Any other suggestions why several osds are going down on
> > > > > Giant
> > > > > and
> > > > > causing IO to stall? This was not happening on Firefly.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Thanks
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 

> > > > I had a very similar problem to yours, which started after
> > > > upgrading from Firefly to Giant and then later adding two new
> > > > osd nodes, with 7 osds on each.
> > > 
> > 
> 

> > > > My cluster originally had 4 nodes, with 7 osds on each node, 28
> > > > osds total, running Giant. I did not have any problems at that
> > > > time.
> > > 
> > 
> 

> > > > My problems started after adding the two new nodes, so I had 6
> > > > nodes and 42 total osds. It would run fine under low load, but
> > > > when the request load increased, osds started to fall over.
> > > 
> > 
> 

> > > > I was able to set the debug_ms to 10 and capture the logs from
> > > > a
> > > > failed OSD. There were a few different reasons the osds were
> > > > going
> > > > down. This example shows it terminating normally for an
> > > > unspecified
> > > > reason a minute after it notices it is marked down in the map.
> > > 
> > 
> 

> > > > Osd 25 actually marks this osd (osd 35) down. For some reason
> > > > many
> > > > osds cannot communicate with each other.
> > > 
> > 
> 

> > > > There are other examples where I see the "heartbeat_check: no
> > > > reply from osd.blah" message for long periods of time (hours)
> > > > and neither osd crashes nor terminates.
> > > 
> > 
> 

> > > > 2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff
> > > > 2014-12-01
> > > > 16:26:46.772608)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff
> > > > 2014-12-01
> > > > 16:26:47.772759)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff
> > > > 2014-12-01
> > > > 16:26:48.772982)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:06.056972 (cutoff
> > > > 2014-12-01
> > > > 16:26:49.559891)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff
> > > > 2014-12-01
> > > > 16:26:49.773173)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff
> > > > 2014-12-01
> > > > 16:26:50.773299)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:09.559087 (cutoff
> > > > 2014-12-01
> > > > 16:26:51.261554)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff
> > > > 2014-12-01
> > > > 16:26:51.773504)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff
> > > > 2014-12-01
> > > > 16:26:52.773733)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff
> > > > 2014-12-01
> > > > 16:26:53.773876)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:11.260129 (cutoff
> > > > 2014-12-01
> > > > 16:26:54.163366)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.507632 7f8b4fb7f700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.5:6802/2755 pipe(0x2af06940 sd=57 :51521 s=2 pgs=384
> > > > cs=1
> > > > l=0 c=0x2af094a0).fault with nothing to send, going to standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.511704 7f8b37af1700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.2:6812/34015988 pipe(0x2af06c00 sd=69 :41512 s=2
> > > > pgs=38842
> > > > cs=1 l=0 c=0x2af09600).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.511966 7f8b5030c700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.4:6802/40022302 pipe(0x30cbcdc0 sd=93 :6802 s=2
> > > > pgs=66722
> > > > cs=3 l=0 c=0x2af091e0).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.514744 7f8b548a5700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.2:6800/9016639 pipe(0x2af04dc0 sd=38 :60965 s=2
> > > > pgs=11747
> > > > cs=1 l=0 c=0x2af086e0).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.516712 7f8b349c7700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.2:6802/25277 pipe(0x2b04cc00 sd=166 :6802 s=2 pgs=62
> > > > cs=1
> > > > l=0 c=0x2b043080).fault with nothing to send, going to standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.516814 7f8b2bd3b700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.4:6804/16770 pipe(0x30cbd600 sd=79 :6802 s=2 pgs=607
> > > > cs=3
> > > > l=0 c=0x2af08c60).fault with nothing to send, going to standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.518439 7f8b2a422700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.5:6806/31172 pipe(0x30cbc840 sd=28 :6802 s=2 pgs=22
> > > > cs=1
> > > > l=0
> > > > c=0x3041f5a0).fault with nothing to send, going to standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.518883 7f8b589ba700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.1:6803/4031631 pipe(0x2af042c0 sd=32 :58296 s=2
> > > > pgs=35500
> > > > cs=3 l=0 c=0x2af08160).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.519271 7f8b5040d700 0 --
> > > > 172.1.2.6:6802/5210
> > > > >>
> > > > 172.1.2.2:6816/32028847 pipe(0x2af05e40 sd=49 :54016 s=2
> > > > pgs=30500
> > > > cs=1 l=0 c=0x2af08f20).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:14.774081 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff
> > > > 2014-12-01
> > > > 16:26:54.774073)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:15.774290 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff
> > > > 2014-12-01
> > > > 16:26:55.774281)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:16.774480 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff
> > > > 2014-12-01
> > > > 16:26:56.774471)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:17.774670 7f8b642d1700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff
> > > > 2014-12-01
> > > > 16:26:57.774661)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:18.264884 7f8b3b1fe700 -1 osd.35 79679
> > > > heartbeat_check: no reply from osd.25 since back 2014-12-01
> > > > 16:25:51.310319 front 2014-12-01 16:27:14.161820 (cutoff
> > > > 2014-12-01
> > > > 16:26:58.264882)
> > > 
> > 
> 
> > > > 2014-12-01 16:27:18.268852 7f8b4c220700 0 log_channel(default)
> > > > log
> > > > [WRN] : map e79681 wrongly marked me down
> > > 
> > 
> 
> > > > 2014-12-01 16:27:22.290362 7f8b37df4700 0 --
> > > > 172.1.2.6:6806/1005210
> > > > >> 172.1.2.4:6804/16770 pipe(0x2c2ef8c0 sd=75 :44216 s=2
> > > > >> pgs=632
> > > > cs=1 l=0 c=0x1bce6940).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:22.866677 7f8b56662700 0 --
> > > > 172.1.2.6:6806/1005210
> > > > >> 172.1.2.2:6808/9013936 pipe(0x2f5e4840 sd=55 :41925 s=2
> > > > >> pgs=15111
> > > > cs=1 l=0 c=0x30828580).fault with nothing to send, going to
> > > > standby
> > > 
> > 
> 
> > > > 2014-12-01 16:27:24.854642 7f8b2ed6b700 0 --
> > > > 172.1.2.6:6806/1005210
> > > > >> 172.1.2.6:6816/62000664 pipe(0x2c3c78c0 sd=206 :6806 s=0
> > > > >> pgs=0
> > > > cs=0 l=0 c=0x2ccb3e40).accept connect_seq 0 vs existing 0 state
> > > > connecting
> > > 
> > 
> 
> > > > 2014-12-01 16:27:25.265306 7f8b2af2d700 0 --
> > > > 172.1.1.6:6806/5210
> > > > >>
> > > > 172.1.1.54:0/586219983 pipe(0x2d46bc80 sd=246 :6806 s=0 pgs=0
> > > > cs=0
> > > > l=0 c=0x1bdacdc0).accept peer addr is really
> > > > 172.1.1.54:0/586219983 (socket is 172.1.1.54:36423/0)
> > > 
> > 
> 
> > > > 2014-12-01 16:28:45.732468 7f8b368df700 -1 osd.35 79691 *** Got
> > > > signal Terminated ***
> > > 
> > 
> 
> > > > 2014-12-01 16:28:45.732591 7f8b368df700 0 osd.35 79691
> > > > prepare_to_stop telling mon we are shutting down
> > > 
> > 
> 
> > > > 2014-12-01 16:28:46.586316 7f8b2236a700 0 --
> > > > 172.1.2.6:6806/1005210
> > > > >> 172.1.2.1:6807/91014386 pipe(0x1cabb700 sd=32 :53651 s=2
> > > > pgs=37459 cs=1 l=0 c=0x1bce5e40).fault with nothing to send,
> > > > going
> > > > to standby
> > > 
> > 
> 
> > > > 2014-12-01 16:28:46.593615 7f8b4c220700 0 osd.35 79691
> > > > got_stop_ack
> > > > starting shutdown
> > > 
> > 
> 
> > > > 2014-12-01 16:28:46.593662 7f8b368df700 0 osd.35 79691
> > > > prepare_to_stop starting shutdown
> > > 
> > 
> 
> > > > 2014-12-01 16:28:46.593682 7f8b368df700 -1 osd.35 79691
> > > > shutdown
> > > 
> > 
> 

> > > > Another example I found shows this same osd crashing with "hit
> > > > suicide timeout".
> > > 
> > 
> 

> > > > -4> 2014-12-01 15:50:52.333350 7fecda368700 10 --
> > > > 172.1.2.6:6805/1031451 >> 172.1.2.1:0/32541 pipe(0x2b70a680
> > > > sd=104
> > > > :6805 s=2 pgs=2692 cs=1 l=1 c=0x2d754680).writer: state = open
> > > > policy.server=1
> > > 
> > 
> 
> > > > -3> 2014-12-01 15:50:52.333348 7fecd9d62700 10 --
> > > > 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80
> > > > sd=94
> > > > :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open
> > > > policy.server=1
> > > 
> > 
> 
> > > > -2> 2014-12-01 15:50:52.333369 7fecd9d62700 10 --
> > > > 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80
> > > > sd=94
> > > > :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).write_ack 10
> > > 
> > 
> 
> > > > -1> 2014-12-01 15:50:52.333386 7fecd9d62700 10 --
> > > > 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80
> > > > sd=94
> > > > :6819 s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open
> > > > policy.server=1
> > > 
> > 
> 
> > > > 0> 2014-12-01 15:50:52.531714 7fed13221700 -1
> > > > common/HeartbeatMap.cc:
> > > > In function 'bool
> > > > ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
> > > > char*,
> > > > time_t)' thread 7fed13221700 time 2014-12-
> > > 
> > 
> 
> > > > 01 15:50:52.508265
> > > 
> > 
> 
> > > > common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide
> > > > timeout")
> > > 
> > 
> 

> > > > ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> > > 
> > 
> 
> > > > 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
> > > > char
> > > > const*)+0x8b) [0xb8231b]
> > > 
> > 
> 
> > > > 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> > > > const*, long)+0x2a9) [0xac0e19]
> > > 
> > 
> 
> > > > 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
> > > 
> > 
> 
> > > > 4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
> > > 
> > 
> 
> > > > 5: (CephContextServiceThread::entry()+0x154) [0xb96844]
> > > 
> > 
> 
> > > > 6: (()+0x8182) [0x7fed175c7182]
> > > 
> > 
> 
> > > > 7: (clone()+0x6d) [0x7fed15b31fbd]
> > > 
> > 
> 
> > > > NOTE: a copy of the executable, or `objdump -rdS <executable>`
> > > > is
> > > > needed to interpret this.
> > > 
> > 
> 

> > > > --- logging levels ---
> > > 
> > 
> 
> > > > 0/ 5 none
> > > 
> > 
> 
> > > > 0/ 0 lockdep
> > > 
> > 
> 
> > > > 0/ 0 context
> > > 
> > 
> 
> > > > 0/ 0 crush
> > > 
> > 
> 

> > > > 1/ 5 mds
> > > 
> > 
> 
> > > > 1/ 5 mds_balancer
> > > 
> > 
> 
> > > > 1/ 5 mds_locker
> > > 
> > 
> 
> > > > 1/ 5 mds_log
> > > 
> > 
> 
> > > > 1/ 5 mds_log_expire
> > > 
> > 
> 
> > > > 1/ 5 mds_migrator
> > > 
> > 
> 
> > > > 0/ 0 buffer
> > > 
> > 
> 
> > > > 0/ 0 timer
> > > 
> > 
> 
> > > > 0/ 1 filer
> > > 
> > 
> 
> > > > 0/ 1 striper
> > > 
> > 
> 
> > > > 0/ 1 objecter
> > > 
> > 
> 
> > > > 0/ 5 rados
> > > 
> > 
> 
> > > > 0/ 5 rbd
> > > 
> > 
> 
> > > > 0/ 5 rbd_replay
> > > 
> > 
> 
> > > > 0/ 0 journaler
> > > 
> > 
> 
> > > > 0/ 5 objectcacher
> > > 
> > 
> 
> > > > 0/ 5 client
> > > 
> > 
> 
> > > > 0/ 0 osd
> > > 
> > 
> 
> > > > 0/ 0 optracker
> > > 
> > 
> 
> > > > 0/ 0 objclass
> > > 
> > 
> 
> > > > 0/ 0 filestore
> > > 
> > 
> 
> > > > 1/ 3 keyvaluestore
> > > 
> > 
> 
> > > > 0/ 0 journal
> > > 
> > 
> 
> > > > 10/10 ms
> > > 
> > 
> 
> > > > 1/ 5 mon
> > > 
> > 
> 
> > > > 0/ 0 monc
> > > 
> > 
> 
> > > > 1/ 5 paxos
> > > 
> > 
> 
> > > > 0/ 0 tp
> > > 
> > 
> 
> > > > 0/ 0 auth
> > > 
> > 
> 
> > > > 1/ 5 crypto
> > > 
> > 
> 
> > > > 0/ 0 finisher
> > > 
> > 
> 
> > > > 0/ 0 heartbeatmap
> > > 
> > 
> 
> > > > 0/ 0 perfcounter
> > > 
> > 
> 
> > > > 1/ 5 rgw
> > > 
> > 
> 
> > > > 1/10 civetweb
> > > 
> > 
> 
> > > > 1/ 5 javaclient
> > > 
> > 
> 
> > > > 0/ 0 asok
> > > 
> > 
> 
> > > > 0/ 0 throttle
> > > 
> > 
> 
> > > > 0/ 0 refs
> > > 
> > 
> 
> > > > -2/-2 (syslog threshold)
> > > 
> > 
> 
> > > > -1/-1 (stderr threshold)
> > > 
> > 
> 
> > > > max_recent 10000
> > > 
> > 
> 
> > > > max_new 1000
> > > 
> > 
> 
> > > > log_file /var/log/ceph/ceph-osd.35.log
> > > 
> > 
> 
> > > > --- end dump of recent events ---
> > > 
> > 
> 
> > > > 2014-12-01 15:50:52.627789 7fed13221700 -1 *** Caught signal
> > > > (Aborted) **
> > > 
> > 
> 
> > > > in thread 7fed13221700
> > > 
> > 
> 

> > > > ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> > > 
> > 
> 
> > > > 1: /usr/bin/ceph-osd() [0xa9767a]
> > > 
> > 
> 
> > > > 2: (()+0x10340) [0x7fed175cf340]
> > > 
> > 
> 
> > > > 3: (gsignal()+0x39) [0x7fed15a6dbb9]
> > > 
> > 
> 
> > > > 4: (abort()+0x148) [0x7fed15a70fc8]
> > > 
> > 
> 
> > > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x155)
> > > > [0x7fed163796b5]
> > > 
> > 
> 
> > > > 6: (()+0x5e836) [0x7fed16377836]
> > > 
> > 
> 
> > > > 7: (()+0x5e863) [0x7fed16377863]
> > > 
> > 
> 
> > > > 8: (()+0x5eaa2) [0x7fed16377aa2]
> > > 
> > 
> 
> > > > 9: (ceph::__ceph_assert_fail(char const*, char const*, int,
> > > > char
> > > > const*)+0x278) [0xb82508]
> > > 
> > 
> 
> > > > 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> > > > const*, long)+0x2a9) [0xac0e19]
> > > 
> > 
> 
> > > > 11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
> > > 
> > 
> 
> > > > 12: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
> > > 
> > 
> 
> > > > 13: (CephContextServiceThread::entry()+0x154) [0xb96844]
> > > 
> > 
> 
> > > > 14: (()+0x8182) [0x7fed175c7182]
> > > 
> > 
> 
> > > > 15: (clone()+0x6d) [0x7fed15b31fbd]
> > > 
> > 
> 
> > > > NOTE: a copy of the executable, or `objdump -rdS <executable>`
> > > > is
> > > > needed to interpret this.
> > > 
> > 
> 

> > > > --- logging levels ---
> > > 
> > 
> 
> > > > 0/ 5 none
> > > 
> > 
> 
> > > > 0/ 0 lockdep
> > > 
> > 
> 
> > > > 0/ 0 context
> > > 
> > 
> 

> > > > 0/ 0 crush
> > > 
> > 
> 
> > > > 1/ 5 mds
> > > 
> > 
> 
> > > > 1/ 5 mds_balancer
> > > 
> > 
> 
> > > > 1/ 5 mds_locker
> > > 
> > 
> 
> > > > 1/ 5 mds_log
> > > 
> > 
> 
> > > > 1/ 5 mds_log_expire
> > > 
> > 
> 
> > > > 1/ 5 mds_migrator
> > > 
> > 
> 
> > > > 0/ 0 buffer
> > > 
> > 
> 
> > > > 0/ 0 timer
> > > 
> > 
> 
> > > > 0/ 1 filer
> > > 
> > 
> 
> > > > 0/ 1 striper
> > > 
> > 
> 
> > > > 0/ 1 objecter
> > > 
> > 
> 
> > > > 0/ 5 rados
> > > 
> > 
> 
> > > > 0/ 5 rbd
> > > 
> > 
> 
> > > > 0/ 5 rbd_replay
> > > 
> > 
> 
> > > > 0/ 0 journaler
> > > 
> > 
> 
> > > > 0/ 5 objectcacher
> > > 
> > 
> 
> > > > 0/ 5 client
> > > 
> > 
> 
> > > > 0/ 0 osd
> > > 
> > 
> 
> > > > 0/ 0 optracker
> > > 
> > 
> 
> > > > 0/ 0 objclass
> > > 
> > 
> 
> > > > 0/ 0 filestore
> > > 
> > 
> 
> > > > 1/ 3 keyvaluestore
> > > 
> > 
> 

> > > > 0/ 0 journal
> > > 
> > 
> 
> > > > 10/10 ms
> > > 
> > 
> 
> > > > 1/ 5 mon
> > > 
> > 
> 
> > > > 0/ 0 monc
> > > 
> > 
> 
> > > > 1/ 5 paxos
> > > 
> > 
> 
> > > > 0/ 0 tp
> > > 
> > 
> 
> > > > 0/ 0 auth
> > > 
> > 
> 
> > > > 1/ 5 crypto
> > > 
> > 
> 
> > > > 0/ 0 finisher
> > > 
> > 
> 
> > > > 0/ 0 heartbeatmap
> > > 
> > 
> 
> > > > 0/ 0 perfcounter
> > > 
> > 
> 
> > > > 1/ 5 rgw
> > > 
> > 
> 
> > > > 1/10 civetweb
> > > 
> > 
> 
> > > > 1/ 5 javaclient
> > > 
> > 
> 
> > > > 0/ 0 asok
> > > 
> > 
> 
> > > > 0/ 0 throttle
> > > 
> > 
> 
> > > > 0/ 0 refs
> > > 
> > 
> 
> > > > -2/-2 (syslog threshold)
> > > 
> > 
> 
> > > > -1/-1 (stderr threshold)
> > > 
> > 
> 
> > > > max_recent 10000
> > > 
> > 
> 
> > > > max_new 1000
> > > 
> > 
> 
> > > > log_file /var/log/ceph/ceph-osd.35.log
> > > 
> > 
> 
> > > > --- end dump of recent events ---
> > > 
> > 
> 

> > > > The recurring theme here is that there is a communication issue
> > > > between the osds.
> > > 
> > 
> 

> > > > I looked carefully at my network hardware configuration (UCS
> > > > C240s
> > > > with 40Gbps Cisco VICs connected to a pair of Nexus 5672 using
> > > > A-FEX
> > > > Port Profile configuration) and couldn't find any dropped
> > > > packets
> > > > or
> > > > errors.
> > > 
> > 
> 

> > > > I ran "ss -s" for the first time on my osds and was a bit
> > > > suprised
> > > > to
> > > > see how many open TCP connections they all have.
> > > 
> > 
> 

> > > > ceph@osd6:/var/log/ceph$ ss -s
> > > 
> > 
> 
> > > > Total: 1492 (kernel 0)
> > > 
> > 
> 
> > > > TCP: 1411 (estab 1334, closed 40, orphaned 0, synrecv 0,
> > > > timewait
> > > > 0/0), ports 0
> > > 
> > 
> 

> > > > Transport Total IP IPv6
> > > 
> > 
> 
> > > > * 0 - -
> > > 
> > 
> 
> > > > RAW 0 0 0
> > > 
> > 
> 
> > > > UDP 10 4 6
> > > 
> > 
> 
> > > > TCP 1371 1369 2
> > > 
> > 
> 
> > > > INET 1381 1373 8
> > > 
> > 
> 
> > > > FRAG 0 0 0
> > > 
> > 
> 
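
If it helps anyone reading this later: to count only the sockets
belonging to the osd processes, something like the following should
work (run as root so that ss can show process names):

ss -tnp | grep ceph-osd | wc -l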

> > > > While researching whether additional kernel tuning would be
> > > > required to handle so many connections, I eventually realized
> > > > that I had forgotten to copy my customized /etc/sysctl.conf file
> > > > to the two new nodes. I'm not sure if the large number of TCP
> > > > connections is part of the performance enhancements between
> > > > Giant and Firefly, or if Firefly uses a similar number of
> > > > connections.
> > > 
> > 
> 

> > > > I highlighted the tuning parameters that I suspect are
> > > > required.
> > > 
> > 
> 

> > > > ceph@osd6:/var/log/ceph$ cat /etc/sysctl.conf
> > > 
> > 
> 
> > > > #
> > > 
> > 
> 
> > > > # /etc/sysctl.conf - Configuration file for setting system
> > > > variables
> > > 
> > 
> 
> > > > # See /etc/sysctl.d/ for additional system variables
> > > 
> > 
> 
> > > > # See sysctl.conf (5) for information.
> > > 
> > 
> 
> > > > #
> > > 
> > 
> 

> > > > # Increase Linux autotuning TCP buffer limits
> > > 
> > 
> 
> > > > # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104)
> > > > for
> > > > 10GE
> > > 
> > 
> 
> > > > # Don't set tcp_mem itself! Let the kernel scale it based on
> > > > RAM.
> > > 
> > 
> 
> > > > net.core.rmem_max = 56623104
> > > 
> > 
> 
> > > > net.core.wmem_max = 56623104
> > > 
> > 
> 
> > > > net.core.rmem_default = 56623104
> > > 
> > 
> 
> > > > net.core.wmem_default = 56623104
> > > 
> > 
> 
> > > > net.core.optmem_max = 40960
> > > 
> > 
> 
> > > > net.ipv4.tcp_rmem = 4096 87380 56623104
> > > 
> > 
> 
> > > > net.ipv4.tcp_wmem = 4096 65536 56623104
> > > 
> > 
> 

> > > > # Make room for more TIME_WAIT sockets due to more clients,
> > > 
> > 
> 
> > > > # and allow them to be reused if we run out of sockets
> > > 
> > 
> 
> > > > # Also increase the max packet backlog
> > > 
> > 
> 
> > > > net.core.somaxconn = 1024
> > > 
> > 
> 
> > > > net.core.netdev_max_backlog = 50000
> > > 
> > 
> 
> > > > net.ipv4.tcp_max_syn_backlog = 30000
> > > 
> > 
> 
> > > > net.ipv4.tcp_max_tw_buckets = 2000000
> > > 
> > 
> 
> > > > net.ipv4.tcp_tw_reuse = 1
> > > 
> > 
> 
> > > > net.ipv4.tcp_fin_timeout = 10
> > > 
> > 
> 

> > > > # Disable TCP slow start on idle connections
> > > 
> > 
> 
> > > > net.ipv4.tcp_slow_start_after_idle = 0
> > > 
> > 
> 

> > > > # If your servers talk UDP, also up these limits
> > > 
> > 
> 
> > > > net.ipv4.udp_rmem_min = 8192
> > > 
> > 
> 
> > > > net.ipv4.udp_wmem_min = 8192
> > > 
> > 
> 

> > > > # Disable source routing and redirects
> > > 
> > 
> 
> > > > net.ipv4.conf.all.send_redirects = 0
> > > 
> > 
> 
> > > > net.ipv4.conf.all.accept_redirects = 0
> > > 
> > 
> 
> > > > net.ipv4.conf.all.accept_source_route = 0
> > > 
> > 
> 

> > > > I added the net.core.somaxconn setting after this experience,
> > > > since the default is 128. It represents the allowed socket
> > > > listen backlog in the kernel, which should help when I reboot an
> > > > osd node and 1300 connections need to be made quickly.
> > > 
> > 
> 
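
For what it's worth, net.core.somaxconn caps the backlog a socket can
request in listen(), and it is only consulted when the socket starts
listening, so a daemon restart is needed before a new value takes
effect; that would also explain the osd restarts mentioned just below.
The running value can be checked with:

sysctl net.core.somaxconn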

> > > > I found that I needed to restart my osds after applying the
> > > > kernel
> > > > tuning above for my cluster to stabilize.
> > > 
> > 
> 

> > > > My system is now stable again and performs very well.
> > > 
> > 
> 

> > > > I hope this helps someone out. It took me a few days of
> > > > troubleshooting to get to the bottom of this.
> > > 
> > 
> 

> > > > Jake
> > > 
> > 
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
