Forgot to copy the list.

I basically cobbled together the settings from examples on the internet.
>
> I basically modified this sysctl.conf file with his suggestion for 10gb
> nics
> http://www.nateware.com/linux-network-tuning-for-2013.html#.VIG_44eLTII
>
> I found these sites helpful as well:
>
> http://fasterdata.es.net/host-tuning/linux/
>
> This may be of interest to you, it has suggestions for your Mellanox
> hardware:
> https://fasterdata.es.net/host-tuning/nic-tuning/mellanox-connectx-3/
>
> Fermilab website, link to a university research paper:
>
> https://indico.fnal.gov/getFile.py/access?contribId=30&sessionId=19&resId=0&materialId=paper&confId=3377
>
> This has a great answer that explains different configurations for servers
> vs clients.  It seems to me that osds are both servers and clients, so
> maybe some of the client tuning would benefit osds as well.  This is where
> I got the somaxconn setting from.
>
> http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
>
>
> I forgot to mention, I'm also setting the txqueuelen for my ceph public
> nic and ceph private nic in the /etc/rc.local file:
> /sbin/ifconfig eth0 txqueuelen 10000
> /sbin/ifconfig eth1 txqueuelen 10000
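>
> (If you prefer iproute2 over the legacy ifconfig, the equivalent, assuming
> the same eth0/eth1 interface names, would be something like:)
>
> /sbin/ip link set dev eth0 txqueuelen 10000
> /sbin/ip link set dev eth1 txqueuelen 10000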
>
>
>
> I do push the same sysctl.conf and rc.local to all of my clients as well.
> The clients are iSCSI servers which serve vmware hosts.  My ceph cluster is
> rbd only and I currently only have the iSCSI proxy server clients.  We'll
> be adding some KVM hypervisors soon, and I'm interested to see how they perform
> vs my vmware --> iSCSI Server --> Ceph setup.
>
>
> Regarding your sysctl.conf file:
>
> I've read on a few different sites that net.ipv4.tcp_mem should not be
> tuned, since the defaults are good.  I have not set it, and I can't speak
> to the benefit/problems with setting it.
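>
> (If you want to see what the kernel auto-sized tcp_mem to on your boxes,
> you can just check the running value, something like:)
>
> sysctl net.ipv4.tcp_mem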
>
> You're configured to use only a 4MB TCP buffer, which is very small.  It
> is actually smaller than the stock kernel default for tcp_rmem, which maxes
> out at 6MB.  The link
> above suggests up to a 128MB TCP buffer for the 40gb Mellanox and/or 10gb
> over a WAN (not sure how to read that).  I'm using a 54MB buffer, but I may
> increase mine to 128MB to see if there is any benefit.  That 4MB buffer may
> be your problem.
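>
> (For illustration, a 128MB maximum, 134217728 bytes, would look something
> like this, leaving the min/default autotuning values alone:)
>
> net.core.rmem_max = 134217728
> net.core.wmem_max = 134217728
> net.ipv4.tcp_rmem = 4096 87380 134217728
> net.ipv4.tcp_wmem = 4096 65536 134217728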
>
> Your net.core.netdev_max_backlog is 5x bigger than mine.  I think I'll
> increase my setting to 250000 as well.
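>
> (That one can be tried at runtime first, before committing it to
> sysctl.conf, with something like:)
>
> sysctl -w net.core.netdev_max_backlog=250000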
>
> Our issue looks like http://tracker.ceph.com/issues/9844 and my crash
> looks like http://tracker.ceph.com/issues/9788
>
>
>
> On Fri, Dec 5, 2014 at 5:35 AM, Andrei Mikhailovsky <and...@arhont.com> wrote:
>
>> Jake,
>>
>> very useful indeed.
>>
>> It looks like I had a similar problem regarding the heartbeat and, as you
>> have mentioned, I've not seen such issues on Firefly. However, I've not
>> seen any osd crashes.
>>
>>
>>
>> Could you please let me know where you got the sysctl.conf tunings from?
>> Was it recommended by the network vendor?
>>
>> Also, did you make similar sysctl.conf changes to your host servers?
>>
>> A while ago I read the tuning guide for IP over InfiniBand, and Mellanox
>> recommends setting something like this:
>>
>> net.ipv4.tcp_timestamps = 0
>> net.ipv4.tcp_sack = 1
>> net.core.netdev_max_backlog = 250000
>> net.core.rmem_max = 4194304
>> net.core.wmem_max = 4194304
>> net.core.rmem_default = 4194304
>> net.core.wmem_default = 4194304
>> net.core.optmem_max = 4194304
>> net.ipv4.tcp_rmem = 4096 87380 4194304
>> net.ipv4.tcp_wmem = 4096 65536 4194304
>> net.ipv4.tcp_mem = 4194304 4194304 4194304
>> net.ipv4.tcp_low_latency=1
>>
>>
>> which is what I have. Not sure if these are optimal.
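>>
>> (One quick sanity check, assuming the file has been loaded with sysctl -p,
>> is to compare the running values against it, e.g.:)
>>
>> sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem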
>>
>> I can see that the values are pretty conservative compared to yours. I
>> guess my values should be different as I am running a 40gbit/s network with
>> IPoIB. The actual throughput on IPoIB is about 20gbit/s according to iperf
>> and the like.
>>
>> Andrei
>>
>>
>> ------------------------------
>>
>> *From: *"Jake Young" <jak3...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','jak3...@gmail.com');>>
>> *To: *"Andrei Mikhailovsky" <and...@arhont.com
>> <javascript:_e(%7B%7D,'cvml','and...@arhont.com');>>
>> *Cc: *ceph-users@lists.ceph.com
>> <javascript:_e(%7B%7D,'cvml','ceph-users@lists.ceph.com');>
>> *Sent: *Thursday, 4 December, 2014 4:57:47 PM
>> *Subject: *Re: [ceph-users] Giant osd problems - loss of IO
>>
>>
>>
>> On Fri, Nov 14, 2014 at 4:38 PM, Andrei Mikhailovsky <and...@arhont.com> wrote:
>> >
>> > Any other suggestions why several osds are going down on Giant and
>> causing IO to stall? This was not happening on Firefly.
>> >
>> > Thanks
>> >
>> >
>>
>> I had a very similar problem to yours, which started after I upgraded from
>> Firefly to Giant and then later added two new osd nodes, with 7 osds on
>> each.
>>
>> My cluster originally had 4 nodes, with 7 osds on each node, 28 osds
>> total, running Giant.  I did not have any problems at that point.
>>
>> My problems started after adding two new nodes, so I had 6 nodes and 42
>> total osds.  It would run fine under low load, but when the request load
>> increased, osds started to fall over.
>>
>>
>> I was able to set the debug_ms to 10 and capture the logs from a failed
>> OSD.  There were a few different reasons the osds were going down.  This
>> example shows it terminating normally for an unspecified reason a minute
>> after it notices it is marked down in the map.
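>>
>> (For anyone trying to reproduce this, one way to raise the messenger debug
>> level on a running osd, assuming the usual tell/injectargs interface, is
>> something like:)
>>
>> ceph tell osd.35 injectargs '--debug_ms 10/10'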
>>
>> Osd 25 actually marks this osd (osd 35) down.  For some reason many osds
>> cannot communicate with each other.
>>
>> There are other examples where I see the "heartbeat_check: no reply from
>> osd.blah" message for long periods of time (hours) and neither osd crashes
>> nor terminates.
>>
>> 2014-12-01 16:27:06.772616 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:06.056972 (cutoff 2014-12-01 16:26:46.772608)
>> 2014-12-01 16:27:07.772767 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:06.056972 (cutoff 2014-12-01 16:26:47.772759)
>> 2014-12-01 16:27:08.772990 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:06.056972 (cutoff 2014-12-01 16:26:48.772982)
>> 2014-12-01 16:27:09.559894 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:06.056972 (cutoff 2014-12-01 16:26:49.559891)
>> 2014-12-01 16:27:09.773177 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:09.559087 (cutoff 2014-12-01 16:26:49.773173)
>> 2014-12-01 16:27:10.773307 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:09.559087 (cutoff 2014-12-01 16:26:50.773299)
>> 2014-12-01 16:27:11.261557 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:09.559087 (cutoff 2014-12-01 16:26:51.261554)
>> 2014-12-01 16:27:11.773512 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:11.260129 (cutoff 2014-12-01 16:26:51.773504)
>> 2014-12-01 16:27:12.773741 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:11.260129 (cutoff 2014-12-01 16:26:52.773733)
>> 2014-12-01 16:27:13.773884 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:11.260129 (cutoff 2014-12-01 16:26:53.773876)
>> 2014-12-01 16:27:14.163369 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:11.260129 (cutoff 2014-12-01 16:26:54.163366)
>> 2014-12-01 16:27:14.507632 7f8b4fb7f700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.5:6802/2755 pipe(0x2af06940 sd=57 :51521 s=2 pgs=384 cs=1 l=0
>> c=0x2af094a0).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.511704 7f8b37af1700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.2:6812/34015988 pipe(0x2af06c00 sd=69 :41512 s=2 pgs=38842 cs=1
>> l=0 c=0x2af09600).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.511966 7f8b5030c700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.4:6802/40022302 pipe(0x30cbcdc0 sd=93 :6802 s=2 pgs=66722 cs=3
>> l=0 c=0x2af091e0).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.514744 7f8b548a5700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.2:6800/9016639 pipe(0x2af04dc0 sd=38 :60965 s=2 pgs=11747 cs=1
>> l=0 c=0x2af086e0).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.516712 7f8b349c7700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.2:6802/25277 pipe(0x2b04cc00 sd=166 :6802 s=2 pgs=62 cs=1 l=0
>> c=0x2b043080).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.516814 7f8b2bd3b700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.4:6804/16770 pipe(0x30cbd600 sd=79 :6802 s=2 pgs=607 cs=3 l=0
>> c=0x2af08c60).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.518439 7f8b2a422700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.5:6806/31172 pipe(0x30cbc840 sd=28 :6802 s=2 pgs=22 cs=1 l=0
>> c=0x3041f5a0).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.518883 7f8b589ba700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.1:6803/4031631 pipe(0x2af042c0 sd=32 :58296 s=2 pgs=35500 cs=3
>> l=0 c=0x2af08160).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.519271 7f8b5040d700  0 -- 172.1.2.6:6802/5210 >>
>> 172.1.2.2:6816/32028847 pipe(0x2af05e40 sd=49 :54016 s=2 pgs=30500 cs=1
>> l=0 c=0x2af08f20).fault with nothing to send, going to standby
>> 2014-12-01 16:27:14.774081 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:14.161820 (cutoff 2014-12-01 16:26:54.774073)
>> 2014-12-01 16:27:15.774290 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:14.161820 (cutoff 2014-12-01 16:26:55.774281)
>> 2014-12-01 16:27:16.774480 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:14.161820 (cutoff 2014-12-01 16:26:56.774471)
>> 2014-12-01 16:27:17.774670 7f8b642d1700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:14.161820 (cutoff 2014-12-01 16:26:57.774661)
>> 2014-12-01 16:27:18.264884 7f8b3b1fe700 -1 osd.35 79679 heartbeat_check:
>> no reply from osd.25 since back 2014-12-01 16:25:51.310319 front 2014-12-01
>> 16:27:14.161820 (cutoff 2014-12-01 16:26:58.264882)
>> 2014-12-01 16:27:18.268852 7f8b4c220700  0 log_channel(default) log [WRN]
>> : *map e79681 wrongly marked me down*
>> 2014-12-01 16:27:22.290362 7f8b37df4700  0 -- 172.1.2.6:6806/1005210 >>
>> 172.1.2.4:6804/16770 pipe(0x2c2ef8c0 sd=75 :44216 s=2 pgs=632 cs=1 l=0
>> c=0x1bce6940).fault with nothing to send, going to standby
>> 2014-12-01 16:27:22.866677 7f8b56662700  0 -- 172.1.2.6:6806/1005210 >>
>> 172.1.2.2:6808/9013936 pipe(0x2f5e4840 sd=55 :41925 s=2 pgs=15111 cs=1
>> l=0 c=0x30828580).fault with nothing to send, going to standby
>> 2014-12-01 16:27:24.854642 7f8b2ed6b700  0 -- 172.1.2.6:6806/1005210 >>
>> 172.1.2.6:6816/62000664 pipe(0x2c3c78c0 sd=206 :6806 s=0 pgs=0 cs=0 l=0
>> c=0x2ccb3e40).accept connect_seq 0 vs existing 0 state connecting
>> 2014-12-01 16:27:25.265306 7f8b2af2d700  0 -- 172.1.1.6:6806/5210 >>
>> 172.1.1.54:0/586219983 pipe(0x2d46bc80 sd=246 :6806 s=0 pgs=0 cs=0 l=0
>> c=0x1bdacdc0).accept peer addr is really 172.1.1.54:0/586219983 (socket is 172.1.1.54:36423/0)
>> 2014-12-01 16:28:45.732468 7f8b368df700 -1 osd.35 79691 **** Got signal
>> Terminated ****
>> 2014-12-01 16:28:45.732591 7f8b368df700  0 osd.35 79691 prepare_to_stop
>> telling mon we are shutting down
>> 2014-12-01 16:28:46.586316 7f8b2236a700  0 -- 172.1.2.6:6806/1005210 >>
>> 172.1.2.1:6807/91014386 pipe(0x1cabb700 sd=32 :53651 s=2 pgs=37459 cs=1
>> l=0 c=0x1bce5e40).fault with nothing to send, going to standby
>> 2014-12-01 16:28:46.593615 7f8b4c220700  0 osd.35 79691 got_stop_ack
>> starting shutdown
>> 2014-12-01 16:28:46.593662 7f8b368df700  0 osd.35 79691 prepare_to_stop
>> starting shutdown
>> 2014-12-01 16:28:46.593682 7f8b368df700 -1 osd.35 79691 shutdown
>>
>>
>>
>> Another example I found shows this same osd crashing with "hit suicide
>> timeout".
>>
>>     -4> 2014-12-01 15:50:52.333350 7fecda368700 10 --
>> 172.1.2.6:6805/1031451 >> 172.1.2.1:0/32541 pipe(0x2b70a680 sd=104 :6805
>> s=2 pgs=2692 cs=1 l=1 c=0x2d754680).writer: state = open policy.server=1
>>     -3> 2014-12-01 15:50:52.333348 7fecd9d62700 10 --
>> 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94 :6819
>> s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open policy.server=1
>>     -2> 2014-12-01 15:50:52.333369 7fecd9d62700 10 --
>> 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94 :6819
>> s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).write_ack 10
>>     -1> 2014-12-01 15:50:52.333386 7fecd9d62700 10 --
>> 172.1.1.6:6819/1031451 >> 172.1.2.1:0/32541 pipe(0x4065c80 sd=94 :6819
>> s=2 pgs=2689 cs=1 l=1 c=0x2d7538c0).writer: state = open policy.server=1
>>      0> 2014-12-01 15:50:52.531714 7fed13221700 -1
>> common/HeartbeatMap.cc: In function 'bool
>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)'
>> thread 7fed13221700 time 2014-12-01 15:50:52.508265
>> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>
>>  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x8b) [0xb8231b]
>>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>> long)+0x2a9) [0xac0e19]
>>  3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
>>  4: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
>>  5: (CephContextServiceThread::entry()+0x154) [0xb96844]
>>  6: (()+0x8182) [0x7fed175c7182]
>>  7: (clone()+0x6d) [0x7fed15b31fbd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 0 lockdep
>>    0/ 0 context
>>    0/ 0 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 0 buffer
>>    0/ 0 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_replay
>>    0/ 0 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 0 osd
>>    0/ 0 optracker
>>    0/ 0 objclass
>>    0/ 0 filestore
>>    1/ 3 keyvaluestore
>>    0/ 0 journal
>>   10/10 ms
>>    1/ 5 mon
>>    0/ 0 monc
>>    1/ 5 paxos
>>    0/ 0 tp
>>    0/ 0 auth
>>    1/ 5 crypto
>>    0/ 0 finisher
>>    0/ 0 heartbeatmap
>>    0/ 0 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    0/ 0 asok
>>    0/ 0 throttle
>>    0/ 0 refs
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent     10000
>>   max_new         1000
>>   log_file /var/log/ceph/ceph-osd.35.log
>> --- end dump of recent events ---
>> 2014-12-01 15:50:52.627789 7fed13221700 -1 *** Caught signal (Aborted) **
>>  in thread 7fed13221700
>>
>>  ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>>  1: /usr/bin/ceph-osd() [0xa9767a]
>>  2: (()+0x10340) [0x7fed175cf340]
>>  3: (gsignal()+0x39) [0x7fed15a6dbb9]
>>  4: (abort()+0x148) [0x7fed15a70fc8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fed163796b5]
>>  6: (()+0x5e836) [0x7fed16377836]
>>  7: (()+0x5e863) [0x7fed16377863]
>>  8: (()+0x5eaa2) [0x7fed16377aa2]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x278) [0xb82508]
>>  10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>> long)+0x2a9) [0xac0e19]
>>  11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0xac16a6]
>>  12: (ceph::HeartbeatMap::check_touch_file()+0x17) [0xac1d87]
>>  13: (CephContextServiceThread::entry()+0x154) [0xb96844]
>>  14: (()+0x8182) [0x7fed175c7182]
>>  15: (clone()+0x6d) [0x7fed15b31fbd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 0 lockdep
>>    0/ 0 context
>>    0/ 0 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 0 buffer
>>    0/ 0 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_replay
>>    0/ 0 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 0 osd
>>    0/ 0 optracker
>>    0/ 0 objclass
>>    0/ 0 filestore
>>    1/ 3 keyvaluestore
>>    0/ 0 journal
>>   10/10 ms
>>    1/ 5 mon
>>    0/ 0 monc
>>    1/ 5 paxos
>>    0/ 0 tp
>>    0/ 0 auth
>>    1/ 5 crypto
>>    0/ 0 finisher
>>    0/ 0 heartbeatmap
>>    0/ 0 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    0/ 0 asok
>>    0/ 0 throttle
>>    0/ 0 refs
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent     10000
>>   max_new         1000
>>   log_file /var/log/ceph/ceph-osd.35.log
>> --- end dump of recent events ---
>>
>>
>>
>>
>> The recurring theme here is that there is a communication issue between
>> the osds.
>>
>> I looked carefully at my network hardware configuration (UCS C240s with
>> 40Gbps Cisco VICs connected to a pair of Nexus 5672 using A-FEX Port
>> Profile configuration) and couldn't find any dropped packets or errors.
>>
>> I ran "ss -s" for the first time on my osds and was a bit suprised to see
>> how many open TCP connections they all have.
>>
>> ceph@osd6:/var/log/ceph$ ss -s
>> Total: 1492 (kernel 0)
>> TCP:   1411 (estab 1334, closed 40, orphaned 0, synrecv 0, timewait 0/0),
>> ports 0
>>
>> Transport Total     IP        IPv6
>> *         0         -         -
>> RAW       0         0         0
>> UDP       10        4         6
>> TCP       *1371*    1369      2
>> INET      1381      1373      8
>> FRAG      0         0         0
>>
>> While researching if additional kernel tuning would be required to handle
>> so many connections, I eventually realized that I forgot to copy my
>> customized /etc/sysctl.conf file to the two new nodes. I'm not sure if the
>> large number of TCP connections is part of the performance enhancements
>> between Giant and Firefly, or if Firefly uses a similar number of
>> connections.
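>>
>> (If anyone wants to compare on Firefly, a rough way to count the
>> established connections held by a single osd daemon, assuming ss from
>> iproute2 is available and running it as root so the process names show up,
>> is something like:)
>>
>> ss -tnp state established | grep ceph-osd | wc -l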
>>
>> I highlighted the tuning parameters that I suspect are required.
>>
>>
>> ceph@osd6:/var/log/ceph$ cat /etc/sysctl.conf
>> #
>> # /etc/sysctl.conf - Configuration file for setting system variables
>> # See /etc/sysctl.d/ for additional system variables
>> # See sysctl.conf (5) for information.
>> #
>>
>> # Increase Linux autotuning TCP buffer limits
>> # Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
>> # Don't set tcp_mem itself! Let the kernel scale it based on RAM.
>> net.core.rmem_max = 56623104
>> net.core.wmem_max = 56623104
>> net.core.rmem_default = 56623104
>> net.core.wmem_default = 56623104
>> net.core.optmem_max = 40960
>> net.ipv4.tcp_rmem = 4096 87380 56623104
>> net.ipv4.tcp_wmem = 4096 65536 56623104
>>
>> # Make room for more TIME_WAIT sockets due to more clients,
>> # and allow them to be reused if we run out of sockets
>> # Also increase the max packet backlog
>> net.core.somaxconn = 1024
>> *net.core.netdev_max_backlog = 50000*
>> *net.ipv4.tcp_max_syn_backlog = 30000*
>> *net.ipv4.tcp_max_tw_buckets = 2000000*
>> *net.ipv4.tcp_tw_reuse = 1*
>> *net.ipv4.tcp_fin_timeout = 10*
>>
>> # Disable TCP slow start on idle connections
>> net.ipv4.tcp_slow_start_after_idle = 0
>>
>> # If your servers talk UDP, also up these limits
>> net.ipv4.udp_rmem_min = 8192
>> net.ipv4.udp_wmem_min = 8192
>>
>> # Disable source routing and redirects
>> net.ipv4.conf.all.send_redirects = 0
>> net.ipv4.conf.all.accept_redirects = 0
>> net.ipv4.conf.all.accept_source_route = 0
>>
>>
>>
>>
>> I added the net.core.somaxconn setting after this experience, since the
>> default is 128.  It controls the allowed socket listen backlog in the
>> kernel, which should help when I reboot an osd node and 1300 connections
>> need to be made quickly.
>>
>> I found that I needed to restart my osds after applying the kernel tuning
>> above for my cluster to stabilize.
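>>
>> (The restart requirement makes sense for somaxconn at least: the listen
>> backlog is capped when the daemon calls listen(), so a new somaxconn value
>> only applies to sockets opened after the change. Re-reading the file itself
>> is just:)
>>
>> sysctl -p /etc/sysctl.conf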
>>
>> My system is now stable again and performs very well.
>>
>> I hope this helps someone out.  It took me a few days of troubleshooting
>> to get to the bottom of this.
>>
>>
>> Jake
>>
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
