On 08/05/2014 08:42 AM, Mariusz Gronczewski wrote:
On Mon, 04 Aug 2014 15:32:50 -0500, Mark Nelson
<mark.nel...@inktank.com> wrote:

On 08/04/2014 03:28 PM, Chris Kitzmiller wrote:
On Aug 1, 2014, at 1:31 PM, Mariusz Gronczewski wrote:
I'm getting weird stalling during writes; sometimes I get the same write
speed for a few minutes, and after some time it starts stalling at 0 MB/s
for minutes.

I'm getting very similar behavior on my cluster. My writes start well but then just kinda 
stop for a while and then bump along slowly until the bench finishes. I've got a thread 
about it going here called "Ceph runs great then falters".

This kind of behaviour often results when the journal can write much
faster than the OSD data disks.  Initially the journals can absorb the
data and things run along well, but eventually Ceph has to stall writes
once the journal gets too far ahead of the data disks.  You may want to
take a look at what's happening on the data disks during your tests to
see if anything looks suspect.  Checking the admin socket for
dump_historic_ops might provide some clues as well.
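
If it helps, here is one rough way to watch the journal and data
partitions side by side while a bench is running.  This is only a sketch:
the device names below (sda1 for the journal partition, sda2 for the data
partition) are placeholders, so substitute whatever your OSDs actually use.

    #!/usr/bin/env python
    # Sample /proc/diskstats once a second and print MB/s written to the
    # journal and data partitions.  Device names are placeholders.
    import time

    DEVICES = ["sda1", "sda2"]   # journal partition, data partition (assumed)
    SECTOR = 512                 # /proc/diskstats counts 512-byte sectors

    def sectors_written():
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # fields: major minor name reads ...; sectors written is field 9
                if fields[2] in DEVICES:
                    stats[fields[2]] = int(fields[9])
        return stats

    prev = sectors_written()
    while True:
        time.sleep(1)
        cur = sectors_written()
        print("  ".join("%s: %.1f MB/s" % (d, (cur[d] - prev[d]) * SECTOR / 1e6)
                        for d in DEVICES))
        prev = cur

If the data partition sits idle for long stretches while the bench reports
0 MB/s, that points away from a journal backlog and toward something
upstream of the disk.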

Mark

I did check the journals already; they are on the same disk as the data
(separate partition), and during stalls there is no traffic to either of
them (around 8 IOPS on average with 0% I/O wait).

This may indicate that one OSD is backing up, with possibly most if not all IOs waiting on it. The idea here is that because data placement is deterministic, if one OSD is slow, then over time, just by random chance, all outstanding client operations will back up behind it. Having more concurrency gives you more wiggle room, but may not ultimately solve it.

It's also possible that something else may be causing the OSDs to wait. dump_historic_ops might help.
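
For what it's worth, something along these lines can pull the slowest
recent ops out of an OSD's admin socket.  This is only a sketch: the
socket path and the exact JSON field names ("ops" vs "Ops", "duration",
"description") vary between Ceph releases, so treat those as assumptions
and adjust for whatever your version emits.

    #!/usr/bin/env python
    # Print the slowest recent ops recorded by one OSD via its admin socket.
    # Socket path and JSON field names are assumptions; check your release.
    import json
    import subprocess

    OSD_ID = 2   # one of the OSDs implicated in the slow request warnings

    out = subprocess.check_output(
        ["ceph", "--admin-daemon",
         "/var/run/ceph/ceph-osd.%d.asok" % OSD_ID,
         "dump_historic_ops"])
    data = json.loads(out.decode("utf-8"))

    # Older releases use "Ops", newer ones "ops".
    ops = data.get("ops", data.get("Ops", []))
    ops.sort(key=lambda op: float(op.get("duration", 0)), reverse=True)

    for op in ops[:5]:
        print("%8.3fs  %s" % (float(op.get("duration", 0)),
                              op.get("description", "?")))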

I'm 99.99% sure they are OK, as they ran for a few months in a Cassandra
cluster before, and when I was doing some Ceph rebalancing and
adding/removing OSDs they worked continuously with no stalls.

If I look into the logs of the OSDs that have slow ops, I get something
like this:

     2014-08-05 15:31:33.461481 7fbff4fd3700  0 log [WRN] : slow request 
30.147566 seconds old, received at 2014-08-05 15:31:03.313858: 
osd_op(client.190830.0:176 
benchmark_data_blade103.non.3dart.com_31565_object175 [write 0~4194304] 
7.16da2754 ack+ondisk+write e864) v4 currently waiting for subops from 2,6
     2014-08-05 15:32:03.467775 7fbff4fd3700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 60.153871 secs
     2014-08-05 15:32:03.467794 7fbff4fd3700  0 log [WRN] : slow request 
60.153871 seconds old, received at 2014-08-05 15:31:03.313858: 
osd_op(client.190830.0:176 
benchmark_data_blade103.non.3dart.com_31565_object175 [write 0~4194304] 
7.16da2754 ack+ondisk+write e864) v4 currently waiting for subops from 2,6
     2014-08-05 15:33:03.481163 7fbff4fd3700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 120.167272 secs
     2014-08-05 15:33:03.481170 7fbff4fd3700  0 log [WRN] : slow request 
120.167272 seconds old, received at 2014-08-05 15:31:03.313858: 
osd_op(client.190830.0:176 
benchmark_data_blade103.non.3dart.com_31565_object175 [write 0~4194304] 
7.16da2754 ack+ondisk+write e864) v4 currently waiting for subops from 2,6
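
To put a number on the "one slow OSD" theory, something like this can
tally which OSDs the warnings above are waiting on.  Rough sketch only:
the log path is a placeholder and the regex assumes the "waiting for
subops from X,Y" format quoted above.

    #!/usr/bin/env python
    # Count how often each OSD shows up in "waiting for subops from X,Y"
    # warnings, to see whether one OSD is implicated far more than the rest.
    import re
    from collections import Counter

    LOG = "/var/log/ceph/ceph.log"   # placeholder; point at your OSD/cluster log
    pattern = re.compile(r"waiting for subops from ([0-9,]+)")

    counts = Counter()
    with open(LOG) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                for osd in m.group(1).split(","):
                    counts[osd] += 1

    for osd, n in counts.most_common():
        print("osd.%s  %d slow subops" % (osd, n))

If one OSD dominates the count, that is the disk (or host) to stare at; if
the counts are spread evenly, the problem is more likely network- or
cluster-wide.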

but when I look at osd.2, I only get cryptic messages like:

     2014-08-05 14:39:34.708788 7ff5d36f4700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.24:6800/3540 pipe(0x54ada00 sd=157 :6800 s=2 pgs=26 cs=5 l=0 
c=0x5409340).fault with nothing to send, going to standby
     2014-08-05 14:39:35.594447 7ff5d0ecc700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.25:6800/3551 pipe(0x41fd500 sd=141 :60790 s=2 pgs=21 cs=5 l=0 
c=0x51bd960).fault with nothing to send, going to standby
     2014-08-05 14:39:37.594901 7ff5d40fe700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.25:6802/3709 pipe(0x54adc80 sd=149 :35693 s=2 pgs=24 cs=5 l=0 
c=0x51bee00).fault with nothing to send, going to standby
     2014-08-05 14:39:53.891172 7ff5d15d3700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.24:6802/3694 pipe(0x54a8500 sd=137 :60823 s=2 pgs=24 cs=5 l=0 
c=0x5409080).fault with nothing to send, going to standby
     2014-08-05 14:40:01.410307 7ff5d4905700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.23:6802/3689 pipe(0x41f9180 sd=133 :39288 s=2 pgs=28 cs=5 l=0 
c=0x4956f60).fault with nothing to send, going to standby
     2014-08-05 14:40:02.428402 7ff5d1ddb700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.23:6800/3535 pipe(0x41fee00 sd=145 :34278 s=2 pgs=23 cs=5 l=0 
c=0x51bad60).fault with nothing to send, going to standby
     2014-08-05 15:30:50.752795 7ff5d29e7700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.25:6800/3551 pipe(0x54ad780 sd=153 :6800 s=0 pgs=0 cs=0 l=0 
c=0x51ba260).accept connect_seq 6 vs existing 5 state standby
     2014-08-05 15:30:51.475488 7ff5d2cea700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.25:6802/3709 pipe(0x6002080 sd=158 :6800 s=0 pgs=0 cs=0 l=0 
c=0x51b94a0).accept connect_seq 6 vs existing 5 state standby
     2014-08-05 15:30:51.600661 7ff5d32f0700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.24:6802/3694 pipe(0x6001e00 sd=149 :6800 s=0 pgs=0 cs=0 l=0 
c=0x4955280).accept connect_seq 6 vs existing 5 state standby
     2014-08-05 15:30:51.901849 7ff5d18d6700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.23:6800/3535 pipe(0x6001180 sd=22 :6800 s=0 pgs=0 cs=0 l=0 
c=0x4956720).accept connect_seq 6 vs existing 5 state standby
     2014-08-05 15:30:54.529900 7ff5d0cca700  0 -- 10.100.245.22:6800/3540 >> 
10.100.245.23:6802/3689 pipe(0x54ae400 sd=42 :6800 s=0 pgs=0 cs=0 l=0 
c=0x51b91e0).accept connect_seq 6 vs existing 5 state standby


and similar messages on the other implicated OSD:

     2014-08-05 14:39:13.606907 7f34d2140700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.21:6802/3686 pipe(0x4c4a800 sd=183 :6800 s=2 pgs=32 cs=5 l=0 
c=0x3848580).fault with nothing to send, going to standby
     2014-08-05 14:39:34.709150 7f34d3958700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.22:6800/3540 pipe(0x39f9b80 sd=166 :43213 s=2 pgs=27 cs=5 l=0 
c=0x38494a0).fault with nothing to send, going to standby
     2014-08-05 14:39:38.693139 7f34cee0e700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.25:6800/3551 pipe(0x3291b80 sd=167 :51687 s=2 pgs=23 cs=5 l=0 
c=0x384a680).fault with nothing to send, going to standby
     2014-08-05 14:39:38.712965 7f34d2b4a700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.21:6800/3532 pipe(0x4c4ad00 sd=195 :6800 s=2 pgs=31 cs=5 l=0 
c=0x3848160).fault with nothing to send, going to standby
     2014-08-05 14:39:42.694641 7f34d4961700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.25:6802/3709 pipe(0x4c4aa80 sd=179 :6800 s=2 pgs=21 cs=5 l=0 
c=0x3848b00).fault with nothing to send, going to standby
     2014-08-05 14:39:42.712340 7f34d3b5a700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.22:6802/3694 pipe(0x39f9680 sd=33 :47495 s=2 pgs=31 cs=5 l=0 
c=0x384ee00).fault with nothing to send, going to standby
     2014-08-05 14:39:44.493155 7f34d3453700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.23:6800/3535 pipe(0x4c48c80 sd=202 :6800 s=2 pgs=29 cs=5 l=0 
c=0x3848f20).fault with nothing to send, going to standby
     2014-08-05 14:39:59.387037 7f34d4c64700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.23:6802/3689 pipe(0x4c49b80 sd=34 :6800 s=2 pgs=25 cs=5 l=0 
c=0x3849600).fault with nothing to send, going to standby
     2014-08-05 15:30:51.911827 7f34cec0c700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.22:6802/3694 pipe(0x39fc880 sd=116 :6800 s=0 pgs=0 cs=0 l=0 
c=0x3b714a0).accept connect_seq 6 vs existing 5 state standby
     2014-08-05 15:30:52.276484 7f34d2241700  0 -- 10.100.245.24:6800/3540 >> 
10.100.245.22:6800/3540 pipe(0x39fc100 sd=118 :6800 s=0 pgs=0 cs=0 l=0 
c=0x3b73b20).accept connect_seq 6 vs existing 5 state standby

and the 'bad' OSDs change when I rerun the benchmark, so it's not as if there
are one or two bad nodes.


I've checked for network and I/O load on every node and they are just not
doing anything, there are no kernel errors, and those nodes worked fine under
load when they were in use before.

I'm guessing the issue is probably going to be more subtle than that, unfortunately. At least based on prior issues, it often seems like something is introducing latency in some part of the system, and when that happens it can have very far-reaching effects.




--
Mariusz Gronczewski, Administrator

Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczew...@efigence.com


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
