Hi Sage,
        I can reproduce it with logging enabled, but the log is quite big (30MB+); where should I put it?
        From the log and from reading the code, it seems that, for a reason I have not been able to determine, the OSD op thread (OSD::op_tp) gets blocked for more than 15s, which results in a timeout. Then, in the "handle_osd_ping" function, the ping request is dropped because "is_healthy" is false.
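        To double-check my reading of that path, below is a tiny self-contained model of the logic as I understand it; the class and function names are my own simplification for illustration, not the actual Ceph source.

// Simplified model (illustrative names only): each worker thread registers a
// deadline in a heartbeat map; once any thread blows its deadline, is_healthy()
// reports false and incoming pings are dropped instead of being answered, so
// peers eventually count missed pings and report the OSD as failed.
#include <chrono>
#include <iostream>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

struct HeartbeatMap {
  std::map<std::string, Clock::time_point> deadlines;

  bool is_healthy(Clock::time_point now) const {
    for (const auto& [thread, deadline] : deadlines) {
      if (now > deadline) {
        std::cout << "heartbeat_map is_healthy '" << thread
                  << "' had timed out\n";
        return false;
      }
    }
    return true;
  }
};

// Model of the ping handler: the ping is answered only while every registered
// thread is within its deadline; otherwise it is silently dropped.
void handle_osd_ping(const HeartbeatMap& hb, const std::string& from) {
  if (!hb.is_healthy(Clock::now())) {
    std::cout << "dropping ping from " << from << " (not healthy)\n";
    return;
  }
  std::cout << "replying to ping from " << from << "\n";
}

int main() {
  HeartbeatMap hb;
  // Pretend the op thread armed a 15s deadline and then blocked past it.
  hb.deadlines["OSD::op_tp thread"] = Clock::now() - std::chrono::seconds(1);
  handle_osd_ping(hb, "osd.46");  // dropped, so the peer counts a missed ping
  return 0;
}
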
        I have a full log with debug osd = 20, but I still have no idea why the op thread is blocked for that long; part of the log is pasted below:

2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean]
2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 may_write
2013-05-15 15:29:22.117110 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.127345 7f0269b6d700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.276280 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.356731 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.359808 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0000000000000339 0 -> 1
2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3
2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.0:2109755 wrlock_by=unknown.0.0:0)
2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 @head
2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode is idle(wr=0)
2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)
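
        Reading the timestamps back: the op thread dequeued the write at 15:29:06.87 and does not log again until 15:29:22.51, so it was stuck for roughly 15.6 seconds in between, which matches the "had timed out after 15" lines from the heartbeat checks. If I understand the code correctly, the op thread re-arms its deadline in the heartbeat map before processing each op, so a single op that blocks longer than the 15s timeout makes is_healthy() fail until the thread comes back; raising that timeout (or the heartbeat grace you suggested earlier) would presumably only hide the stall rather than explain it. A minimal self-contained model of that behaviour, again with made-up names:

// Illustrative model only (not the actual Ceph classes): the worker re-arms a
// deadline before each queued op; while a stalled op blocks past the grace,
// any heartbeat check in that window sees the thread as timed out.
#include <chrono>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

struct ThreadHeartbeat {
  Clock::time_point deadline = Clock::time_point::max();

  void reset_timeout(std::chrono::seconds grace) {
    deadline = Clock::now() + grace;   // armed before processing each op
  }
  bool timed_out() const {
    return Clock::now() > deadline;    // what the health check looks at
  }
};

int main() {
  ThreadHeartbeat op_thread;
  const auto grace = std::chrono::seconds(2);  // stands in for the 15s grace

  // Normal case: the op finishes well inside the grace period.
  op_thread.reset_timeout(grace);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::cout << "fast op, timed_out=" << op_thread.timed_out() << "\n";    // 0

  // Stalled case: the op blocks past the grace (like the ~15.6s gap above),
  // so a concurrent health check would report the thread as timed out.
  op_thread.reset_timeout(grace);
  std::this_thread::sleep_for(grace + std::chrono::seconds(1));
  std::cout << "stalled op, timed_out=" << op_thread.timed_out() << "\n"; // 1
  return 0;
}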

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com] 
Sent: May 15, 2013 11:40
To: Chen, Xiaoxi
Cc: Mark Nelson; ceph-de...@vger.kernel.org; ceph-us...@ceph.com
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high 
utilization

On Wed, 15 May 2013, Chen, Xiaoxi wrote:

> >How responsive generally is the machine under load?  Is there available CPU?
>       The machine works well, and the OSDs that hit the issue are mostly the 
> same ones, seemingly because they have relatively slower disks (the disk type 
> is the same but the latency is a bit higher, 8ms -> 10ms).
>       
>       Top shows no idle %, but there is still 30+% iowait; my colleague tells 
> me that iowait can be treated as free CPU.
> 
> Another piece of information: offloading the heartbeat to a 1Gb NIC doesn't 
> solve the problem. What's more, when we run a random write test we can 
> still see this flipping happen. So I would say it may be related to the 
> CPU scheduler: the heartbeat thread (in a busy OSD) fails to get 
> enough CPU cycles.

Can you try reproducing with logs?  If we can reproduce it, that will give us 
some clues.

sage

> 
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: May 15, 2013 7:23
> To: Chen, Xiaoxi
> Cc: Mark Nelson; ceph-de...@vger.kernel.org; ceph-us...@ceph.com
> Subject: Re: [ceph-users] OSD state flipping when cluster-network in 
> high utilization
> 
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea of leaving the ping in the cluster network because it can 
> > help us detect switch/NIC failure.
> > 
> > What confuses me is that I keep pinging every ceph node's cluster IP, and it 
> > is OK during the whole run with less than 1 ms latency, so why does the 
> > heartbeat still suffer? TOP shows my CPU is not 100% utilized (with ~30% 
> > iowait). Enabling jumbo frames **seems** to make things worse (just a 
> > feeling, no data to support it).
> 
> I say "ping" in the general sense.. it's not using ICMP, but sending small 
> messages over a TCP session, doing some minimal processing on the other end, 
> and sending them back.  If the machine is heavily loaded and that thread 
> doesn't get scheduled or somehow blocks, it may be problematic.
> 
> How responsive generally is the machine under load?  Is there available CPU?
> 
> We can try to enable debugging to see what is going on.. 'debug ms = 1' 
> and 'debug osd = 20' is everything we would need, but will incur additional 
> load itself and may spoil the experiment...
> 
> sage
> 
> > 
> > Sent from my iPhone
> > 
> > On 2013-5-14, at 23:36, "Mark Nelson" <mark.nel...@inktank.com> wrote:
> > 
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>> 
> > >>> Hi
> > >>> 
> > >>>   We are seeing our OSDs flipping between up and down (OSD X 
> > >>> gets voted down due to 3 missed pings, and after a while it 
> > >>> tells the monitor "map xxx wrongly marked me down"). We 
> > >>> are running a sequential write performance test on top of RBDs, 
> > >>> and the cluster network NICs are under really high utilization (8Gb/s+ on 
> > >>> a 10Gb network).
> > >>> 
> > >>>          Is this expected behavior, or how can I prevent it from happening?
> > >> 
> > >> You can increase the heartbeat grace period.  The pings are 
> > >> handled by a separate thread on the backside interface (if there is one).
> > >> If you are missing pings then the network or scheduler is 
> > >> preventing those (small) messages from being processed (there is 
> > >> almost no lock contention in that path).  Which means it really 
> > >> is taking ~20 seconds or whatever to handle those messages.  It's 
> > >> really a question of how unresponsive you want to permit the OSDs to be 
> > >> before you consider it a failure..
> > >> 
> > >> sage
> > >> 
> > >> 
> > > 
> > > It might be worth testing out how long pings or other network traffic are 
> > > taking during these tests.  There may be some tcp tuning you can do 
> > > here, or even consider using a separate network for the mons.
> > > 
> > > Mark
> > 
