Hi Sage,

I can reproduce it with logging enabled, but the log is quite big (30 MB+); where should I put it? From the log and from reading the code, it seems that, for an unclear reason, the OSD op thread (OSD::op_tp) is blocked for more than 15 s, which results in a timeout. Then, in the "handle_osd_ping" function, the ping request is dropped because "is_healthy" is false. I have a full log with debug osd = 20, but I still have no idea why the op thread would be blocked for that long. Part of the log is pasted below:
2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean]
2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 may_write
2013-05-15 15:29:22.117110 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.127345 7f0269b6d700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.276280 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.356731 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.359808 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0000000000000339 0 -> 1
2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3
2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.0:2109755 wrlock_by=unknown.0.0:0)
2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 @head
2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode is idle(wr=0)
2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com]
Sent: May 15, 2013 11:40
To: Chen, Xiaoxi
Cc: Mark Nelson; ceph-de...@vger.kernel.org; ceph-us...@ceph.com
Subject: RE: [ceph-users] OSD state flipping when cluster-network in high utilization

On Wed, 15 May 2013, Chen, Xiaoxi wrote:
> > How responsive generally is the machine under load?  Is there available CPU?
> The machine works well, and the problematic OSDs are likely the same
> ones; it seems to be because they have relatively slower disks (the disk
> type is the same but the latency is a bit higher, 8 ms -> 10 ms).
>
> Top shows no idle % but still 30+% io_wait; my colleague
> educated me that io_wait can be treated as free.
>
> Another piece of information: offloading the heartbeat to a 1 Gb NIC
> doesn't solve the problem. What's more, when we run a random write test
> we can still see this flipping happen. So I would say it may be related
> to the CPU scheduler: the heartbeat thread (in a busy OSD) fails to get
> enough CPU cycles.

Can you try reproducing with logs?  If we can reproduce that will give us
some clues.

sage

> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: May 15, 2013 7:23
> To: Chen, Xiaoxi
> Cc: Mark Nelson; ceph-de...@vger.kernel.org; ceph-us...@ceph.com
> Subject: Re: [ceph-users] OSD state flipping when cluster-network in
> high utilization
>
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea of leaving the ping in the cluster network because it
> > can help us detect switch/NIC failure.
> >
> > What confuses me is that I keep pinging every ceph node's cluster IP;
> > it is OK during the whole run with less than 1 ms latency, so why does
> > the heartbeat still suffer? TOP shows my CPU is not 100% utilized (with
> > ~30% io wait). Enabling jumbo frames **seems** to make things worse
> > (just a feeling, no data supports this).
>
> I say "ping" in the general sense.. it's not using ICMP, but sending small
> messages over a TCP session, doing some minimal processing on the other
> end, and sending them back.  If the machine is heavily loaded and that
> thread doesn't get scheduled or somehow blocks, it may be problematic.
>
> How responsive generally is the machine under load?  Is there available CPU?
>
> We can try to enable debugging to see what is going on.. 'debug ms = 1'
> and 'debug osd = 20' is everything we would need, but it will incur
> additional load itself and may spoil the experiment...
>
> sage
>
> >
> > Sent from my iPhone
> >
> > On 2013-5-14, at 23:36, "Mark Nelson" <mark.nel...@inktank.com> wrote:
> >
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>>
> > >>> Hi
> > >>>
> > >>> We are suffering our OSDs flipping between up and down (OSD X
> > >>> is voted down due to 3 missing pings, and after a while it
> > >>> tells the monitor "map xxx wrongly mark me down"). Because we
> > >>> are running a sequential write performance test on top of RBDs,
> > >>> the cluster network NICs are really in high utilization (8 Gb/s+ for
> > >>> a 10 Gb network).
> > >>>
> > >>> Is this expected behavior? Or how can I prevent it from happening?
> > >>
> > >> You can increase the heartbeat grace period.  The pings are
> > >> handled by a separate thread on the backside interface (if there is one).
> > >> If you are missing pings then the network or scheduler is
> > >> preventing those (small) messages from being processed (there is
> > >> almost no lock contention in that path).  Which means it really
> > >> is taking ~20 seconds or whatever to handle those messages.  It's
> > >> really a question of how unresponsive you want to permit the OSDs to be
> > >> before you consider it a failure..
> > >>
> > >> sage
> > >
> > > It might be worth testing out how long pings or other network traffic are
> > > taking during these tests.  There may be some tcp tuning you can do
> > > here, or even consider using a separate network for the mons.
> > >
> > > Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majord...@vger.kernel.org  More majordomo
> info at http://vger.kernel.org/majordomo-info.html
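To make the timeline concrete: in the log excerpt, the op thread logs nothing between 15:29:06.870871 and 15:29:22.513247, a gap of about 15.6 s, which is just past the 15 s internal thread timeout, so the health check fails and pings get dropped. A simplified sketch of that check (illustrative Python only, not the actual Ceph C++ code; the class and method names here are made up):

```python
import time
from datetime import datetime

# Gap between the last op-thread line before the stall and the first one
# after it, taken from the log excerpt above:
t1 = datetime.strptime("15:29:06.870871", "%H:%M:%S.%f")
t2 = datetime.strptime("15:29:22.513247", "%H:%M:%S.%f")
gap = (t2 - t1).total_seconds()
print(f"op thread silent for {gap:.2f} s")  # ~15.64 s, past the 15 s timeout

class HeartbeatMap:
    """Very rough model of the check behind the 'had timed out after 15'
    messages (hypothetical names, for illustration only)."""

    def __init__(self):
        self.threads = {}  # thread name -> (last_progress_time, grace)

    def touch(self, name, grace=15):
        # A worker thread records a timestamp as it makes progress.
        self.threads[name] = (time.monotonic(), grace)

    def is_healthy(self, now=None):
        # The ping-handling path checks this; once any registered thread
        # has been silent longer than its grace, the check fails and the
        # OSD stops answering heartbeats, so peers eventually report it down.
        now = time.monotonic() if now is None else now
        return all(now - last <= grace for last, grace in self.threads.values())

hb = HeartbeatMap()
hb.touch("OSD::op_tp thread")
print(hb.is_healthy())                            # True: thread just made progress
print(hb.is_healthy(now=time.monotonic() + gap))  # False: 15.64 s > 15 s grace
```

Note this 15 s per-thread timeout is a different knob from the monitor-side heartbeat grace (the ~20 s grace Sage suggests raising); if I read the thread right, the blocked op thread trips the first, which then causes the missed pings that the second is judged by.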
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com