This usually happens on larger clusters when you hit the max fd limit. Add max open files = 131072
in the [global] section of ceph.conf to fix it (default is 16384). sage On Thu, 5 Jun 2014, Cao, Buddy wrote: > > Hi, several osds were down/out with similar logs as below, could you help? > > > > -38> 2014-06-05 10:27:54.700832 7f2ceead6700 1 -- > 192.168.40.11:6800/19542 <== osd.11 192.168.40.11:6822/20298 2 ==== > pg_notify(0.aa4(2) epoch 7) v5 ==== 812+0+0 (3873498789 0 0) 0x57a0540 con > 0x49d14a0 > > -37> 2014-06-05 10:27:54.700874 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.396463, event: header_read, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -36> 2014-06-05 10:27:54.701091 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.396465, event: throttled, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -35> 2014-06-05 10:27:54.701126 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.396543, event: all_read, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -34> 2014-06-05 10:27:54.701187 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.700870, event: dispatched, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -33> 2014-06-05 10:27:54.701235 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.701234, event: waiting_for_osdmap, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -32> 2014-06-05 10:27:54.701267 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.701267, event: started, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -31> 2014-06-05 10:27:54.700940 7f2ce92cb700 5 osd.0 pg_epoch: 7 > pg[0.1dc1( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [69,56,21] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] exit Start 0.003321 0 0.000000 > > -30> 2014-06-05 10:27:54.702163 7f2ceead6700 5 -- op tracker -- , seq: > 2028, time: 2014-06-05 10:27:54.702163, event: done, request: > pg_notify(0.aa4(2) epoch 7) v5 > > -29> 2014-06-05 10:27:54.701690 7f2ce92cb700 5 osd.0 pg_epoch: 7 > pg[0.1dc1( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [69,56,21] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started/Stray > > -28> 2014-06-05 10:27:54.697805 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3 > crt=0'0 mlcod 0'0 active] exit Started/Primary/Active 7.527967 0 0.000000 > > -27> 2014-06-05 10:27:54.702437 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3 > crt=0'0 mlcod 0'0 active] exit Started/Primary 8.986091 0 0.000000 > > -26> 2014-06-05 10:27:54.702467 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3 > crt=0'0 mlcod 0'0 active] exit Started 8.986256 0 0.000000 > > -25> 2014-06-05 10:27:54.702486 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 lpr=3 > crt=0'0 mlcod 0'0 active] enter Reset > > -24> 2014-06-05 10:27:54.702822 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] exit Reset 0.000336 1 0.005574 > > -23> 2014-06-05 10:27:54.702900 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started > > -22> 2014-06-05 10:27:54.703025 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] enter Start > > -21> 2014-06-05 10:27:54.703235 7f2ce8aca700 1 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray > > -20> 2014-06-05 10:27:54.704919 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] exit Start 0.001894 0 0.000000 > > -19> 2014-06-05 10:27:54.704835 7f2ceead6700 1 -- > 192.168.40.11:6800/19542 <== osd.15 192.168.40.11:6830/20581 1 ==== > pg_notify(2.60(2) epoch 7) v5 ==== 812+0+0 (3557830937 0 0) 0x4a6f540 con > 0x3e3bb20 > > -18> 2014-06-05 10:27:54.705050 7f2ceead6700 5 -- op tracker -- , seq: > 2029, time: 2014-06-05 10:27:54.598458, event: header_read, request: > pg_notify(2.60(2) epoch 7) v5 > > -17> 2014-06-05 10:27:54.705079 7f2ceead6700 5 -- op tracker -- , seq: > 2029, time: 2014-06-05 10:27:54.598461, event: throttled, request: > pg_notify(2.60(2) epoch 7) v5 > > -16> 2014-06-05 10:27:54.705124 7f2ce8aca700 5 osd.0 pg_epoch: 7 > pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 lpr=7 > pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started/Stray > > -15> 2014-06-05 10:27:54.705353 7f2ceead6700 5 -- op tracker -- , seq: > 2029, time: 2014-06-05 10:27:54.598689, event: all_read, request: > pg_notify(2.60(2) epoch 7) v5 > > -14> 2014-06-05 10:27:54.706925 7f2ceead6700 5 -- op tracker -- , seq: > 2029, time: 2014-06-05 10:27:54.705046, event: dispatched, request: > pg_notify(2.60(2) epoch 7) v5 > > -13> 2014-06-05 10:27:54.707038 7f2ceead6700 5 -- op tracker -- , seq: > 2029, time: 2014-06-05 10:27:54.707037, event: waiting_for_osdmap, request: > pg_notify(2.60(2) epoch 7) v5 > > -12> 2014-06-05 10:27:54.707075 7f2ceead6700 5 -- op tracker -- , seq: > 2029, time: 2014-06-05 10:27:54.707075, event: started, request: > pg_notify(2.60(2) epoch 7) v5 > > -11> 2014-06-05 10:27:54.708042 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60(unlocked)] enter Initial > > -10> 2014-06-05 10:27:54.709885 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=0 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Initial 0.001842 0 0.000000 > > -9> 2014-06-05 10:27:54.711556 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=0 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Reset > > -8> 2014-06-05 10:27:54.711606 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000051 1 0.001718 > > -7> 2014-06-05 10:27:54.711648 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started > > -6> 2014-06-05 10:27:54.711714 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Start > > -5> 2014-06-05 10:27:54.711781 7f2ceead6700 1 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary > > -4> 2014-06-05 10:27:54.713010 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Start 0.001295 0 0.000000 > > -3> 2014-06-05 10:27:54.713164 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started/Primary > > -2> 2014-06-05 10:27:54.713234 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering > > -1> 2014-06-05 10:27:54.713319 7f2ceead6700 5 osd.0 pg_epoch: 7 > pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 lpr=7 > pi=1-6/2 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo > > 0> 2014-06-05 10:27:54.711223 7f2ce92cb700 -1 common/Thread.cc: In > function 'void Thread::create(size_t)' thread 7f2ce92cb700 time 2014-06-05 > 10:27:54.703693 > > common/Thread.cc: 110: FAILED assert(ret == 0) > > > > ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003) > > 1: (Thread::create(unsigned long)+0x8a) [0xa82dea] > > 2: (SimpleMessenger::connect_rank(entity_addr_t const&, int, Connection*, > Message*)+0x17b) [0xa2b01b] > > 3: (SimpleMessenger::get_connection(entity_inst_t const&)+0x180) [0xa2f6c0] > > 4: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1d9) [0x6061c9] > > 5: (OSD::compat_must_dispatch_immediately(PG*)+0x296) [0x6064a6] > > 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, > ThreadPool::TPHandle&)+0x4a6) [0x6475d6] > > 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, > ThreadPool::TPHandle&)+0x16) [0x69dbf6] > > 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0xab4581] > > 9: (ThreadPool::WorkThread::entry()+0x10) [0xab75c0] > > 10: /lib64/libpthread.so.0() [0x3eebc079d1] > > 11: (clone()+0x6d) [0x3eeb4e8b6d] > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to > interpret this. > > > > --- logging levels --- > > 0/ 5 none > > 0/ 1 lockdep > > 0/ 1 context > > 1/ 1 crush > > 1/ 5 mds > > 1/ 5 mds_balancer > > 1/ 5 mds_locker > > 1/ 5 mds_log > > 1/ 5 mds_log_expire > > 1/ 5 mds_migrator > > 0/ 1 buffer > > 0/ 1 timer > > 0/ 1 filer > > 0/ 1 striper > > 0/ 1 objecter > > 0/ 5 rados > > 0/ 5 rbd > > 0/ 5 journaler > > 0/ 5 objectcacher > > 0/ 5 client > > 0/ 5 osd > > 0/ 5 optracker > > 0/ 5 objclass > > 1/ 3 filestore > > 1/ 3 keyvaluestore > > 1/ 3 journal > > 0/ 5 ms > > 1/ 5 mon > > 0/10 monc > > 1/ 5 paxos > > 0/ 5 tp > > 1/ 5 auth > > 1/ 5 crypto > > 1/ 1 finisher > > 1/ 5 heartbeatmap > > 1/ 5 perfcounter > > 1/ 5 rgw > > 1/ 5 javaclient > > 1/ 5 asok > > 1/ 1 throttle > > -2/-2 (syslog threshold) > > -1/-1 (stderr threshold) > > max_recent 10000 > > max_new 1000 > > log_file /var/log/ceph/ceph-osd.0.log > > --- end dump of recent events --- > > 2014-06-05 10:27:55.096296 7f2ca8628700 0 -- 192.168.40.11:6800/19542 >> > 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629 :6800 s=2 pgs=111 cs=1 l=0 > c=0x4a0c780).fault with nothing to send, going to standby > > 2014-06-05 10:27:55.129281 7f2ce92cb700 -1 *** Caught signal (Aborted) ** > > in thread 7f2ce92cb700 > > > > ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003) > > 1: /usr/bin/ceph-osd() [0x9aa211] > > 2: /lib64/libpthread.so.0() [0x3eebc0f710] > > 3: (gsignal()+0x35) [0x3eeb432925] > > 4: (abort()+0x175) [0x3eeb434105] > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3eed8bea5d] > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to > interpret this. > > > > --- begin dump of recent events --- > > -6> 2014-06-05 10:27:54.716422 7f2cec2d1700 1 -- > 192.168.40.11:6801/19542 <== osd.119 192.168.40.14:0/29977 3 ==== > osd_ping(ping e7 stamp 2014-06-05 10:27:54.713281) v2 ==== 47+0+0 (243568413 > 0 0) 0x4a68380 con 0x3e3c200 > > -5> 2014-06-05 10:27:54.716480 7f2cec2d1700 1 -- > 192.168.40.11:6801/19542 --> 192.168.40.14:0/29977 -- osd_ping(ping_reply e7 > stamp 2014-06-05 10:27:54.713281) v2 -- ?+0 0x5436740 con 0x3e3c200 > > -4> 2014-06-05 10:27:54.717351 7f2ced2d3700 1 -- > 192.168.50.11:6801/19542 <== osd.119 192.168.40.14:0/29977 3 ==== > osd_ping(ping e7 stamp 2014-06-05 10:27:54.713281) v2 ==== 47+0+0 (243568413 > 0 0) 0x4a68fc0 con 0x3e3c8e0 > > -3> 2014-06-05 10:27:55.096211 7f2ca8628700 2 -- > 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629 > :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).reader couldn't read tag, (0) > Success > > -2> 2014-06-05 10:27:55.096258 7f2ca8628700 2 -- > 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629 > :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).fault (0) Success > > -1> 2014-06-05 10:27:55.096296 7f2ca8628700 0 -- > 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629 > :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).fault with nothing to send, going to > standby > > 0> 2014-06-05 10:27:55.129281 7f2ce92cb700 -1 *** Caught signal > (Aborted) ** > > in thread 7f2ce92cb700 > > > > ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003) > > 1: /usr/bin/ceph-osd() [0x9aa211] > > 2: /lib64/libpthread.so.0() [0x3eebc0f710] > > 3: (gsignal()+0x35) [0x3eeb432925] > > 4: (abort()+0x175) [0x3eeb434105] > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3eed8bea5d] > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to > interpret this. > > > > --- logging levels --- > > 0/ 5 none > > 0/ 1 lockdep > > 0/ 1 context > > 1/ 1 crush > > 1/ 5 mds > > 1/ 5 mds_balancer > > 1/ 5 mds_locker > > 1/ 5 mds_log > > 1/ 5 mds_log_expire > > 1/ 5 mds_migrator > > 0/ 1 buffer > > 0/ 1 timer > > 0/ 1 filer > > 0/ 1 striper > > 0/ 1 objecter > > 0/ 5 rados > > 0/ 5 rbd > > 0/ 5 journaler > > 0/ 5 objectcacher > > 0/ 5 client > > 0/ 5 osd > > 0/ 5 optracker > > 0/ 5 objclass > > 1/ 3 filestore > > 1/ 3 keyvaluestore > > 1/ 3 journal > > 0/ 5 ms > > 1/ 5 mon > > 0/10 monc > > 1/ 5 paxos > > 0/ 5 tp > > 1/ 5 auth > > 1/ 5 crypto > > 1/ 1 finisher > > 1/ 5 heartbeatmap > > 1/ 5 perfcounter > > 1/ 5 rgw > > 1/ 5 javaclient > > 1/ 5 asok > > 1/ 1 throttle > > -2/-2 (syslog threshold) > > -1/-1 (stderr threshold) > > max_recent 10000 > > max_new 1000 > > log_file /var/log/ceph/ceph-osd.0.log > > --- end dump of recent events --- > > > > > > Wei Cao (Buddy) > > > > >
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com