Sage,

Yes, I already set the max open file in ceph.conf. And I also checked in OS. 
Any suggestions?

# cat /proc/sys/fs/file-max
3274834

[global]
auth supported = cephx
keyring = /etc/ceph/keyring.admin
max open files = 131072
osd pool default pg num = 1600
osd pool default pgp num = 1600

Wei Cao (Buddy)

-----Original Message-----
From: Sage Weil [mailto:s...@inktank.com] 
Sent: Thursday, June 5, 2014 11:11 PM
To: Cao, Buddy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph osd down and out

This usually happens on larger clusters when you hit the max fd limit.  
Add
        max open files = 131072

in the [global] section of ceph.conf to fix it (default is 16384).

sage


On Thu, 5 Jun 2014, Cao, Buddy wrote:

> 
> Hi,  several osds were down/out with similar logs as below, could you help?
> 
>  
> 
>   -38> 2014-06-05 10:27:54.700832 7f2ceead6700  1 --
> 192.168.40.11:6800/19542 <== osd.11 192.168.40.11:6822/20298 2 ====
> pg_notify(0.aa4(2) epoch 7) v5 ==== 812+0+0 (3873498789 0 0) 0x57a0540 
> con
> 0x49d14a0
> 
>    -37> 2014-06-05 10:27:54.700874 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.396463, event: header_read, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -36> 2014-06-05 10:27:54.701091 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.396465, event: throttled, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -35> 2014-06-05 10:27:54.701126 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.396543, event: all_read, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -34> 2014-06-05 10:27:54.701187 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.700870, event: dispatched, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -33> 2014-06-05 10:27:54.701235 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.701234, event: waiting_for_osdmap, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -32> 2014-06-05 10:27:54.701267 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.701267, event: started, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -31> 2014-06-05 10:27:54.700940 7f2ce92cb700  5 osd.0 pg_epoch: 7 
> pg[0.1dc1( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [69,56,21] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] exit Start 0.003321 0 0.000000
> 
>    -30> 2014-06-05 10:27:54.702163 7f2ceead6700  5 -- op tracker -- , seq:
> 2028, time: 2014-06-05 10:27:54.702163, event: done, request:
> pg_notify(0.aa4(2) epoch 7) v5
> 
>    -29> 2014-06-05 10:27:54.701690 7f2ce92cb700  5 osd.0 pg_epoch: 7 
> pg[0.1dc1( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [69,56,21] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started/Stray
> 
>    -28> 2014-06-05 10:27:54.697805 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 
> lpr=3
> crt=0'0 mlcod 0'0 active] exit Started/Primary/Active 7.527967 0 
> 0.000000
> 
>    -27> 2014-06-05 10:27:54.702437 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 
> lpr=3
> crt=0'0 mlcod 0'0 active] exit Started/Primary 8.986091 0 0.000000
> 
>    -26> 2014-06-05 10:27:54.702467 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 
> lpr=3
> crt=0'0 mlcod 0'0 active] exit Started 8.986256 0 0.000000
> 
>    -25> 2014-06-05 10:27:54.702486 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 3/3/3) [0,100,64] r=0 
> lpr=3
> crt=0'0 mlcod 0'0 active] enter Reset
> 
>    -24> 2014-06-05 10:27:54.702822 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] exit Reset 0.000336 1 0.005574
> 
>    -23> 2014-06-05 10:27:54.702900 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started
> 
>    -22> 2014-06-05 10:27:54.703025 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Start
> 
>    -21> 2014-06-05 10:27:54.703235 7f2ce8aca700  1 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> 
>    -20> 2014-06-05 10:27:54.704919 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] exit Start 0.001894 0 0.000000
> 
>    -19> 2014-06-05 10:27:54.704835 7f2ceead6700  1 --
> 192.168.40.11:6800/19542 <== osd.15 192.168.40.11:6830/20581 1 ====
> pg_notify(2.60(2) epoch 7) v5 ==== 812+0+0 (3557830937 0 0) 0x4a6f540 
> con
> 0x3e3bb20
> 
>    -18> 2014-06-05 10:27:54.705050 7f2ceead6700  5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.598458, event: header_read, request:
> pg_notify(2.60(2) epoch 7) v5
> 
>    -17> 2014-06-05 10:27:54.705079 7f2ceead6700  5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.598461, event: throttled, request:
> pg_notify(2.60(2) epoch 7) v5
> 
>    -16> 2014-06-05 10:27:54.705124 7f2ce8aca700  5 osd.0 pg_epoch: 7 
> pg[1.98( empty local-les=4 n=0 ec=1 les/c 4/4 7/7/7) [54,95,68] r=-1 
> lpr=7
> pi=3-6/1 crt=0'0 inactive NOTIFY] enter Started/Stray
> 
>    -15> 2014-06-05 10:27:54.705353 7f2ceead6700  5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.598689, event: all_read, request:
> pg_notify(2.60(2) epoch 7) v5
> 
>    -14> 2014-06-05 10:27:54.706925 7f2ceead6700  5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.705046, event: dispatched, request:
> pg_notify(2.60(2) epoch 7) v5
> 
>    -13> 2014-06-05 10:27:54.707038 7f2ceead6700  5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.707037, event: waiting_for_osdmap, request:
> pg_notify(2.60(2) epoch 7) v5
> 
>    -12> 2014-06-05 10:27:54.707075 7f2ceead6700  5 -- op tracker -- , seq:
> 2029, time: 2014-06-05 10:27:54.707075, event: started, request:
> pg_notify(2.60(2) epoch 7) v5
> 
>    -11> 2014-06-05 10:27:54.708042 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60(unlocked)] enter Initial
> 
>    -10> 2014-06-05 10:27:54.709885 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=0
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Initial 0.001842 0 0.000000
> 
>     -9> 2014-06-05 10:27:54.711556 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=0
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Reset
> 
>     -8> 2014-06-05 10:27:54.711606 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000051 1 0.001718
> 
>     -7> 2014-06-05 10:27:54.711648 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started
> 
>     -6> 2014-06-05 10:27:54.711714 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Start
> 
>     -5> 2014-06-05 10:27:54.711781 7f2ceead6700  1 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to 
> Primary
> 
>     -4> 2014-06-05 10:27:54.713010 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] exit Start 0.001295 0 0.000000
> 
>     -3> 2014-06-05 10:27:54.713164 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
> 
>     -2> 2014-06-05 10:27:54.713234 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
> 
>     -1> 2014-06-05 10:27:54.713319 7f2ceead6700  5 osd.0 pg_epoch: 7 
> pg[2.60( empty local-les=0 n=0 ec=1 les/c 4/5 7/7/7) [0,117,49] r=0 
> lpr=7
> pi=1-6/2 crt=0'0 mlcod 0'0 peering] enter 
> Started/Primary/Peering/GetInfo
> 
>      0> 2014-06-05 10:27:54.711223 7f2ce92cb700 -1 common/Thread.cc: 
> In function 'void Thread::create(size_t)' thread 7f2ce92cb700 time 
> 2014-06-05
> 10:27:54.703693
> 
> common/Thread.cc: 110: FAILED assert(ret == 0)
> 
>  
> 
> ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003)
> 
> 1: (Thread::create(unsigned long)+0x8a) [0xa82dea]
> 
> 2: (SimpleMessenger::connect_rank(entity_addr_t const&, int, 
> Connection*,
> Message*)+0x17b) [0xa2b01b]
> 
> 3: (SimpleMessenger::get_connection(entity_inst_t const&)+0x180) 
> [0xa2f6c0]
> 
> 4: (OSDService::get_con_osd_cluster(int, unsigned int)+0x1d9) 
> [0x6061c9]
> 
> 5: (OSD::compat_must_dispatch_immediately(PG*)+0x296) [0x6064a6]
> 
> 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > 
> const&,
> ThreadPool::TPHandle&)+0x4a6) [0x6475d6]
> 
> 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > 
> const&,
> ThreadPool::TPHandle&)+0x16) [0x69dbf6]
> 
> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0xab4581]
> 
> 9: (ThreadPool::WorkThread::entry()+0x10) [0xab75c0]
> 
> 10: /lib64/libpthread.so.0() [0x3eebc079d1]
> 
> 11: (clone()+0x6d) [0x3eeb4e8b6d]
> 
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
> 
>  
> 
> --- logging levels ---
> 
>    0/ 5 none
> 
>    0/ 1 lockdep
> 
>    0/ 1 context
> 
>    1/ 1 crush
> 
>    1/ 5 mds
> 
>    1/ 5 mds_balancer
> 
>    1/ 5 mds_locker
> 
>    1/ 5 mds_log
> 
>    1/ 5 mds_log_expire
> 
>    1/ 5 mds_migrator
> 
>    0/ 1 buffer
> 
>    0/ 1 timer
> 
>    0/ 1 filer
> 
>    0/ 1 striper
> 
>    0/ 1 objecter
> 
>    0/ 5 rados
> 
>    0/ 5 rbd
> 
>    0/ 5 journaler
> 
>    0/ 5 objectcacher
> 
>    0/ 5 client
> 
>    0/ 5 osd
> 
>    0/ 5 optracker
> 
>    0/ 5 objclass
> 
>    1/ 3 filestore
> 
>    1/ 3 keyvaluestore
> 
>    1/ 3 journal
> 
>    0/ 5 ms
> 
>    1/ 5 mon
> 
>    0/10 monc
> 
>    1/ 5 paxos
> 
>    0/ 5 tp
> 
>    1/ 5 auth
> 
>    1/ 5 crypto
> 
>    1/ 1 finisher
> 
>    1/ 5 heartbeatmap
> 
>    1/ 5 perfcounter
> 
>    1/ 5 rgw
> 
>    1/ 5 javaclient
> 
>    1/ 5 asok
> 
>    1/ 1 throttle
> 
>   -2/-2 (syslog threshold)
> 
>   -1/-1 (stderr threshold)
> 
>   max_recent     10000
> 
>   max_new         1000
> 
>   log_file /var/log/ceph/ceph-osd.0.log
> 
> --- end dump of recent events ---
> 
> 2014-06-05 10:27:55.096296 7f2ca8628700  0 -- 192.168.40.11:6800/19542 
> >>
> 192.168.40.11:6808/19811 pipe(0x4adc100 sd=629 :6800 s=2 pgs=111 cs=1 
> l=0 c=0x4a0c780).fault with nothing to send, going to standby
> 
> 2014-06-05 10:27:55.129281 7f2ce92cb700 -1 *** Caught signal (Aborted) 
> **
> 
> in thread 7f2ce92cb700
> 
>  
> 
> ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003)
> 
> 1: /usr/bin/ceph-osd() [0x9aa211]
> 
> 2: /lib64/libpthread.so.0() [0x3eebc0f710]
> 
> 3: (gsignal()+0x35) [0x3eeb432925]
> 
> 4: (abort()+0x175) [0x3eeb434105]
> 
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3eed8bea5d]
> 
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
> 
>  
> 
> --- begin dump of recent events ---
> 
>     -6> 2014-06-05 10:27:54.716422 7f2cec2d1700  1 --
> 192.168.40.11:6801/19542 <== osd.119 192.168.40.14:0/29977 3 ==== 
> osd_ping(ping e7 stamp 2014-06-05 10:27:54.713281) v2 ==== 47+0+0 
> (243568413
> 0 0) 0x4a68380 con 0x3e3c200
> 
>     -5> 2014-06-05 10:27:54.716480 7f2cec2d1700  1 --
> 192.168.40.11:6801/19542 --> 192.168.40.14:0/29977 -- 
> osd_ping(ping_reply e7 stamp 2014-06-05 10:27:54.713281) v2 -- ?+0 
> 0x5436740 con 0x3e3c200
> 
>     -4> 2014-06-05 10:27:54.717351 7f2ced2d3700  1 --
> 192.168.50.11:6801/19542 <== osd.119 192.168.40.14:0/29977 3 ==== 
> osd_ping(ping e7 stamp 2014-06-05 10:27:54.713281) v2 ==== 47+0+0 
> (243568413
> 0 0) 0x4a68fc0 con 0x3e3c8e0
> 
>     -3> 2014-06-05 10:27:55.096211 7f2ca8628700  2 --
> 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 
> sd=629
> :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).reader couldn't read tag, (0) 
> Success
> 
>     -2> 2014-06-05 10:27:55.096258 7f2ca8628700  2 --
> 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 
> sd=629
> :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).fault (0) Success
> 
>     -1> 2014-06-05 10:27:55.096296 7f2ca8628700  0 --
> 192.168.40.11:6800/19542 >> 192.168.40.11:6808/19811 pipe(0x4adc100 
> sd=629
> :6800 s=2 pgs=111 cs=1 l=0 c=0x4a0c780).fault with nothing to send, 
> going to standby
> 
>      0> 2014-06-05 10:27:55.129281 7f2ce92cb700 -1 *** Caught signal
> (Aborted) **
> 
> in thread 7f2ce92cb700
> 
>  
> 
> ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003)
> 
> 1: /usr/bin/ceph-osd() [0x9aa211]
> 
> 2: /lib64/libpthread.so.0() [0x3eebc0f710]
> 
> 3: (gsignal()+0x35) [0x3eeb432925]
> 
> 4: (abort()+0x175) [0x3eeb434105]
> 
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x3eed8bea5d]
> 
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
> needed to interpret this.
> 
>  
> 
> --- logging levels ---
> 
>    0/ 5 none
> 
>    0/ 1 lockdep
> 
>    0/ 1 context
> 
>    1/ 1 crush
> 
>    1/ 5 mds
> 
>    1/ 5 mds_balancer
> 
>    1/ 5 mds_locker
> 
>    1/ 5 mds_log
> 
>    1/ 5 mds_log_expire
> 
>    1/ 5 mds_migrator
> 
>    0/ 1 buffer
> 
>    0/ 1 timer
> 
>    0/ 1 filer
> 
>    0/ 1 striper
> 
>    0/ 1 objecter
> 
>    0/ 5 rados
> 
>    0/ 5 rbd
> 
>    0/ 5 journaler
> 
>    0/ 5 objectcacher
> 
>    0/ 5 client
> 
>    0/ 5 osd
> 
>    0/ 5 optracker
> 
>    0/ 5 objclass
> 
>    1/ 3 filestore
> 
>    1/ 3 keyvaluestore
> 
>    1/ 3 journal
> 
>    0/ 5 ms
> 
>    1/ 5 mon
> 
>    0/10 monc
> 
>    1/ 5 paxos
> 
>    0/ 5 tp
> 
>    1/ 5 auth
> 
>    1/ 5 crypto
> 
>    1/ 1 finisher
> 
>    1/ 5 heartbeatmap
> 
>    1/ 5 perfcounter
> 
>    1/ 5 rgw
> 
>    1/ 5 javaclient
> 
>    1/ 5 asok
> 
>    1/ 1 throttle
> 
>   -2/-2 (syslog threshold)
> 
>   -1/-1 (stderr threshold)
> 
>   max_recent     10000
> 
>   max_new         1000
> 
>   log_file /var/log/ceph/ceph-osd.0.log
> 
> --- end dump of recent events ---
> 
>  
> 
>  
> 
> Wei Cao (Buddy)
> 
>  
> 
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to