----- Message from Gregory Farnum <g...@inktank.com> ---------
Date: Wed, 21 May 2014 15:46:17 -0700
From: Gregory Farnum <g...@inktank.com>
Subject: Re: [ceph-users] Expanding pg's of an erasure coded pool
To: Kenneth Waegeman <kenneth.waege...@ugent.be>
Cc: ceph-users <ceph-users@lists.ceph.com>
On Wed, May 21, 2014 at 3:52 AM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:
Thanks! I increased the max processes limit for all daemons quite a lot
(up to ulimit -u 3802720). These are the limits for the daemons now:
[root@ ~]# cat /proc/17006/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             3802720              3802720              processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       95068                95068                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
But this didn't help. Are there other parameters I should change?
Hrm, is it exactly the same stack trace? You might need to bump the
open files limit as well, although I'd be surprised. :/
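For reference, the ceilings that bound thread creation can all be checked in one place. This is only a sketch: the pgrep target and the fallback to the current shell's PID are illustrative, so the commands run on any Linux host.

```shell
# Inspect the per-process limits of a running OSD; fall back to this
# shell's PID if no ceph-osd process is found (illustrative only).
pid=$(pgrep -o ceph-osd 2>/dev/null || echo $$)
grep -E 'Max (processes|open files)' "/proc/${pid}/limits"

# Kernel-wide ceilings that apply regardless of per-process ulimits:
cat /proc/sys/kernel/threads-max \
    /proc/sys/kernel/pid_max \
    /proc/sys/vm/max_map_count
```

Note that the soft ulimits only matter per process; the kernel-wide values above cap the whole host.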
I increased the open file limit to 128000 as a test; still the same results.
Stack trace:
-16> 2014-05-22 11:10:05.262456 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6293/6310/6293)
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14
lpr=6310 pi=6293-6309/1 crt=0'0 active] exit
Started/ReplicaActive/RepNotRecovering 52.314752 4 0.000408
-15> 2014-05-22 11:10:05.262649 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6293/6310/6293)
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14
lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started/ReplicaActive
52.315020 0 0.000000
-14> 2014-05-22 11:10:05.262667 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6293/6310/6293)
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14
lpr=6310 pi=6293-6309/1 crt=0'0 active] exit Started 55.181842 0
0.000000
-13> 2014-05-22 11:10:05.262681 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6293/6310/6293)
[255,52,147,15,402,280,129,321,125,180,301,85,22,340,398] r=14
lpr=6310 pi=6293-6309/1 crt=0'0 active] enter Reset
-12> 2014-05-22 11:10:05.262797 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6327/6327/6327)
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] exit Reset 0.000117 1
0.000338
-11> 2014-05-22 11:10:05.262956 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6327/6327/6327)
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Started
-10> 2014-05-22 11:10:05.262983 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6327/6327/6327)
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Start
-9> 2014-05-22 11:10:05.262994 7f3bfcaee700 1 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6327/6327/6327)
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] state<Start>:
transitioning to Stray
-8> 2014-05-22 11:10:05.263151 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6327/6327/6327)
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] exit Start 0.000169 0
0.000000
-7> 2014-05-22 11:10:05.263385 7f3bfcaee700 5 osd.398 pg_epoch:
6327 pg[16.8f5s14( empty local-les=6326 n=0 ec=6293 les/c 6326/6326
6327/6327/6327)
[200,176,57,135,107,426,234,409,264,280,338,381,317,220,79] r=-1
lpr=6327 pi=6293-6326/2 crt=0'0 inactive NOTIFY] enter Started/Stray
-6> 2014-05-22 11:10:05.264331 7f3bfcaee700 1 --
130.246.178.141:6928/31869 --> 130.246.179.192:6910/26664 --
pg_notify(16.8f5s14(2) epoch 6327) v5 -- ?+0 0x6a396c0 con 0x5db77a0
-5> 2014-05-22 11:10:05.264551 7f3bfcaee700 1 --
130.246.178.141:6928/31869 --> 130.246.179.192:6829/30080 --
pg_notify(16.b51s1(2) epoch 6327) v5 -- ?+0 0x6964280 con 0x43f11e0
-4> 2014-05-22 11:10:05.264894 7f3bfcaee700 1 --
130.246.178.141:6928/31869 --> 130.246.179.122:6819/31185 --
pg_notify(16.f2es11(2) epoch 6327) v5 -- ?+0 0x790cd00 con 0x50ed280
-3> 2014-05-22 11:10:05.313524 7f3bfcaee700 1 --
130.246.178.141:6928/31869 --> 130.246.179.122:6819/31185 --
pg_notify(16.e4es3(2) epoch 6327) v5 -- ?+0 0x682b9c0 con 0x50ed280
-2> 2014-05-22 11:10:05.314115 7f3bfcaee700 1 --
130.246.178.141:6928/31869 --> 130.246.179.182:6864/15414 --
pg_notify(16.c07s10(2) epoch 6327) v5 -- ?+0 0x790de80 con 0x537a940
-1> 2014-05-22 11:10:05.314420 7f3bfcaee700 1 --
130.246.178.141:6928/31869 --> 130.246.179.192:6910/26664 --
pg_notify(16.8f5s14(2) epoch 6327) v5 -- ?+0 0x71ee740 con 0x5db77a0
0> 2014-05-22 11:10:05.322346 7f3c010f5700 -1 common/Thread.cc:
In function 'void Thread::create(size_t)' thread 7f3c010f5700 time
2014-05-22 11:10:05.320827
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
3: (Accepter::entry()+0x265) [0xb3ca45]
4: (()+0x79d1) [0x7f3c16e559d1]
5: (clone()+0x6d) [0x7f3c15b90b6d]
2014-05-22 11:10:05.335487 7f3bffcf3700 -1 common/Thread.cc: In
function 'void Thread::create(size_t)' thread 7f3bffcf3700 time
2014-05-22 11:10:05.334302
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0xa2a6aa]
3: (Accepter::entry()+0x265) [0xb3ca45]
4: (()+0x79d1) [0x7f3c16e559d1]
5: (clone()+0x6d) [0x7f3c15b90b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
2014-05-22 11:10:05.718101 7f3b99324700 0 --
130.246.178.141:6928/31869 >> 130.246.178.141:6817/15419
pipe(0x673eb80 sd=896 :6928 s=0 pgs=0 cs=0 l=0
c=0x68e5c20).accept connect_seq 0 vs existing 0 state connecting
2014-05-22 11:10:05.720679 7f3b99324700 -1 common/Thread.cc: In
function 'void Thread::create(size_t)' thread 7f3b99324700 time
2014-05-22 11:10:05.719300
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
2: (Pipe::accept()+0x493a) [0xb305fa]
3: (Pipe::reader()+0x1b9e) [0xb3350e]
4: (Pipe::Reader::entry()+0xd) [0xb359ed]
5: (()+0x79d1) [0x7f3c16e559d1]
6: (clone()+0x6d) [0x7f3c15b90b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
2014-05-22 11:10:06.033702 7f3b99223700 0 --
130.246.178.141:6928/31869 >> 130.246.179.168:6906/13921
pipe(0x673a800 sd=897 :6928 s=0 pgs=0 cs=0 l=0
c=0x68e7d20).accept connect_seq 0 vs existing 0 state wait
--- begin dump of recent events ---
-12> 2014-05-22 11:10:05.322880 7f3c100ba700 5 osd.398 6327 tick
-11> 2014-05-22 11:10:05.325790 7f3c100ba700 -1 common/Thread.cc:
In function 'void Thread::create(size_t)' thread 7f3c100ba700 time
2014-05-22 11:10:05.323100
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (Thread::create(unsigned long)+0x8a) [0xa83f8a]
2: (SimpleMessenger::connect_rank(entity_addr_t const&, int,
Connection*, Message*)+0x17b) [0xa2c1bb]
3: (SimpleMessenger::get_connection(entity_inst_t const&)+0x180) [0xa30860]
4: (OSDService::get_con_osd_hb(int, unsigned int)+0x189) [0x5fe3d9]
5: (OSD::_add_heartbeat_peer(int)+0xa9) [0x5fe6b9]
6: (OSD::maybe_update_heartbeat_peers()+0x7be) [0x60b59e]
7: (OSD::tick()+0x217) [0x64fe17]
8: (Context::complete(int)+0x9) [0x660579]
9: (SafeTimer::timer_thread()+0x453) [0xab20a3]
10: (SafeTimerThread::entry()+0xd) [0xab425d]
11: (()+0x79d1) [0x7f3c16e559d1]
12: (clone()+0x6d) [0x7f3c15b90b6d]
Monitor log:
-14> 2014-05-22 11:19:33.849633 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== mon.2 130.246.179.191:6789/0 62148 ====
forward(mon_command({"prefix": "osd crush create-or-move", "args":
["host=gdss523", "root=default"], "id": 401, "weight":
1.8200000000000001} v 0) v1 caps allow profile osd tid 60304
con_features 8796093022207) to leader v2 ==== 429+0+0 (3816634502 0 0) 0x26c77580
con 0x2361760
-13> 2014-05-22 11:19:33.849726 7fa7d9a8e700 0
mon.gdss514@0(leader) e2 handle_command mon_command({"prefix": "osd
crush create-or-move", "args": ["host=gdss523", "root=default"],
"id": 401, "weight": 1.8200000000000001} v 0) v1
-12> 2014-05-22 11:19:33.849815 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== mon.2 130.246.179.191:6789/0 62149 ====
forward(osd_alive(want up_thru 6383 have 6401) v1 caps allow profile
osd tid 60305 con_features 8796093022207) to leader v2 ==== 288+0+0
(3476969871 0 0) 0x2a9cb480
con 0x2361760
-11> 2014-05-22 11:19:33.849869 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== mon.2 130.246.179.191:6789/0 62150 ====
forward(osd_alive(want up_thru 6402 have 6402) v1 caps allow profile
osd tid 60306 con_features 8796093022207) to leader v2 ==== 288+0+0
(4190502222 0 0) 0x36f2f80
con 0x2361760
-10> 2014-05-22 11:19:33.849892 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== mon.2 130.246.179.191:6789/0 62151 ====
forward(osd_pgtemp(e6402 {2.24=[]} v6402) v1 caps allow profile osd
tid 60307 con_features 8796093022207) to leader v2 ==== 313+0+0
(154505712 0 0) 0x2500f00 con 0x2361760
-9> 2014-05-22 11:19:33.849912 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== mon.2 130.246.179.191:6789/0 62152 ====
forward(osd_pgtemp(e6402 {2.24=[]} v6402) v1 caps allow profile osd
tid 60308 con_features 8796093022207) to leader v2 ==== 313+0+0
(2579287395 0 0) 0x29abaa80 con 0x2361760
-8> 2014-05-22 11:19:33.849954 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== mon.2 130.246.179.191:6789/0 62153 ====
forward(osd_alive(want up_thru 6402 have 6402) v1 caps allow profile
osd tid 60309 con_features 8796093022207) to leader v2 ==== 288+0+0
(28071569 0 0) 0x2b1d5f00 con 0x2361760
-7> 2014-05-22 11:19:34.063692 7fa7da48f700 5
mon.gdss514@0(leader).paxos(paxos updating c 42420..43081)
queue_proposal bl 42272 bytes; ctx = 0x29dc0470
-6> 2014-05-22 11:19:34.063725 7fa7da48f700 5
mon.gdss514@0(leader).paxos(paxos updating c 42420..43081)
propose_new_value not active; proposal queued
-5> 2014-05-22 11:19:34.422715 7fa7d9a8e700 1 --
130.246.179.122:6789/0 <== osd.76 130.246.179.181:6880/10080 10 ====
mon_subscribe({monmap=3+,osd_pg_creates=0,osdmap=6399}) v2 ====
69+0+0 (1136537960 0 0) 0x217508c0 con 0x2ac740a0
-4> 2014-05-22 11:19:34.422782 7fa7d9a8e700 5
mon.gdss514@0(leader).osd e6402 send_incremental [6399..6402] to
osd.76 130.246.179.181:6880/10080
-3> 2014-05-22 11:19:34.423020 7fa7d9a8e700 1 --
130.246.179.122:6789/0 --> osd.76 130.246.179.181:6880/10080 --
osd_map(6399..6402 src has 5747..6402) v3
-- ?+0 0x27375c40
-2> 2014-05-22 11:19:34.649756 7fa7c8848700 2 --
130.246.179.122:6789/0 >> 130.246.179.88:6880/17558 pipe(0x2b5d2080
sd=32 :6789 s=2 pgs=1 cs=1 l=1 c=0x2ad40dc0).reader couldn't read
tag, (0) Success
-1> 2014-05-22 11:19:34.649822 7fa7c8848700 2 --
130.246.179.122:6789/0 >> 130.246.179.88:6880/17558 pipe(0x2b5d2080
sd=32 :6789 s=2 pgs=1 cs=1 l=1 c=0x2ad40dc0).fault (0) Success
0> 2014-05-22 11:19:35.186091 7fa7d908d700 -1 common/Thread.cc:
In function 'void Thread::create(size_t)' thread 7fa7d908d700 time
2014-05-22 11:19:33.914301
common/Thread.cc: 110: FAILED assert(ret == 0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: (Thread::create(unsigned long)+0x8a) [0x748c9a]
2: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
3: (Accepter::entry()+0x265) [0x863295]
4: (()+0x79d1) [0x7fa7dfdf89d1]
5: (clone()+0x6d) [0x7fa7deb33b6d]
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: /usr/bin/ceph-mon() [0x86b991]
2: (()+0xf710) [0x7fa7dfe00710]
3: (gsignal()+0x35) [0x7fa7dea7d925]
4: (abort()+0x175) [0x7fa7dea7f105]
5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fa7df337a5d]
6: (()+0xbcbe6) [0x7fa7df335be6]
7: (()+0xbcc13) [0x7fa7df335c13]
8: (()+0xbcd0e) [0x7fa7df335d0e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7f2) [0x7a5472]
10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
12: (Accepter::entry()+0x265) [0x863295]
13: (()+0x79d1) [0x7fa7dfdf89d1]
14: (clone()+0x6d) [0x7fa7deb33b6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
-4> 2014-05-22 11:19:35.237629 7fa7d2da1700 2 --
130.246.179.122:6789/0 >> 130.246.179.184:6880/30668 pipe(0x2b5d0500
sd=25 :6789 s=2 pgs=1 cs=1 l=1 c=0xd5b2aa0).reader couldn't read tag,
(0) Success
-3> 2014-05-22 11:19:35.237684 7fa7d2da1700 2 --
130.246.179.122:6789/0 >> 130.246.179.184:6880/30668 pipe(0x2b5d0500
sd=25 :6789 s=2 pgs=1 cs=1 l=1 c=0xd5b2aa0).fault (0) Success
-2> 2014-05-22 11:19:35.349080 7fa7d2341700 2 --
130.246.179.122:6789/0 >> 130.246.179.187:6895/9386 pipe(0xd256400
sd=37 :6789 s=2 pgs=1 cs=1 l=1 c=0xd5b7a60).reader couldn't read tag,
(0) Success
-1> 2014-05-22 11:19:35.349144 7fa7d2341700 2 --
130.246.179.122:6789/0 >> 130.246.179.187:6895/9386 pipe(0xd256400
sd=37 :6789 s=2 pgs=1 cs=1 l=1 c=0xd5b7a60).fault (0) Success
0> 2014-05-22 11:19:35.441249 7fa7d908d700 -1 *** Caught signal
(Aborted) **
in thread 7fa7d908d700
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
1: /usr/bin/ceph-mon() [0x86b991]
2: (()+0xf710) [0x7fa7dfe00710]
3: (gsignal()+0x35) [0x7fa7dea7d925]
4: (abort()+0x175) [0x7fa7dea7f105]
5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7fa7df337a5d]
6: (()+0xbcbe6) [0x7fa7df335be6]
7: (()+0xbcc13) [0x7fa7df335c13]
8: (()+0xbcd0e) [0x7fa7df335d0e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7f2) [0x7a5472]
10: (Thread::create(unsigned long)+0x8a) [0x748c9a]
11: (SimpleMessenger::add_accept_pipe(int)+0x6a) [0x8351ba]
12: (Accepter::entry()+0x265) [0x863295]
13: (()+0x79d1) [0x7fa7dfdf89d1]
14: (clone()+0x6d) [0x7fa7deb33b6d]
But I see some things happening on the system while doing this too:
[root@ ~]# ceph osd pool set ecdata15 pgp_num 4096
set pool 16 pgp_num to 4096
[root@ ~]# ceph status
Traceback (most recent call last):
File "/usr/bin/ceph", line 830, in <module>
sys.exit(main())
File "/usr/bin/ceph", line 590, in main
conffile=conffile)
File "/usr/lib/python2.6/site-packages/rados.py", line 198, in __init__
librados_path = find_library('rados')
File "/usr/lib64/python2.6/ctypes/util.py", line 209, in find_library
return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
File "/usr/lib64/python2.6/ctypes/util.py", line 203, in
_findSoname_ldconfig
os.popen('LANG=C /sbin/ldconfig -p 2>/dev/null').read())
OSError: [Errno 12] Cannot allocate memory
[root@ ~]# lsof | wc
-bash: fork: Cannot allocate memory
[root@ ~]# lsof | wc
21801 211209 3230028
[root@ ~]# ceph status
^CError connecting to cluster: InterruptedOrTimeoutError
[root@ ~]# lsof | wc
2028 17476 190947
And meanwhile the daemons have crashed.
I verified that memory never actually ran out.
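(Worth noting: "Cannot allocate memory" from fork() and pthread_create() can mean the task count hit RLIMIT_NPROC or kernel.threads-max rather than RAM being exhausted, which would match the FAILED assert in Thread::create above. A sketch for checking the live task counts, using only /proc; the pgrep target and PID fallback are illustrative:)

```shell
# Count every task (thread) on the host by walking /proc; each OSD
# messenger connection keeps its own threads alive, so this can climb
# fast while PGs are peering.
echo "total tasks on host: $(ls -d /proc/[0-9]*/task/[0-9]* | wc -l)"

# Threads inside one process (falls back to the current shell's PID):
pid=$(pgrep -o ceph-osd 2>/dev/null || echo $$)
echo "threads in pid ${pid}: $(ls "/proc/${pid}/task" | wc -l)"
```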
Thanks!
Kenneth
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
----- End message from Gregory Farnum <g...@inktank.com> -----
--
Met vriendelijke groeten,
Kenneth Waegeman
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com