----- Message from "Yan, Zheng" <uker...@gmail.com> ---------
   Date: Wed, 16 Apr 2014 14:45:37 +0800
   From: "Yan, Zheng" <uker...@gmail.com>
Subject: Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
     To: Stijn De Weirdt <stijn.dewei...@ugent.be>
     Cc: ceph-users@lists.ceph.com


On Wed, Apr 16, 2014 at 2:08 PM, Stijn De Weirdt
<stijn.dewei...@ugent.be> wrote:
What do you mean by the MDS journal? Where can I find this journal?
Can a better CPU solve the slow trimming? (Now 2 hexacore AMD Opteron 4334)


MDS uses a journal to record recent metadata updates. The journal is
stored in the metadata pool (object names 200.*). The speed of trimming
the journal is limited by OSD performance; a better CPU does not help.
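
(a quick way to eyeball the journal itself -- a minimal sketch, assuming
the metadata pool is literally named "metadata" and you are looking at
mds rank 0, whose journal objects are the 200.* ones mentioned above:)

  # count the journal objects of rank 0 still present in the metadata pool;
  # the count shrinks as the journal gets trimmed
  rados -p metadata ls | grep '^200\.' | wc -l

  # print name, mtime and size of each journal object
  rados -p metadata ls | grep '^200\.' | while read -r obj; do
      rados -p metadata stat "$obj"
  done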

i'm a bit confused now. so the metadata journal is stored in the
metadata pool (which is using OSDs that also have journals), but where
is the metadata-data (for lack of better wording) stored? also in the
metadata pool?

we have configured a separate set of OSDs and modified the crushmap so
the metadata pool uses those OSDs. if these are only used as journal
for the metadata, then it's not so odd the metadata journal is way
ahead.

metadata is also stored in the metadata pool. When handling an update
request, MDS writes the metadata changes to the journal; MDS writes
the changes to the corresponding metadata objects when it trims the
journal.
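
(side note: each directory fragment is stored as omap key/value pairs
on an object named after the directory's inode number -- the same data
the CDir::_omap_fetched assert further down this thread is reading
back. a minimal sketch for peeking at one, again assuming the pool
name "metadata" and using the root directory, inode 1, as the example
object:)

  # dentries of the root directory, stored as omap keys on object 1.00000000
  rados -p metadata listomapkeys 1.00000000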



Please try
https://github.com/ceph/ceph/commit/6dfed2693e9002dbaf82b3dc1a637e1c53878fe3
I hope it can solve this issue.

We are now running the patched version, thank you for that! We'll
monitor what happens. Is there a way to see how much of a backlog we
have?
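
(a minimal sketch for watching the backlog via the mds admin socket;
the socket path, the daemon name "ceph001" and the exact "mds_log"
counter names are assumptions to verify on your build -- roughly,
wrpos - expos is the amount of journal that has not been trimmed yet:)

  # dump the mds perf counters and look at the journal positions
  ceph --admin-daemon /var/run/ceph/ceph-mds.ceph001.asok perf dump | \
      python -mjson.tool | grep -A 20 '"mds_log"'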



ok, we'll rebuild and try asap

stijn



Regards
Yan, Zheng

Thanks!



Kenneth

----- Message from Stijn De Weirdt <stijn.dewei...@ugent.be> ---------
    Date: Fri, 04 Apr 2014 20:31:34 +0200
    From: Stijn De Weirdt <stijn.dewei...@ugent.be>

Subject: Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
      To: "Yan, Zheng" <uker...@gmail.com>
      Cc: ceph-users@lists.ceph.com



hi yan,

(taking the list in CC)

On 04/04/2014 04:44 PM, Yan, Zheng wrote:



On Thu, Apr 3, 2014 at 2:52 PM, Stijn De Weirdt
<stijn.dewei...@ugent.be> wrote:

hi,

latest pprof output attached.

this is no kernel client, this is ceph-fuse on EL6. starting the mds
without any ceph-fuse mounts works without issue. mounting ceph-fuse
afterwards also works fine. simple filesystem operations work as
expected.

we'll check the state of the fuse mount via fusectl once we've
reproduced the issue.
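
(for reference, a minimal sketch of checking that state -- fusectl is
a small pseudo-filesystem with one numbered directory per fuse
connection:)

  # mount the fusectl filesystem if it is not mounted yet
  mount -t fusectl none /sys/fs/fuse/connections
  # 'waiting' shows the number of requests queued on each connection
  cat /sys/fs/fuse/connections/*/waiting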




One dark side of the fuse client is that there is no way to purge
inodes from the kernel inode cache. Each inode cached in the kernel
pins the corresponding inode in ceph-mds. I think that's why ceph-fuse
and ceph-mds use so much memory. Could you try executing "echo 3 >
/proc/sys/vm/drop_caches" periodically on the ceph-fuse mount machine?
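
(a minimal sketch of running that periodically, e.g. from cron; the
file name and the 10-minute interval are just assumptions to tune:)

  # /etc/cron.d/ceph-fuse-drop-caches (hypothetical file name)
  # sync first so dirty pages are written back -- drop_caches only
  # frees clean page cache, dentries and inodes
  */10 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches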



thanks a lot for the suggestion. that might very well explain the
issue. we're currently looking into other issues that we thought were
related (extremely low mdtest performance), but we will try your
suggestion once we manage to reproduce the condition.



stijn


Regards
Yan, Zheng



stijn


On 04/03/2014 02:39 AM, Yan, Zheng wrote:




which version of the kernel client did you use? please send the
contents of the client node's /sys/kernel/debug/ceph/*/caps when the
MDS uses lots of memory.
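
(e.g., on the client, assuming debugfs is mounted in the usual place:)

  # debugfs must be mounted for these files to exist
  mount -t debugfs none /sys/kernel/debug 2>/dev/null
  cat /sys/kernel/debug/ceph/*/caps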

Regards
Yan, Zheng

On Thu, Apr 3, 2014 at 2:58 AM, Stijn De Weirdt
<stijn.dewei...@ugent.be> wrote:




wow. kudos for integrating this in ceph. more projects should do it
like that!

anyway, in attachment a gzipped ps file. heap is at 4.4GB, top reports
6.5GB mem usage.

care to point out what to look for? i'll send a new one when the usage
is starting to cause swapping.

thanks,

stijn


On 04/02/2014 06:35 PM, Gregory Farnum wrote:





Did you see
http://ceph.com/docs/master/rados/troubleshooting/memory-profiling?
That should have what you need to get started, although you'll also
need to learn the basics of using the heap analysis tools elsewhere.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
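
(for the archives, a minimal sketch of the workflow from that page,
applied to the mds -- "mds.0" is a placeholder for your mds id, and
the dump file path follows the pattern the page describes:)

  ceph tell mds.0 heap start_profiler   # begin sampling allocations
  ceph tell mds.0 heap dump             # write a .heap file to the log dir
  ceph tell mds.0 heap stop_profiler

  # analyze the dump; the tool is named pprof or google-pprof depending
  # on how gperftools was packaged
  google-pprof --pdf /usr/bin/ceph-mds \
      /var/log/ceph/mds.0.profile.0001.heap > mds-heap.pdf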


On Wed, Apr 2, 2014 at 9:23 AM, Stijn De Weirdt
<stijn.dewei...@ugent.be> wrote:





hi,


1) How big and what shape the filesystem is. Do you have some
extremely large directory that the MDS keeps trying to load and then
dump?

any way to extract this from the mds without having to start it? as it
was an rsync operation, i can try to locate possible candidates on the
source filesystem, but what would be considered "large"?

total number of files is 13M, spread over 800k directories, but it's
unclear how far the sync was at the time of failing. i've not found a
good way to look for directories with lots of files and/or subdirs.
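
(a minimal sketch for hunting candidates on the source filesystem --
prints the direct entry count of every directory, largest first; the
path and the top-20 cutoff are placeholders:)

  # count direct entries (files + subdirs) per directory, report the top 20
  find /path/to/rsync/source -xdev -type d | while read -r d; do
      printf '%s\t%s\n' "$(ls -A "$d" | wc -l)" "$d"
  done | sort -rn | head -20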



2) Use tcmalloc's heap analyzer to see where all the memory is being
allocated.

we'll give that a try






i run ceph-mds with HEAPCHECK=normal (via the init script), but how
can we "stop" mds without killing it? the heapchecker only seems to
dump at the end of a run; maybe there's a way to get an intermediate
dump like valgrind, but the documentation is not very helpful.

stijn
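
(in case it helps: the heap profiler, unlike the end-of-run heap
checker, can be asked for an intermediate dump at any time through the
admin interface described on the memory-profiling page greg linked,
e.g.:)

  # writes an intermediate .heap file without stopping the mds
  ceph tell mds.0 heap dump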



3) Look through the logs for when the beacon fails (the first of
"mds.0.16 is_laggy 600.641332 > 15 since last acked beacon") and see
if there's anything tell-tale going on at the time.

anything in particular we should be looking for?

the log goes as follows:
mds starts around 11:43
...

2014-04-01 11:44:23.658583 7ffec89c6700  1 mds.0.server reconnect_clients -- 1 sessions
2014-04-01 11:44:41.212488 7ffec89c6700  0 log [DBG] : reconnect by client.4585 10.141.8.199:0/3551 after 17.553854
2014-04-01 11:44:45.692237 7ffec89c6700  1 mds.0.10 reconnect_done
2014-04-01 11:44:45.996384 7ffec89c6700  1 mds.0.10 handle_mds_map i am now mds.0.10
2014-04-01 11:44:45.996388 7ffec89c6700  1 mds.0.10 handle_mds_map state change up:reconnect --> up:rejoin
2014-04-01 11:44:45.996390 7ffec89c6700  1 mds.0.10 rejoin_start
2014-04-01 11:49:53.158471 7ffec89c6700  1 mds.0.10 rejoin_joint_start

then lots (4667 lines) of

2014-04-01 11:50:10.237035 7ffebc844700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.180:6789/0 pipe(0x38a7da00 sd=104 :41115 s=4 pgs=0 cs=0 l=1 c=0x6513e8840).connect got RESETSESSION but no longer connecting

with one intermediate

2014-04-01 11:51:50.181354 7ffebcf4b700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.180:6789/0 pipe(0x10e282580 sd=103 :0 s=1 pgs=0 cs=0 l=1 c=0xc77d5ee0).fault

then sudden change

2014-04-01 11:57:30.440554 7ffebcd49700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.182:6789/0 pipe(0xa1534100 sd=104 :48176 s=4 pgs=0 cs=0 l=1 c=0xd99b11e0).connect got RESETSESSION but no longer connecting
2014-04-01 11:57:30.722607 7ffebec68700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER
2014-04-01 11:57:30.722669 7ffebec68700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER
2014-04-01 11:57:30.722885 7ffebec68700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER
2014-04-01 11:57:30.722945 7ffebec68700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1 c=0xc48a3f40).connect got BADAUTHORIZER

followed by lots of

2014-04-01 11:57:30.738562 7ffebec68700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0xead9fa80 sd=57 :0 s=1 pgs=0 cs=0 l=1 c=0x10e5d280).fault

with sporadic

2014-04-01 11:57:32.431219 7ffebeb67700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.182:6789/0 pipe(0xef85cd80 sd=103 :41218 s=4 pgs=0 cs=0 l=1 c=0x130590dc0).connect got RESETSESSION but no longer connecting

until the dump

2014-04-01 11:59:27.612850 7ffebea66700  0 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0xe3036400 sd=103 :0 s=1 pgs=0 cs=0 l=1 c=0xa7be300).fault
2014-04-01 11:59:27.639009 7ffec89c6700 -1 mds/CDir.cc: In function 'void CDir::_omap_fetched(ceph::bufferlist&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list> > >&, const std::string&, int)' thread 7ffec89c6700 time 2014-04-01 11:59:27.620684
mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61)

 ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
 1: (CDir::_omap_fetched(ceph::buffer::list&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&, std::string const&, int)+0x4d71) [0x77c3c1]
 2: (Context::complete(int)+0x9) [0x56bb79]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x11a6) [0x806dd6]
 4: (MDS::handle_core_message(Message*)+0x9c7) [0x5901d7]
 5: (MDS::_dispatch(Message*)+0x2f) [0x59028f]
 6: (MDS::ms_dispatch(Message*)+0x1ab) [0x591d4b]
 7: (DispatchQueue::entry()+0x582) [0x902072]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x85ef4d]
 9: /lib64/libpthread.so.0() [0x34c36079d1]
 10: (clone()+0x6d) [0x34c32e8b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2014-04-01 11:59:27.137779 7ffec89c6700  5 mds.0.10 initiating monitor reconnect; maybe we're not the slow one
 -9999> 2014-04-01 11:59:27.137787 7ffec89c6700 10 monclient(hunting): _reopen_session rank -1 name
 -9998> 2014-04-01 11:59:27.137790 7ffec89c6700  1 -- 10.141.8.180:6837/55117 mark_down 0x1326be460 -- 0x2cae5f00
 -9997> 2014-04-01 11:59:27.137807 7ffebf672700  2 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x2cae5f00 sd=57 :0 s=4 pgs=0 cs=0 l=1 c=0x1326be460).connect error 10.141.8.181:6789/0, 104: Connection reset by peer
 -9996> 2014-04-01 11:59:27.137819 7ffebf672700  3 -- 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x2cae5f00 sd=57 :0 s=4 pgs=0 cs=0 l=1 c=0x1326be460).connect fault, but state = closed != connecting, stopping
 -9995> 2014-04-01 11:59:27.137824 7ffec89c6700 10 monclient(hunting): picked mon.ceph003 con 0x3af54ba0 addr 10.141.8.182:6789/0
 -9994> 2014-04-01 11:59:27.137832 7ffec89c6700 10 monclient(hunting): _send_mon_message to mon.ceph003 at 10.141.8.182:6789/0

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 2, 2014 at 3:39 AM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:







----- Message from Gregory Farnum <g...@inktank.com> ---------
       Date: Tue, 1 Apr 2014 09:03:17 -0700
       From: Gregory Farnum <g...@inktank.com>
Subject: Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
         To: "Yan, Zheng" <uker...@gmail.com>
         Cc: Kenneth Waegeman <kenneth.waege...@ugent.be>, ceph-users <ceph-users@lists.ceph.com>



On Tue, Apr 1, 2014 at 7:12 AM, Yan, Zheng <uker...@gmail.com> wrote:

On Tue, Apr 1, 2014 at 10:02 PM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:

After some more searching, I've found that the source of the problem
is with the mds and not the mon. The mds crashes, generates a core
dump that eats the local space, and in turn the monitor (because of
leveldb) crashes.

The error in the mds log of one host:

2014-04-01 15:46:34.414615 7f870e319700  0 -- 10.141.8.180:6836/13152 >> 10.141.8.180:6789/0 pipe(0x517371180 sd=54 :42439 s=4 pgs=0 cs=0 l=1 c=0x147ac780).connect got RESETSESSION but no longer connecting
2014-04-01 15:46:34.438792 7f871194f700  0 -- 10.141.8.180:6836/13152 >> 10.141.8.180:6789/0 pipe(0x1b099f580 sd=8 :43150 s=4 pgs=0 cs=0 l=1 c=0x1fd44360).connect got RESETSESSION but no longer connecting
2014-04-01 15:46:34.439028 7f870e319700  0 -- 10.141.8.180:6836/13152 >> 10.141.8.182:6789/0 pipe(0x13aa64880 sd=54 :37085 s=4 pgs=0 cs=0 l=1 c=0x1fd43de0).connect got RESETSESSION but no longer connecting
2014-04-01 15:46:34.468257 7f871b7ae700 -1 mds/CDir.cc: In function 'void CDir::_omap_fetched(ceph::bufferlist&, std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list> > >&, const std::string&, int)' thread 7f871b7ae700 time 2014-04-01 15:46:34.448320
mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61)


could you use gdb to check what the value of variable 'r' is?
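
(a minimal sketch of pulling that out of the core file, assuming the
matching debuginfo is installed; the core path and the frame number
are placeholders to adjust after looking at the backtrace:)

  # print the faulting thread's backtrace, select the CDir::_omap_fetched
  # frame, then print the return code the assert rejected
  gdb -batch -ex 'bt' -ex 'frame 2' -ex 'print r' \
      /usr/bin/ceph-mds /path/to/core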








If you look at the crash dump log you can see the return value in the
osd_op_reply message:

-1> 2014-04-01 15:46:34.440860 7f871b7ae700  1 -- 10.141.8.180:6836/13152 <== osd.3 10.141.8.180:6827/4366 33077 ==== osd_op_reply(4179177 100001f2ef1.00000000 [omap-get-header 0~0,omap-get-vals 0~16] v0'0 uv0 ack = -108 (Cannot send after transport endpoint shutdown)) v6 ==== 229+0+0 (958358678 0 0) 0x2cff7aa80 con 0x37ea3c0

-108, which is ESHUTDOWN, but we also use it (via the 108 constant, I
think because ESHUTDOWN varies across platforms) as EBLACKLISTED.
So it looks like this is itself actually a symptom of another problem
that is causing the MDS to get timed out on the monitor. If a core
dump is "eating the local space", maybe the MDS is stuck in an
infinite allocation loop of some kind? How big are your disks,
Kenneth? Do you have any information on how much CPU/memory the MDS
was using before this?
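
(a quick way to double-check that mapping on a given Linux box, where
ESHUTDOWN comes from the asm-generic errno header:)

  # confirm what errno 108 means on this platform
  grep -n 'ESHUTDOWN' /usr/include/asm-generic/errno.h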








I monitored the mds process after restart:

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
19215 root      20   0 6070m 5.7g 5236 S 778.6 18.1  1:27.54  ceph-mds
19215 root      20   0 7926m 7.5g 5236 S 179.2 23.8  2:44.39  ceph-mds
19215 root      20   0 12.4g  12g 5236 S 157.2 38.8  3:43.47  ceph-mds
19215 root      20   0 16.6g  16g 5236 S 144.4 52.0  4:15.01  ceph-mds
19215 root      20   0 19.9g  19g 5236 S 137.2 62.5  4:35.83  ceph-mds
19215 root      20   0 24.5g  24g 5224 S 136.5 77.0  5:04.66  ceph-mds
19215 root      20   0 25.8g  25g 2944 S  33.7 81.2  5:13.74  ceph-mds
19215 root      20   0 26.0g  25g 2916 S  24.6 81.7  5:19.07  ceph-mds
19215 root      20   0 26.1g  25g 2916 S  13.0 82.1  5:22.16  ceph-mds
19215 root      20   0 27.7g  26g 1856 S 100.0 85.8  5:36.46  ceph-mds

Then it crashes. I changed the core dump location off the root fs;
the core dump is indeed about 26G.

My disks:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       9.9G  2.9G  6.5G  31% /
tmpfs            16G     0   16G   0% /dev/shm
/dev/sda1       248M   53M  183M  23% /boot
/dev/sda4       172G   61G  112G  35% /var/lib/ceph/log/sda4
/dev/sdb        187G   61G  127G  33% /var/lib/ceph/log/sdb
/dev/sdc        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdc
/dev/sdd        3.7T  1.5T  2.2T  41% /var/lib/ceph/osd/sdd
/dev/sde        3.7T  1.4T  2.4T  37% /var/lib/ceph/osd/sde
/dev/sdf        3.7T  1.5T  2.3T  39% /var/lib/ceph/osd/sdf
/dev/sdg        3.7T  2.1T  1.7T  56% /var/lib/ceph/osd/sdg
/dev/sdh        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdh
/dev/sdi        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdi
/dev/sdj        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdj
/dev/sdk        3.7T  2.1T  1.6T  58% /var/lib/ceph/osd/sdk
/dev/sdl        3.7T  1.7T  2.0T  46% /var/lib/ceph/osd/sdl
/dev/sdm        3.7T  1.5T  2.2T  41% /var/lib/ceph/osd/sdm
/dev/sdn        3.7T  1.4T  2.3T  38% /var/lib/ceph/osd/sdn




-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com









----- End message from Gregory Farnum <g...@inktank.com> -----



--

Kind regards,
Kenneth Waegeman






----- End message from Stijn De Weirdt <stijn.dewei...@ugent.be> -----


--

Kind regards,
Kenneth Waegeman





----- End message from "Yan, Zheng" <uker...@gmail.com> -----


--

Kind regards,
Kenneth Waegeman



----- End message from "Yan, Zheng" <uker...@gmail.com> -----

--

Kind regards,
Kenneth Waegeman


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
