I notice that in both logs, the last entry before the MDS restart/failover is
when the MDS is replaying the journal and gets to
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/
2015-05-22 09:59:19.116231 7f9d930c1700 10 mds.0.journal EMetaBlob.replay for
[2,head] had [inode 100003f8e31 [...2,head]
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/ auth v20776 f(v0 m2015-05-22
02:34:09.000000 357=357+0) n(v1 rc2015-05-22 02:34:09.000000 b71340955004
358=357+1) (iversion lock) | dirfrag=1 dirty=1 0x6ded9c8]
2015-05-22 08:04:31.993007 7f87afb2f700 10 mds.0.journal EMetaBlob.replay for
[2,head] had [inode 100003f8e31 [...2,head]
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/ auth v20776 f(v0 m2015-05-22
02:34:09.000000 357=357+0) n(v1 rc2015-05-22 02:34:09.000000 b71340955004
358=357+1) (iversion lock) | dirfrag=1 dirty=1 0x76a59c8]
Maybe there's some problem in this part of the journal? Or maybe that's the end
of the journal and it crashes afterwards? No idea :( Hopefully one of the devs
can weigh in.
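
One way to narrow that down (just a sketch, assuming your 0.94 build ships
cephfs-journal-tool and you run it while the MDS is down or in standby) would
be to inspect the journal directly:

# cephfs-journal-tool journal inspect
# cephfs-journal-tool event get summary

The first command checks whether the journal is readable/intact; the second
summarizes the events in it, which might show whether it really ends at that
EMetaBlob entry.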
--Lincoln
On May 22, 2015, at 11:40 AM, Adam Tygart wrote:
> I knew I forgot to include something with my initial e-mail.
>
> Single active with failover.
>
> dumped mdsmap epoch 30608
> epoch 30608
> flags 0
> created 2015-04-02 16:15:55.209894
> modified 2015-05-22 11:39:15.992774
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 17592186044416
> last_failure 30606
> last_failure_osd_epoch 24298
> compat compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in
> separate object,5=mds uses versioned encoding,6=dirfrag is stored in
> omap,8=no anchor table}
> max_mds 1
> in 0
> up {0=20284976}
> failed
> stopped
> data_pools 25
> metadata_pool 27
> inline_data disabled
> 20285024: 10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
> 20346784: 10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
> 20284976: 10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1
>
> --
> Adam
>
> On Fri, May 22, 2015 at 11:37 AM, Lincoln Bryant <[email protected]>
> wrote:
>> I've experienced MDS issues in the past, but nothing sticks out to me in
>> your logs.
>>
>> Are you using a single active MDS with failover, or multiple active MDS?
>>
>> --Lincoln
>>
>> On May 22, 2015, at 10:10 AM, Adam Tygart wrote:
>>
>>> Thanks for the quick response.
>>>
>>> I had 'debug mds = 20' in the first log; I added 'debug ms = 1' for this
>>> one:
>>> https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing
>>>
>>> Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
>>> times out and then the mds gets respawned.
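>>>
>>> If replay is simply taking longer than the heartbeat grace, one thing that
>>> might help (an assumption on my part, not verified against this cluster)
>>> is raising the beacon grace so the mons don't fail the MDS over
>>> mid-replay:
>>>
>>> # ceph mds tell 0 injectargs '--mds-beacon-grace 60'
>>>
>>> and persisting 'mds beacon grace = 60' in ceph.conf on both the mon and
>>> MDS hosts to survive restarts.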
>>>
>>> --
>>> Adam
>>>
>>> On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant <[email protected]>
>>> wrote:
>>>> Hi Adam,
>>>>
>>>> You can get the MDS to spit out more debug information like so:
>>>>
>>>> # ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
>>>>
>>>> At least then you can see where it's at when it crashes.
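>>>>
>>>> If the injected values don't seem to take effect, you can also query the
>>>> admin socket on the MDS host to confirm (the daemon name below is a
>>>> guess; substitute your actual mds id):
>>>>
>>>> # ceph daemon mds.hobbit13 config get debug_mds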
>>>>
>>>> --Lincoln
>>>>
>>>> On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> The ceph-mds servers in our cluster are stuck in a constant
>>>>> boot->replay->crash cycle.
>>>>>
>>>>> I have enabled debug logging for the MDS during a restart cycle on one
>>>>> of the nodes[1].
>>>>>
>>>>> Kernel debug from cephfs client during reconnection attempts:
>>>>> [732586.352173] ceph: mdsc delayed_work
>>>>> [732586.352178] ceph: check_delayed_caps
>>>>> [732586.352182] ceph: lookup_mds_session ffff88202f01c000 210
>>>>> [732586.352185] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>>> [732586.352189] ceph: send_renew_caps ignoring mds0 (up:replay)
>>>>> [732586.352192] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>>> [732586.352195] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>>> [732586.352198] ceph: mdsc delayed_work
>>>>> [732586.352200] ceph: check_delayed_caps
>>>>> [732586.352202] ceph: lookup_mds_session ffff881036cbf800 1
>>>>> [732586.352205] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>>> [732586.352207] ceph: send_renew_caps ignoring mds0 (up:replay)
>>>>> [732586.352210] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>>> [732586.352212] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>>> [732591.357123] ceph: mdsc delayed_work
>>>>> [732591.357128] ceph: check_delayed_caps
>>>>> [732591.357132] ceph: lookup_mds_session ffff88202f01c000 210
>>>>> [732591.357135] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>>> [732591.357139] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>>> [732591.357142] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>>> [732591.357145] ceph: mdsc delayed_work
>>>>> [732591.357147] ceph: check_delayed_caps
>>>>> [732591.357149] ceph: lookup_mds_session ffff881036cbf800 1
>>>>> [732591.357152] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>>> [732591.357154] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>>> [732591.357157] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>>> [732596.362076] ceph: mdsc delayed_work
>>>>> [732596.362081] ceph: check_delayed_caps
>>>>> [732596.362084] ceph: lookup_mds_session ffff88202f01c000 210
>>>>> [732596.362087] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>>> [732596.362091] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>>> [732596.362094] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>>> [732596.362097] ceph: mdsc delayed_work
>>>>> [732596.362099] ceph: check_delayed_caps
>>>>> [732596.362101] ceph: lookup_mds_session ffff881036cbf800 1
>>>>> [732596.362104] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>>> [732596.362106] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>>> [732596.362109] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>>>
>>>>> Does anybody have any debugging tips, or ideas on how to get an MDS
>>>>> stable?
>>>>>
>>>>> Server info: CentOS 7.1 with Ceph 0.94.1
>>>>> Client info: Gentoo, kernel cephfs. 3.19.5-gentoo
>>>>>
>>>>> I'd reboot the client, but at this point, I don't believe this is a
>>>>> client issue.
>>>>>
>>>>> [1]
>>>>> https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing
>>>>>
>>>>> --
>>>>> Adam
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> [email protected]
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>