On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman wrote:
We are also running a full backup sync to cephfs, using multiple
distributed rsync streams (with zkrsync), and also ran into this issue
today on Hammer 0.94.1.
After setting the beacon grace higher, and eventually clearing the journal, it
stabilized again.
We were using ceph-fuse to mount the cephfs;
the kernel client bug should be fixed by
https://github.com/ceph/ceph-client/commit/72f22efb658e6f9e126b2b0fcb065f66ffd02239
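
For reference, "clearing the journal" on Hammer would normally be done with
cephfs-journal-tool; a rough sketch (only with the MDS stopped, and only after
exporting a backup; the file name here is just an example):

    # keep a copy of the journal before touching anything
    cephfs-journal-tool journal export /root/mds0-journal-backup.bin
    # write whatever dentries can be recovered from journal events
    # back into the metadata store
    cephfs-journal-tool event recover_dentries summary
    # then wipe the journal so the MDS can start from a clean slate
    cephfs-journal-tool journal reset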
On 22/05/2015 20:06, Gregory Farnum wrote:
> Ugh. We appear to be trying to allocate too much memory for this event
> in the journal dump; we'll need to fix this. :(
It's not even per-event, it tries to load the entire journal into memory
in one go. This is a hangover from the old Dumper/Resetter code.
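
As a rough way to gauge how big that one-shot load would be, counting the
rank-0 journal objects in the metadata pool gives an upper bound (this assumes
the default 4 MB journal object size; the '200.' prefix is the same one used
elsewhere in this thread):

    # number of rank-0 journal objects; multiply by the object size
    # (4 MB by default) for a rough upper bound on the allocation
    rados -p metadata ls | grep -c '^200\.'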
Alright, bumping that up by 10 worked. The MDS server came up and
"recovered". Took about 1 minute.
Thanks again, guys.
--
Adam
On Fri, May 22, 2015 at 2:50 PM, Gregory Farnum wrote:
> Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
> to try and get the mds server functional again?
Yep! Let us know how it goes...

On Fri, May 22, 2015 at 12:45 PM, Adam Tygart wrote:
Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
to try and get the mds server functional again?
I realize there is nothing simple about the things that are being
accomplished here, and thank everyone for their hard work on making
this stuff work as well as it does.
--
Adam
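
For anyone following along, bumping that grace period looks roughly like this
(600 is an arbitrary example value, and "a"/"0" are placeholder daemon ids;
injected values are not persistent, so set them back or move them into
ceph.conf once the MDS has recovered):

    # tell each monitor not to fail the MDS while it grinds through replay
    ceph tell mon.a injectargs '--mds_beacon_grace 600'
    # and the MDS itself, whose internal heartbeat uses the same setting
    ceph mds tell 0 injectargs '--mds_beacon_grace 600'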
On Fri, May 22, 2015 at 12:34 PM, Adam Tygart wrote:
I believe I grabbed all of these files:
for x in $(rados -p metadata ls | grep -E '^200\.'); do
    rados -p metadata get ${x} /tmp/metadata/${x}
done
tar czSf journal.tar.gz /tmp/metadata
https://drive.google.com/file/d/0B4XF1RWjuGh5MVFqVFZfNmpfQWc/view?usp=sharing
When this crash occurred, the r
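
If anyone repeats that capture, a couple of quick sanity checks on the result
may be worth doing (pool and path names as above; the header object name
assumes mds rank 0):

    mkdir -p /tmp/metadata              # needed before the loop above is run
    rados -p metadata stat 200.00000000 # journal header object for rank 0
    tar tzf journal.tar.gz | head       # confirm the archive really has the objects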
I notice in both logs, the last entry before the MDS restart/failover is when
the mds is replaying the journal and gets to
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/
2015-05-22 09:59:19.116231 7f9d930c1700 10 mds.0.journal EMetaBlob.replay for
[2,head] had [inode 13f8e31 [...2,head]
/hom
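
A quick way to pull that last-replayed entry out of a debug log (the log path
here is just an example):

    # show the final EMetaBlob.replay line the MDS logged before it died
    grep 'EMetaBlob.replay' /var/log/ceph/ceph-mds.mds01.log | tail -n 1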
On 22/05/2015 15:33, Adam Tygart wrote:
> Hello all,
> The ceph-mds servers in our cluster are performing a constant
> boot->replay->crash in our systems.
> I have enabled debug logging for the mds for a restart cycle on one of
> the nodes[1].
You found a bug, or more correctly you probably found multiple bugs.
I knew I forgot to include something with my initial e-mail.
Single active with failover.
dumped mdsmap epoch 30608
epoch   30608
flags   0
created 2015-04-02 16:15:55.209894
modified        2015-05-22 11:39:15.992774
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_
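
That dump can be regenerated at any time on Hammer with:

    # print the current mdsmap (epoch, max_mds, active/standby daemons, ...)
    ceph mds dump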
I've experienced MDS issues in the past, but nothing sticks out to me in your
logs.
Are you using a single active MDS with failover, or multiple active MDS?
--Lincoln
On May 22, 2015, at 10:10 AM, Adam Tygart wrote:
Thanks for the quick response.
I had 'debug mds = 20' in the first log; I added 'debug ms = 1' for this one:
https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing
Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
times out and then the mds gets respawned.
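
Those timeouts should show up directly in the MDS log; something like this
(log path is an example) shows when the internal heartbeat started failing:

    grep heartbeat_map /var/log/ceph/ceph-mds.mds01.log | tail -n 5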
Hi Adam,
You can get the MDS to spit out more debug information like so:
# ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
At least then you can see where it's at when it crashes.
--Lincoln
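
To make those debug levels survive a respawn (injectargs only affects the
currently running process), the equivalent ceph.conf stanza would be roughly:

    [mds]
        debug mds = 20
        debug ms = 1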
On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
Hello all,
The ceph-mds servers in our cluster are performing a constant
boot->replay->crash in our systems.
I have enabled debug logging for the mds for a restart cycle on one of
the nodes[1].
Kernel debug from cephfs client during reconnection attempts:
[732586.352173] ceph: mdsc delayed_work
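
Those "ceph: mdsc ..." messages come from the kernel client's debug output; if
they are not already enabled, dynamic debug is the usual way to turn them on
(requires debugfs and a kernel with dynamic debug support):

    mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
    echo 'module ceph +p'    > /sys/kernel/debug/dynamic_debug/control
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control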