Hi All,

I am running the MapReduce wordcount code (on a Ceph cluster consisting of
2 VMs) over a data set of 5,000-odd files (approx. 10 GB in total).
Periodically, "ceph health" reports that the MDS is laggy/unresponsive, and
I get messages like the following:

13/04/24 10:41:00 INFO mapred.JobClient:  map 11% reduce 3%
13/04/24 10:42:36 INFO mapred.JobClient:  map 12% reduce 3%
13/04/24 10:42:45 INFO mapred.JobClient:  map 12% reduce 4%
13/04/24 10:44:08 INFO mapred.JobClient:  map 13% reduce 4%
13/04/24 10:45:29 INFO mapred.JobClient:  map 14% reduce 4%
13/04/24 11:06:31 INFO mapred.JobClient: Task Id :
attempt_201304241023_0001_m_000706_0, Status : FAILED
Task attempt_201304241023_0001_m_000706_0 failed to report status for 600
seconds. Killing!
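
(The 600 seconds above is just Hadoop's default mapred.task.timeout of
600000 ms; raising it in mapred-site.xml would only mask the stall, so the
real question is why the MDS goes laggy.)

When the job stalls, I confirm it is the MDS by checking the cluster state
with the standard ceph commands:

        ceph -s              # overall cluster status
        ceph health detail   # this is where "mds is laggy/unresponsive" shows up
        ceph mds stat        # current MDS map state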

I then have to restart the MDS manually (the restart command I use is shown
after the conf below), and the job continues execution. Can someone please
tell me why this happens, and how to solve it? Pasting my ceph.conf file
below:

[global]
        auth client required = none
        auth cluster required = none
        auth service required = none

[osd]
        osd journal size = 1000
        filestore xattr use omap = true
#       osd data = /var/lib/ceph/osd/ceph-$id

[mon.a]
        host = varunc4-virtual-machine
        mon addr = 10.72.148.209:6789
#       mon data = /var/lib/ceph/mon/ceph-a

[mds.a]
        host = varunc4-virtual-machine
#       mds data = /var/lib/ceph/mds/ceph-a

[osd.0]
        host = varunc4-virtual-machine

[osd.1]
        host = varunc5-virtual-machine
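
For reference, this is how I restart the MDS. I am assuming the sysvinit
script here (mkcephfs-style deployment); upstart-based deployments would
use "restart ceph-mds id=a" instead:

        sudo service ceph restart mds.a
        # or, calling the init script directly:
        sudo /etc/init.d/ceph restart mds.a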

Regards
Varun
