The executor is always being removed. Someone else encountered the same issue: https://groups.google.com/forum/#!topic/spark-users/-mYn6BF-Y5Y
-------------
14/07/02 17:41:16 INFO storage.BlockManagerMasterActor: Trying to remove executor 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
14/07/02 17:41:16 INFO storage.BlockManagerMaster: Removed 20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
14/07/02 17:41:16 DEBUG spark.MapOutputTrackerMaster: Increasing epoch to 10
14/07/02 17:41:16 INFO scheduler.DAGScheduler: Host gained which was in lost list earlier: bigdata001
14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name: TaskSet_0, runningTasks: 0
14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name: TaskSet_0, runningTasks: 0
14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 12 on executor 20140616-143932-1694607552-5050-4080-3: bigdata004 (NODE_LOCAL)
14/07/02 17:41:16 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 10785 bytes in 1 ms
14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 13 on executor 20140616-104524-1694607552-5050-26919-3: bigdata002 (NODE_LOCAL

2014-07-02 12:01 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:

> Also, this one is in the warning log:
>
> E0702 11:35:08.869998 17840 slave.cpp:2310] Container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1' for executor '20140616-104524-1694607552-5050-26919-1' of framework '20140702-113428-1694607552-5050-17766-0000' failed to start: Failed to fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit status 32512
>
> 2014-07-02 11:46 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>
>> Here is the log:
>>
>> E0702 10:32:07.599364 14915 slave.cpp:2686] Failed to unmonitor container for executor 20140616-104524-1694607552-5050-26919-1 of framework 20140702-102939-1694607552-5050-14846-0000: Not monitored
>>
>> 2014-07-02 1:45 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>
>>> Can you post the logs from any of the dying executors?
>>>
>>> On Tue, Jul 1, 2014 at 1:25 AM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>
>>> > I am using Mesos 0.19 and Spark 0.9.0. The Mesos cluster is started, but
>>> > when I use spark-shell to submit a job, the tasks are always lost. Here is
>>> > the log:
>>> > ----------
>>> > 14/07/01 16:24:27 INFO DAGScheduler: Host gained which was in lost list earlier: bigdata005
>>> > 14/07/01 16:24:27 INFO TaskSetManager: Starting task 0.0:1 as TID 4042 on executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
>>> > 14/07/01 16:24:27 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes in 0 ms
>>> > 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for 20140616-104524-1694607552-5050-26919-1 from TaskSet 0.0
>>> > 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4041 (task 0.0:0)
>>> > 14/07/01 16:24:28 INFO DAGScheduler: Executor lost: 20140616-104524-1694607552-5050-26919-1 (epoch 3427)
>>> > 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
>>> > 14/07/01 16:24:28 INFO BlockManagerMaster: Removed 20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
>>> > 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for 20140616-143932-1694607552-5050-4080-2 from TaskSet 0.0
>>> > 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4042 (task 0.0:1)
>>> > 14/07/01 16:24:28 INFO DAGScheduler: Executor lost: 20140616-143932-1694607552-5050-4080-2 (epoch 3428)
>>> > 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor 20140616-143932-1694607552-5050-4080-2 from BlockManagerMaster.
>>> > 14/07/01 16:24:28 INFO BlockManagerMaster: Removed 20140616-143932-1694607552-5050-4080-2 successfully in removeExecutor
>>> > 14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list earlier: bigdata005
>>> > 14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list earlier: bigdata001
>>> > 14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:1 as TID 4043 on executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
>>> > 14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes in 0 ms
>>> > 14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:0 as TID 4044 on executor 20140616-104524-1694607552-5050-26919-1: bigdata001 (PROCESS_LOCAL)
>>> > 14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:0 as 1570 bytes in 0 ms
>>> >
>>> > It seems someone else has also encountered this problem:
>>> >
>>> > http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201305.mbox/%3c201305161047069952...@nfs.iscas.ac.cn%3E
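
A side note on the "exit status 32512" in the Mesos slave log quoted in this thread. If that number is a raw POSIX wait status (this is an assumption; the thread does not confirm how the slave reports it), the real exit code sits in the high byte. The helper below is a minimal sketch for decoding it; decode_wait_status is a name made up for illustration and is not part of Mesos or Spark.

```python
def decode_wait_status(status: int) -> dict:
    """Split a 16-bit POSIX wait status into its conventional fields.

    Assumes the traditional Linux encoding: exit code in bits 8-15,
    terminating signal in bits 0-6, core-dump flag in bit 7.
    """
    return {
        "exit_code": (status >> 8) & 0xFF,
        "signal": status & 0x7F,
        "core_dumped": bool(status & 0x80),
    }


# Value taken from the "Failed to fetch URIs" error in the slave log.
decoded = decode_wait_status(32512)
print(decoded)
```

Under that assumption the status decodes to exit code 127 with no signal. By shell convention 127 means "command not found", so it may be worth checking which command the slave runs to fetch the executor URI and whether it is on the PATH of the user running the slave on each node.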