Re: Problems running Spark job on mesos in fine-grained mode

Sébastien Rainville Tue, 24 Jun 2014 20:50:20 -0700

Hi Mayur,

I use primarily Scala, but I tested with pyspark, and it's working fine too
post the patch.


Thanks,

- Sebastien


On Tue, Jun 24, 2014 at 6:08 PM, Mayur Rustagi <[email protected]>
wrote:

> Hi Sebastien,
> Are you using Pyspark by any chance, is that working for you (post the
> patch?)
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Mon, Jun 23, 2014 at 1:51 PM, Fedechicco <[email protected]> wrote:
>
>> I'm getting the same behavior and it's crucial I get it fixed for an
>> evaluation of Spark + Mesos within my company.
>>
>> I'm bumping +1 for the request of putting this fix in the 1.0.1 if
>> possible!
>>
>> thanks,
>> Federico
>>
>>
>> 2014-06-20 20:51 GMT+02:00 Sébastien Rainville <
>> [email protected]>:
>>
>> Hi,
>>>
>>> this is just a follow-up regarding this issue. Turns out that it's
>>> caused by a bug in Spark. I created a case for it:
>>> https://issues.apache.org/jira/browse/SPARK-2204 and submitted a patch.
>>>
>>> Any chance this could be included in the 1.0.1 release?
>>>
>>> Thanks,
>>>
>>> - Sebastien
>>>
>>>
>>>
>>> On Tue, Jun 17, 2014 at 2:57 PM, Sébastien Rainville <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm having trouble running spark on mesos in fine-grained mode. I'm
>>>> running spark 1.0.0 and mesos 0.18.0. The tasks are failing randomly, which
>>>> most of the time, but not always, cause the job to fail. The same code is
>>>> running fine in coarse-grained mode. I see the following exceptions in the
>>>> logs of the spark driver:
>>>>
>>>> W0617 10:57:36.774382  8735 sched.cpp:901] Attempting to launch task
>>>> 21 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
>>>> W0617 10:57:36.774433  8735 sched.cpp:901] Attempting to launch task
>>>> 22 with an unknown offer 20140416-011500-1369465866-5050-26096-52332715
>>>> 14/06/17 10:57:36 INFO TaskSetManager: Re-queueing tasks for
>>>> 201311011608-1369465866-5050-9189-46 from TaskSet 0.0
>>>> 14/06/17 10:57:36 WARN TaskSetManager: Lost TID 22 (task 0.0:2)
>>>> 14/06/17 10:57:36 WARN TaskSetManager: Lost TID 19 (task 0.0:0)
>>>> 14/06/17 10:57:36 WARN TaskSetManager: Lost TID 21 (task 0.0:1)
>>>> 14/06/17 10:57:36 INFO DAGScheduler: Executor lost:
>>>> 201311011608-1369465866-5050-9189-46 (epoch 0)
>>>> 14/06/17 10:57:36 INFO BlockManagerMasterActor: Trying to remove
>>>> executor 201311011608-1369465866-5050-9189-46 from BlockManagerMaster.
>>>> 14/06/17 10:57:36 INFO BlockManagerMaster: Removed
>>>> 201311011608-1369465866-5050-9189-46 successfully in removeExecutor
>>>> 14/06/17 10:57:36 DEBUG MapOutputTrackerMaster: Increasing epoch to 1
>>>> 14/06/17 10:57:36 INFO DAGScheduler: Host added was in lost list
>>>> earlier: ca1-dcc1-0065.lab.mtl
>>>>
>>>> I don't see any exceptions in the spark executor logs. The only error
>>>> message I found in mesos itself is warnings in the mesos master:
>>>>
>>>> W0617 10:57:36.816748 26100 master.cpp:1615] Failed to validate task 21
>>>> : Task 21 attempted to use cpus(*):1 combined with already used cpus(*):1;
>>>> mem(*):2048 is greater than offered mem(*):3216; disk(*):98304;
>>>> ports(*):[11900-11919, 1192
>>>> 1-11995, 11997-11999]; cpus(*):1
>>>> W0617 10:57:36.819807 26100 master.cpp:1615] Failed to validate task 22
>>>> : Task 22 attempted to use cpus(*):1 combined with already used cpus(*):1;
>>>> mem(*):2048 is greater than offered mem(*):3216; disk(*):98304;
>>>> ports(*):[11900-11919, 1192
>>>> 1-11995, 11997-11999]; cpus(*):1
>>>> W0617 10:57:36.932287 26102 master.cpp:1615] Failed to validate task 28
>>>> : Task 28 attempted to use cpus(*):1 combined with already used cpus(*):1;
>>>> mem(*):2048 is greater than offered cpus(*):1; mem(*):3216; disk(*):98304;
>>>> ports(*):[11900-
>>>> 11960, 11962-11978, 11980-11999]
>>>> W0617 11:05:52.783133 26098 master.cpp:2106] Ignoring unknown exited
>>>> executor 201311011608-1369465866-5050-9189-46 on slave
>>>> 201311011608-1369465866-5050-9189-46 (ca1-dcc1-0065.lab.mtl)
>>>> W0617 11:05:52.787739 26103 master.cpp:2106] Ignoring unknown exited
>>>> executor 201311011608-1369465866-5050-9189-34 on slave
>>>> 201311011608-1369465866-5050-9189-34 (ca1-dcc1-0053.lab.mtl)
>>>> W0617 11:05:52.790292 26102 master.cpp:2106] Ignoring unknown exited
>>>> executor 201311011608-1369465866-5050-9189-59 on slave
>>>> 201311011608-1369465866-5050-9189-59 (ca1-dcc1-0079.lab.mtl)
>>>> W0617 11:05:52.800649 26099 master.cpp:2106] Ignoring unknown exited
>>>> executor 201311011608-1369465866-5050-9189-18 on slave
>>>> 201311011608-1369465866-5050-9189-18 (ca1-dcc1-0027.lab.mtl)
>>>> ... (more of those "Ignoring unknown exited executor")
>>>>
>>>>
>>>> I analyzed the difference in between the execution of the same job in
>>>> coarse-grained mode and fine-grained mode, and I noticed that in the
>>>> fine-grained mode the tasks get executed on executors different than the
>>>> ones reported in spark, as if spark and mesos get out of sync as to which
>>>> executor is responsible for which task. See the following:
>>>>
>>>>
>>>> Coarse-grained mode:
>>>>
>>>>  Spark Mesos Task IndexTask ID ExecutorStatusTask ID (UI)Task Name
>>>> Task ID (logs)ExecutorState 0066SUCCESS 4"Task 4"0 66RUNNING1 159
>>>> SUCCESS0 "Task 0"159 RUNNING22 54SUCCESS10"Task 10" 254RUNNING 33128
>>>> SUCCESS6"Task 6" 3128RUNNING ...
>>>>
>>>> Fine-grained mode:
>>>>
>>>>  Spark Mesos Task IndexTask ID ExecutorTask ID (UI)Task NameTask ID
>>>> (logs) ExecutorState0 23108SUCCESS 23"task 0.0:0"23 27FINISHED0 1965
>>>> FAILED19 "task 0.0:0"1986 FINISHED1 2165FAILED Mesos executor was
>>>> never created124 92SUCCESS24"task 0.0:1" 24129FINISHED 22265 FAILEDMesos
>>>> executor was never created 225100SUCCESS 25"task 0.0:2" 2584FINISHED 3
>>>> 2680SUCCESS 26"task 0.0:3"26 124FINISHED 42765FAILED 27"task 0.0:4"27
>>>> 108FINISHED 42992SUCCESS 29"task 0.0:4"29 65FINISHED 52865FAILED Mesos
>>>> executor was never created5 3077SUCCESS30 "task 0.0:5"3062 FINISHED6 0
>>>> 53SUCCESS0 "task 0.0:6"041 FINISHED7 177SUCCESS1 "task 0.0:7"1114
>>>> FINISHED...
>>>>
>>>>
>>>> Is it normal that the executor reported in spark and mesos to be
>>>> different when running in fine-grained mode?
>>>>
>>>> Please note that in this particular example the job actually succeeded,
>>>> but most of the time it's failing after 4 failed attempts of a given task.
>>>> This job never fails in coarse-grained mode. Every job is working in
>>>> coarse-grained mode and failing the same way in fine-grained mode.
>>>>
>>>> Does anybody have an idea what the problem could be?
>>>>
>>>> Thanks,
>>>>
>>>> - Sebastien
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Problems running Spark job on mesos in fine-grained mode

Reply via email to