Hello - 

I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers here for
the great product!

In my use case, I run Spark jobs to extract records from raw data.
Generally this works quite well.

However, I am noticing that for certain data sets a few tasks run
extremely long -- i.e. 8-12x longer than a normal task. I don't actually
need the data from these long-running tasks, so I am writing today to ask:
is there a way to kill tasks that take significantly more time than the
rest and just accept that no data will come back from them?

I have read:
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html
and know about spark.speculation. However, I think my use case is
different in that I don't want the tasks restarted. I just want to decide
that a task has been running too long, kill it, and move on.

So in effect, I'd like to time out the task, but still collect the data
from the remaining tasks.
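
To make the desired behaviour concrete, here is a minimal sketch in plain
Python of a per-partition time budget. It is only an illustration under
stated assumptions, not a Spark feature: extract_with_budget, extract,
budget_seconds and clock are hypothetical names, and the deadline is only
checked between records, so one very slow record would still run to
completion.

```python
import time


def extract_with_budget(records, extract, budget_seconds, clock=time.time):
    """Apply `extract` to each record until the time budget is spent.

    Hypothetical helper, not part of Spark. The deadline is checked
    between records only, so a single slow call to `extract` cannot be
    interrupted; records after the deadline are silently dropped.
    """
    deadline = clock() + budget_seconds
    for record in records:
        if clock() >= deadline:
            # Budget exhausted: give up on the rest of this partition.
            break
        yield extract(record)


# Possible use on an existing RDD, assuming `raw` is the input RDD and
# `parse` is the per-record extraction function (both hypothetical):
# kept = raw.mapPartitions(lambda part: extract_with_budget(part, parse, 60))
```

Since this runs inside the task rather than killing it from outside, the
task still finishes normally and returns whatever it produced before the
budget ran out.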

Does anyone have any advice on how I can time out / kill these stragglers
and keep the remaining data?

Thanks!


- Bill 


