Hello -

I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers here
for the great product!

In my use case, I am running Spark jobs to extract data from some raw data.
Generally this works quite well.

However, I am noticing that for certain data sets a few tasks run extremely
long -- i.e. 8-12x longer than a normal task. I don't actually need the data
from these stragglers, so I am writing to ask: is there a way to kill tasks
that take significantly longer than the rest and simply accept that no data
will be found from them?

I have read
http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html
and I know about spark.speculation. However, I think my use case is different:
I don't want the slow tasks relaunched -- I just want to decide that a task
has run too long, kill it, and move on.
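For reference, the speculation knobs I looked at are along these lines (the
values are just illustrative). As I understand it, speculation launches
duplicate copies of slow tasks on other executors rather than killing them,
which is exactly why it doesn't fit here:

    from pyspark import SparkConf, SparkContext

    # Speculation relaunches slow tasks elsewhere instead of killing them.
    conf = (SparkConf()
            .setAppName("extract")
            .set("spark.speculation", "true")
            .set("spark.speculation.multiplier", "8")    # a task 8x slower than the median counts as slow
            .set("spark.speculation.quantile", "0.75"))  # only checked after 75% of the stage's tasks finish
    sc = SparkContext(conf=conf)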

So in effect, I'd like to time out a task but still collect the data from the
remaining tasks.
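
For what it's worth, the closest workaround I've sketched is a per-record
timeout inside my extraction function, roughly as below. This only helps if
the stragglers come from individual pathological records rather than skewed
partitions, extract_record and raw_rdd are placeholders for my own code, and
I'm assuming signal.alarm is usable inside the Python worker processes -- I
haven't verified any of that:

    import signal

    TIMEOUT_SECS = 60  # rough per-record budget; would need tuning

    class RecordTimeout(Exception):
        pass

    def _raise_timeout(signum, frame):
        raise RecordTimeout()

    def extract_record(record):
        # placeholder for my real extraction logic
        return record

    def extract_with_timeout(record):
        signal.signal(signal.SIGALRM, _raise_timeout)
        signal.alarm(TIMEOUT_SECS)            # arm a timer for this record
        try:
            return [extract_record(record)]
        except RecordTimeout:
            return []                         # give up on this record -- no data
        finally:
            signal.alarm(0)                   # always cancel any pending alarm

    # flatMap drops the timed-out records and keeps everything else:
    # results = raw_rdd.flatMap(extract_with_timeout).collect()

But that is per-record rather than per-task, so it doesn't really answer my
question.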

Does anyone have advice on how I can time out / kill these stragglers -- and
keep the data from the remaining tasks?

Thanks!


- Bill
