Hello - I am using Apache Spark 1.2.1 via pyspark. Thanks to any developers here for the great product!
In my use case, I am running Spark jobs to extract data from some raw data, and generally this works quite well. However, I am noticing that for certain data sets there are tasks that run extremely long -- 8-12x longer than a normal task. I don't actually need the data from these long-running tasks, so my question is: is there a way to kill tasks that take significantly longer than the rest and just accept that no data will come from them? (There is a rough sketch of the job in the P.S. below.)

I have read http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html and I know about spark.speculation. However, I think my use case is different in that I don't want those tasks restarted -- I just want to accept that a given task has run too long, kill it, and move on. In effect, I'd like to time out the task but still collect the data from the remaining tasks.

Does anyone have advice on how I can time out / kill these stragglers while keeping the data from the remaining tasks?

Thanks!
- Bill
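
P.S. For context, here is roughly the shape of the job (a minimal sketch only -- extract_fields, the app name, and the input path are placeholders, not my actual code):

from pyspark import SparkContext

sc = SparkContext(appName="extract-example")

def extract_fields(line):
    # Placeholder for my real parsing logic; on some inputs this
    # ends up doing far more work than on others.
    parts = line.split("\t")
    return parts if len(parts) > 1 else None

raw = sc.textFile("hdfs:///path/to/raw_data")  # placeholder input path
parsed = raw.map(extract_fields).filter(lambda r: r is not None)

# collect() blocks until every task in the final stage finishes,
# which is why a single 8-12x straggler holds up the whole job.
results = parsed.collect()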