Hi TD,

That may very well have been the case. There may be some delay on our output side. I have made a change, just for testing, that sends the output nowhere. I will see if that helps get rid of these errors. Then we can try to find out how we can optimize so that we do not lag.
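[Editor's note: for context, a "send the output nowhere" test of this kind might look roughly like the Scala sketch below. The object name, input source, and batch interval are placeholders, not the actual application code; the point is only that the DStream is still fully computed while its results are discarded instead of being written to the real sink.]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical test harness: names, source, and batch interval are placeholders.
    object NoOpOutputTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("no-op-output-test")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder input; the real application's input DStream goes here.
        val lines = ssc.socketTextStream("localhost", 9999)

        // An output operation is still required to trigger execution, but the
        // results are thrown away instead of being written anywhere.
        lines.foreachRDD { rdd =>
          rdd.foreach(_ => ())
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }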
Questions: How can we ever be sure that a lag, even a temporary one, never occurs in the future? Also, shouldn't Spark avoid cleaning up any temp files that it knows it might still need in the (near?) future?

Thanks
Nikunj

On Thu, Apr 23, 2015 at 12:29 PM, Tathagata Das <t...@databricks.com> wrote:

> What was the state of your streaming application? Was it falling behind
> with a large, increasing scheduling delay?
>
> TD
>
> On Thu, Apr 23, 2015 at 11:31 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Thanks for the response, Conor. I tried with those settings and for a
>> while it seemed like it was cleaning up shuffle files after itself.
>> However, exactly 5 hours later it started throwing exceptions and
>> eventually stopped working again. A sample stack trace is below. What is
>> curious about 5 hours is that I set the cleaner ttl to 5 hours after
>> changing the max window size to 1 hour (down from 6 hours, in order to
>> test). It also stopped cleaning the shuffle files after this started
>> happening.
>>
>> Any idea why this could be happening?
>>
>> 2015-04-22 17:39:52,040 ERROR Executor task launch worker-989
>> Executor.logError - Exception in task 0.0 in stage 215425.0 (TID 425147)
>> java.lang.Exception: Could not compute split, block input-0-1429706099000 not found
>>     at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> Thanks
>> NB
>>
>> On Tue, Apr 21, 2015 at 5:14 AM, Conor Fennell <conor.fenn...@altocloud.com> wrote:
>>
>>> Hi,
>>>
>>> We set the spark.cleaner.ttl to some reasonable time and also
>>> set spark.streaming.unpersist=true.
>>>
>>> Those together cleaned up the shuffle files for us.
>>>
>>> -Conor
>>>
>>> On Tue, Apr 21, 2015 at 8:18 AM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> We already do have a cron job in place to clean just the shuffle files.
>>>> However, what I would really like to know is whether there is a "proper"
>>>> way of telling Spark to clean up these files once it's done with them?
>>>>
>>>> Thanks
>>>> NB
>>>>
>>>> On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>>>
>>>>> Write a cron job for this, like the entries below:
>>>>>
>>>>> 12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
>>>>> 32 * * * * find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>>>> 52 * * * * find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>>>>
>>>>> On 20 April 2015 at 23:12, N B <nb.nos...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I had posed this query as part of a different thread but did not get
>>>>>> a response there, so I am creating a new thread hoping to catch someone's
>>>>>> attention.
>>>>>>
>>>>>> We are experiencing this issue of shuffle files being left behind and
>>>>>> not being cleaned up by Spark. Since this is a Spark Streaming application,
>>>>>> it is expected to stay up indefinitely, so shuffle files not being cleaned
>>>>>> up is a big problem right now. Our max window size is 6 hours, so we have
>>>>>> set up a cron job to clean up shuffle files older than 12 hours; otherwise
>>>>>> they will eat up all our disk space.
>>>>>>
>>>>>> Please see the following. It seems the non-cleaning of shuffle files
>>>>>> is being documented in 1.3.1:
>>>>>>
>>>>>> https://github.com/apache/spark/pull/5074/files
>>>>>> https://issues.apache.org/jira/browse/SPARK-5836
>>>>>>
>>>>>> Also, for some reason, the following JIRAs that were reported as
>>>>>> functional issues were closed as duplicates of the above documentation bug.
>>>>>> Does this mean that this issue won't be tackled at all?
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-3563
>>>>>> https://issues.apache.org/jira/browse/SPARK-4796
>>>>>> https://issues.apache.org/jira/browse/SPARK-6011
>>>>>>
>>>>>> Any further insight into whether this is being looked into, and how to
>>>>>> handle shuffle files in the meantime, will be greatly appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> NB
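[Editor's note: the settings Conor mentions above (spark.cleaner.ttl and spark.streaming.unpersist) are normally set on the SparkConf in the driver before the StreamingContext is created. A minimal sketch follows; the application name, ttl value, batch interval, and input/output are placeholder assumptions, and the ttl in particular should comfortably exceed the largest window plus any scheduling/processing delay, or still-needed blocks can be dropped, as in the "Could not compute split" error in the thread.]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CleanerSettingsSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder values: the ttl (in seconds) should be well above the
        // largest window duration plus any scheduling/processing delay,
        // otherwise blocks and shuffle data may be removed while still needed.
        val conf = new SparkConf()
          .setAppName("cleaner-settings-sketch")
          .set("spark.streaming.unpersist", "true")       // unpersist generated RDDs once out of scope
          .set("spark.cleaner.ttl", (8 * 3600).toString)  // periodic metadata cleanup, in seconds

        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder input and output so the context has something to run.
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

The same settings can also be supplied at submit time without touching application code, e.g. `--conf spark.cleaner.ttl=28800 --conf spark.streaming.unpersist=true` on `spark-submit`.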