Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and
the worker nodes eventually run out of inodes.
We see tons of old shuffle_*.data and *.index files that are never deleted.
How do we get Spark to remove these files?
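To confirm it really is inode exhaustion rather than raw disk space, a quick (non-Spark-specific) check on a worker is:

```shell
# Show inode usage per filesystem; an IUse% near 100% on the disk holding
# Spark's local dirs confirms inodes, not bytes, are running out.
df -i
```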

We have a simple standalone app with one RabbitMQ receiver and a two-node
cluster (2 x r3.large AWS instances).
The batch interval is 10 minutes, after which we process the data and write
the results to a DB. No windowing or state management is used.

I've pored over the documentation and tried setting the following
properties, but they have not helped.
As a workaround we're running a cron script that periodically cleans up old
files, but that has a bad smell to it.
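For what it's worth, the cron workaround is roughly the following (the directory layout and age threshold are assumptions — point it at whatever SPARK_LOCAL_DIRS resolves to on your workers):

```shell
# Hypothetical cleanup helper -- directory and age threshold are assumptions;
# adjust to match SPARK_LOCAL_DIRS / the worker work directory on your nodes.
cleanup_spark_work() {
  dir="$1"             # e.g. /tmp or the worker's local/work directory
  age_min="${2:-120}"  # delete files older than this many minutes
  # Remove stale shuffle data/index files; -mmin +N matches files whose
  # modification time is more than N minutes ago.
  find "$dir" -type f \( -name 'shuffle_*.data' -o -name 'shuffle_*.index' \) \
    -mmin +"$age_min" -delete
}
```

Scheduled from cron every half hour or so. The obvious risk is that an overly aggressive threshold deletes shuffle files a running batch still needs.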

SPARK_WORKER_OPTS in spark-env.sh on every worker node
  spark.worker.cleanup.enabled true
  spark.worker.cleanup.interval
  spark.worker.cleanup.appDataTtl
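Concretely, set like this in spark-env.sh (the interval/TTL values here are illustrative, not settings that fixed anything). One thing worth noting: the standalone-mode docs say this cleanup only removes directories of *stopped* applications, so it may never fire for an always-running streaming app:

```shell
# spark-env.sh on each worker node -- interval/TTL values are illustrative.
# Per the standalone-mode docs, this cleanup only removes directories of
# stopped applications, so it may not help a long-running streaming job.
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.cleanup.interval=1800 \
 -Dspark.worker.cleanup.appDataTtl=86400"
export SPARK_WORKER_OPTS
```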

Also tried on the driver side:
  spark.cleaner.ttl
  spark.shuffle.consolidateFiles true
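For reference, the driver-side equivalents as a spark-defaults.conf fragment (the TTL value is illustrative; spark.cleaner.ttl periodically clears old metadata and shuffle data, while spark.shuffle.consolidateFiles only reduces how many shuffle files get created):

```
# conf/spark-defaults.conf -- TTL is in seconds, value illustrative
spark.cleaner.ttl                3600
spark.shuffle.consolidateFiles   true
```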



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
