Hi TD,

Thanks!
So our application does turn on checkpointing, but we do not recover upon application restart (we just blow the checkpoint directory away first and re-create the StreamingContext), as we don't have a real need for that type of recovery. However, because the application does reduceByKeyAndWindow operations, checkpointing has to be turned on. Do you think this scenario will also work only with HDFS, or will local directories suffice?

Thanks
Nikunj

On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com> wrote:

> Shuffle spills will use local disk; HDFS is not needed.
> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
> fault-tolerance, so that that info can be recovered even if the Spark
> cluster nodes go down.
>
> TD
>
> On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
>
>> Hello,
>>
>> We have a Spark Streaming program that is currently running on a single
>> node in "local[n]" master mode. We currently give it local directories
>> for Spark's own state management, etc. The input streams in from the
>> network/Flume and the output also goes out to the network/Kafka, etc.,
>> so the process as such does not need any distributed file system.
>>
>> Now, we do want to start distributing this processing across a few
>> machines and make a real cluster out of it. However, I am not sure if
>> HDFS is a hard requirement for that to happen. I am thinking about the
>> shuffle spills, DStream/RDD persistence, and checkpoint info. Do any of
>> these require state to be shared via HDFS? Are there other alternatives
>> that can be utilized if state sharing is accomplished via the file
>> system only?
>>
>> Thanks
>> Nikunj
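
For concreteness, here is a minimal sketch of the restart pattern described at the top of the thread: wipe the checkpoint directory, then build a fresh StreamingContext rather than recovering one. Everything in it is illustrative (the paths, port, object name, and batch/window durations are placeholders), and it assumes the reduceByKeyAndWindow variant with an inverse reduce function, which is the variant that makes a checkpoint directory mandatory:

    // Sketch of "wipe checkpoints, start fresh" -- not the actual application.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object NoRecoveryRestart {
      def main(args: Array[String]): Unit = {
        // Local today; on a real cluster this would be an hdfs:// path.
        val checkpointDir = "/tmp/app-checkpoints"

        // Blow the old checkpoint directory away. Because the context is
        // built directly (no StreamingContext.getOrCreate), nothing is ever
        // recovered from a previous run.
        val cpPath = new Path(checkpointDir)
        cpPath.getFileSystem(new Configuration()).delete(cpPath, true)

        val conf = new SparkConf().setMaster("local[4]").setAppName("windowed-counts")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Mandatory: reduceByKeyAndWindow with an inverse function keeps
        // state across batches, and Spark refuses to run it without a
        // checkpoint directory.
        ssc.checkpoint(checkpointDir)

        val words = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))

        val windowedCounts = words.reduceByKeyAndWindow(
          (a: Long, b: Long) => a + b, // values entering the window
          (a: Long, b: Long) => a - b, // values leaving the window
          Seconds(60),                 // window length
          Seconds(10))                 // slide interval

        windowedCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same code should run unchanged on a cluster if checkpointDir is switched to an hdfs:// path, which is what TD's fault-tolerance point calls for; shuffle spills go to whatever spark.local.dir points at either way.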