Shuffle spills will use local disk; HDFS is not needed for those. Spark and Spark Streaming checkpoint info WILL NEED HDFS for fault-tolerance, so that state can be recovered even if the Spark cluster nodes go down.
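As a minimal sketch of what that means in practice (Scala, Spark Streaming API), the checkpoint directory just needs to point at a fault-tolerant filesystem once you move off a single node. The app name, namenode host/port, and paths below are placeholders, not anything specific to your setup:

    // Minimal sketch: checkpointing to HDFS instead of a local directory.
    // Paths, app name, and batch interval are illustrative assumptions.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CheckpointExample")
        val ssc = new StreamingContext(conf, Seconds(10))

        // A local path is fine for local[n] testing, but on a real cluster
        // the driver/executors cannot recover state from one machine's disk.
        // ssc.checkpoint("/tmp/spark-checkpoints")               // single-node only
        ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")  // cluster

        // ... define DStreams, transformations, and output operations here ...

        ssc.start()
        ssc.awaitTermination()
      }
    }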
TD

On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
> Hello,
>
> We have a Spark Streaming program that is currently running on a single
> node in "local[n]" master mode. We currently give it local directories for
> Spark's own state management etc. The input is streaming from network/flume
> and output is also to network/kafka etc, so the process as such does not
> need any distributed file system.
>
> Now, we do want to start distributing this processing across a few machines
> and make a real cluster out of it. However, I am not sure if HDFS is a hard
> requirement for that to happen. I am thinking about the shuffle spills,
> DStream/RDD persistence and checkpoint info. Do any of these require the
> state to be shared via HDFS? Are there other alternatives that can be
> utilized if state sharing is accomplished via the file system only?
>
> Thanks
> Nikunj