Shuffle spills will use local disk; HDFS is not needed for them.
Spark and Spark Streaming checkpoint info WILL NEED HDFS for
fault tolerance, so that the checkpointed state can be recovered even if the
Spark cluster nodes go down.
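
For example, here is a minimal sketch of the relevant wiring (Scala, Spark 1.x
API); the app name, local directory, HDFS path, namenode host, and batch
interval are placeholders, not values from this thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Shuffle spills go to local disk on each node; no HDFS involved.
    val conf = new SparkConf()
      .setAppName("StreamingApp")
      .set("spark.local.dir", "/mnt/spark-local")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Checkpoint data must live on a fault-tolerant file system such as
    // HDFS so it survives the loss of individual cluster nodes.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints/streaming-app")

    ssc.start()
    ssc.awaitTermination()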

TD

On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:

> Hello,
>
> We have a Spark Streaming program that is currently running on a single
> node in "local[n]" master mode. We currently give it local directories for
> Spark's own state management etc. The input is streaming from network/flume
> and the output is also to network/kafka etc., so the process as such does not
> need any distributed file system.
>
> Now, we do want to start distributing this processing across a few machines
> and make a real cluster out of it. However, I am not sure if HDFS is a hard
> requirement for that to happen. I am thinking about the Shuffle spills,
> DStream/RDD persistence and checkpoint info. Do any of these require the
> state to be shared via HDFS? Are there other alternatives that can be
> utilized if state sharing is accomplished via the file system only?
>
> Thanks
> Nikunj
>
>
