Hi TD,

Thanks!
So our application does turn on checkpointing, but we do not recover upon application restart (we just blow the checkpoint directory away first and re-create the StreamingContext), as we don't have a real need for that type of recovery. However, because the application does reduceByKeyAndWindow operations, checkpointing has to be turned on. Do you think this scenario will also work only with HDFS, or will local directories suffice?

Thanks
Nikunj

On Fri, Sep 4, 2015 at 3:09 PM, Tathagata Das <t...@databricks.com> wrote:

> Shuffle spills will use local disk; HDFS is not needed.
> Spark and Spark Streaming checkpoint info WILL NEED HDFS for
> fault-tolerance, so that that info can be recovered even if the Spark
> cluster nodes go down.
>
> TD
>
> On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
>
>> Hello,
>>
>> We have a Spark Streaming program that is currently running on a single
>> node in "local[n]" master mode. We currently give it local directories
>> for Spark's own state management, etc. The input streams in from the
>> network/Flume and the output also goes out to the network/Kafka, etc.,
>> so the process as such does not need any distributed file system.
>>
>> Now, we do want to start distributing this processing across a few
>> machines and make a real cluster out of it. However, I am not sure if
>> HDFS is a hard requirement for that to happen. I am thinking about the
>> shuffle spills, DStream/RDD persistence, and checkpoint info. Do any of
>> these require state to be shared via HDFS? Are there other alternatives
>> that can be utilized if state sharing is accomplished via the file
>> system only?
>>
>> Thanks
>> Nikunj
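
For concreteness, here is a minimal sketch of the restart pattern described at the top of the thread: wipe the checkpoint directory, then build a fresh StreamingContext rather than recovering one. Everything in it is illustrative (the paths, port, object name, and batch/window durations are placeholders), and it assumes the reduceByKeyAndWindow variant with an inverse reduce function, which is the variant that makes a checkpoint directory mandatory:

    // Sketch of "wipe checkpoints, start fresh" -- not the actual application.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object NoRecoveryRestart {
      def main(args: Array[String]): Unit = {
        // Local today; on a real cluster this would be an hdfs:// path.
        val checkpointDir = "/tmp/app-checkpoints"

        // Blow the old checkpoint directory away. Because the context is
        // built directly (no StreamingContext.getOrCreate), nothing is ever
        // recovered from a previous run.
        val cpPath = new Path(checkpointDir)
        cpPath.getFileSystem(new Configuration()).delete(cpPath, true)

        val conf = new SparkConf().setMaster("local[4]").setAppName("windowed-counts")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Mandatory: reduceByKeyAndWindow with an inverse function keeps
        // state across batches, and Spark refuses to run it without a
        // checkpoint directory.
        ssc.checkpoint(checkpointDir)

        val words = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))

        val windowedCounts = words.reduceByKeyAndWindow(
          (a: Long, b: Long) => a + b, // values entering the window
          (a: Long, b: Long) => a - b, // values leaving the window
          Seconds(60),                 // window length
          Seconds(10))                 // slide interval

        windowedCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same code should run unchanged on a cluster if checkpointDir is switched to an hdfs:// path, which is what TD's fault-tolerance point calls for; shuffle spills go to whatever spark.local.dir points at either way.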