On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <ashokkumar.rajend...@gmail.com> wrote:
> Hi Folks,
>
> I am exploring Spark for streaming from two sources, (a) Kinesis and (b)
> HDFS, for some of our use cases. Since we maintain state gathered over the
> last x hours in Spark Streaming, we would like to replay the data from the
> last x hours as batches during deployment. I have gone through the Spark
> APIs but could not find anything that starts from an older timestamp.
> Appreciate your input on the same.
>
>    1. Restarting with checkpointing runs the batches faster for the missed
> period, but when we upgrade with new code, the same checkpoint directory
> cannot be reused.

=> It is true that you won't be able to reuse the checkpoint when you upgrade your code, but production code is not upgraded every now and then. You can basically create a configuration file and keep most of the settings there (streaming duration, parameters, etc.) instead of hard-coding them, so that updating them does not break the checkpoint.

>    2. For the case with Kinesis as the source, we can change the last
> checked sequence number in DynamoDB to get the data from the last x hours,
> but this will be one large bunch of data for the first restarted batch, so
> the data is not processed as natural multiple batches inside Spark.

=> The Kinesis API has a way to limit the data rate; you might want to look into that and implement a custom receiver for your use case:
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html

>    3. For the source from HDFS, I could not find any alternative to start
> the streaming from old timestamp data, unless I manually (or with a script)
> rename the old files after starting the stream. (This workaround leads to
> further complications too.)

=> What you can do is, once the files are processed, move them to a different directory, and when you restart the stream for whatever reason, make it pick up all the files instead of only the latest ones (by passing the *newFilesOnly* boolean param).

>    4. May I know how zero data loss is achieved while having HDFS as the
> source? i.e. if the driver fails while processing a micro-batch, what
> happens when the application is restarted? Is the same micro-batch
> reprocessed?

=> Yes, if the application is restarted then the micro-batch will be reprocessed.

> Regards
> Ashok
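To illustrate the configuration-file approach from point 1, here is a minimal sketch in Python. The file format (JSON) and the parameter names are made up for illustration; the idea is only that the loaded values would be fed into the StreamingContext setup instead of being hard-coded, so tuning them does not require changing (and thus invalidating) the checkpointed code.

```python
import json
import tempfile

# Hypothetical job settings kept outside the code so they can be tuned
# without editing the streaming application itself.
CONFIG = """
{
    "batch_duration_secs": 10,
    "checkpoint_dir": "hdfs:///checkpoints/my-stream",
    "window_duration_secs": 3600
}
"""

def load_streaming_config(path):
    # Read the job parameters from a JSON file; the caller would pass
    # these into StreamingContext creation and the DStream setup.
    with open(path) as f:
        return json.load(f)

# Write the sample config to a temp file and load it back.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(CONFIG)
    path = f.name

conf = load_streaming_config(path)
print(conf["batch_duration_secs"])   # → 10
```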
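To illustrate the rate-limiting idea from point 2 without depending on the actual Kinesis client, here is a toy sketch in Python of a fetch loop that returns at most `limit` records per call, mirroring the Limit parameter of the GetRecords API linked above. The `make_stream` and `get_records` names are invented for this sketch, not real Kinesis calls.

```python
from collections import deque

def make_stream(records):
    # Stand-in for a Kinesis shard: a queue of pending records.
    return deque(records)

def get_records(stream, limit):
    # Mimics GetRecords with its Limit parameter: return at most
    # `limit` records per call, so a large replayed backlog is consumed
    # in batch-sized chunks instead of one huge first batch.
    batch = []
    while stream and len(batch) < limit:
        batch.append(stream.popleft())
    return batch

# A backlog of 10 records replayed as batches of at most 4.
stream = make_stream(range(10))
batches = []
while stream:
    batches.append(get_records(stream, limit=4))

print([len(b) for b in batches])   # → [4, 4, 2]
```

A custom receiver built this way would call `store()` on each bounded batch, letting Spark Streaming process the replayed history as several natural micro-batches.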