Your observation is correct! The current implementation of checkpointing to
DynamoDB is tied to the presence of new data from Kinesis (I think that
emulates the KCL behavior). If there is no data for a while, checkpointing
does not occur, which explains what you are seeing.
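
To illustrate, here is a rough sketch of the shape of that behavior (this is
not the actual Spark or KCL source; all names below are made up):

class HypotheticalRecordProcessor(checkpointIntervalMs: Long) {
  private var lastCheckpointMs = System.currentTimeMillis()

  // Invoked by the receiver loop ONLY when Kinesis returns new records,
  // so an idle stream never reaches the checkpoint call below.
  def onRecords(records: Seq[Array[Byte]], checkpoint: () => Unit): Unit = {
    records.foreach(store)
    val now = System.currentTimeMillis()
    if (now - lastCheckpointMs >= checkpointIntervalMs) {
      checkpoint()            // the DynamoDB checkpoint happens here
      lastCheckpointMs = now
    }
  }

  private def store(record: Array[Byte]): Unit = () // stand-in for block storage
}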

I have filed a JIRA to fix this -
https://issues.apache.org/jira/browse/SPARK-11359
The fix should be available in 1.6.
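
For reference, the checkpoint interval in question is the one passed to
KinesisUtils.createStream. A minimal Spark 1.5 sketch matching the setup you
describe below (app/stream names and the endpoint are illustrative):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val conf = new SparkConf().setAppName("kinesis-checkpoint-test")
val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batch interval

val stream = KinesisUtils.createStream(
  ssc,
  "kinesis-checkpoint-test",                       // KCL app name = DynamoDB table name
  "my-stream",
  "https://kinesis.us-east-1.amazonaws.com",
  "us-east-1",
  InitialPositionInStream.TRIM_HORIZON,
  Seconds(10),                                     // Kinesis (DynamoDB) checkpoint interval
  StorageLevel.MEMORY_AND_DISK_2)

Note that this interval only controls how often the receiver writes its
position to DynamoDB; it is separate from Spark's own DFS checkpointing
enabled via StreamingContext.checkpoint(...).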


On Tue, Oct 27, 2015 at 4:09 PM, Hster Geguri <hster.investiga...@gmail.com>
wrote:

> We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we
> enable checkpointing in Spark, where in the Kinesis stream should a
> restarted driver resume? I ran a simple experiment as follows:
>
> 1. In the first driver run, the Spark driver processes 1 million records
> starting from InitialPositionInStream.TRIM_HORIZON, with a 5-second batch
> interval and the Kinesis receiver checkpoint interval set to 10 seconds.
> (The checkpoint interval is purposely set low to see where a restarted
> driver picks up.)
>
> 2. We stop pushing events to the Kinesis stream until the driver has pulled
> zero events for a few minutes. Then the first driver is killed manually with
> "yarn application -kill".
>
> 3. The driver is relaunched and the logs show it successfully restored from
> the DFS checkpoint directory. Because the first driver had completely
> processed all the entries in the stream, I would expect the second driver
> to pick up at the end of the stream, or at minimum within the last
> 10-second checkpoint window. However, the second driver (and every
> subsequent relaunch) re-processes about 30 seconds' worth of events
> (roughly 100,000), which does not appear to be related to the Kinesis
> checkpoint interval.
>
> Also, with a Kinesis driver, does it make sense to use Write Ahead Logs and
> incur the cost of writing to DFS, when you could remember the
> previous-to-last checkpoint and simply refetch/reprocess the data directly
> from the stream?
>
> Any input is highly appreciated.
>
> Thanks,
> Heji
>
