We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we enable checkpointing in Spark, where in the Kinesis stream should a restarted driver continue? I ran a simple experiment as follows:
1. In the first driver run, the Spark driver processes 1 million records starting from InitialPositionInStream.TRIM_HORIZON in 5-second batch intervals, with the Kinesis receiver checkpoint interval set to 10 seconds. (This interval was purposely set low to make it easier to see where a restarted driver picks up.)
2. We stop pushing events to the Kinesis stream and wait until the driver has been pulling zero events for a few minutes. The first driver is then killed manually with "yarn application -kill".
3. The driver is launched a second time, and the logs show that it successfully restored from the DFS checkpoint directory.

Because the first driver had completely processed all the entries in the stream, I would expect the second driver to pick up at the end of the stream, or at worst at the start of the last 10-second checkpoint interval. However, the second driver launch (and each subsequent launch) re-processes about 30 seconds' worth of events (roughly 100,000), and this amount does not appear to be related to the Kinesis checkpoint interval.

Also, with a Kinesis driver, does it make sense to use Write Ahead Logs and incur the cost of writing to DFS, when you could remember the second-to-last checkpoint and simply re-fetch and reprocess directly from the stream?

Any input is highly appreciated.

Thanks,
Heji
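
P.S. For reference, the driver setup in the experiment above can be sketched roughly as follows. This is a minimal sketch against the Spark 1.5 Scala API, not our actual job: the application name, stream name, endpoint, region, checkpoint path, and the count().print() output step are all placeholders I made up for illustration. Only the 5-second batch interval, the 10-second Kinesis checkpoint interval, and TRIM_HORIZON come from the experiment.

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object KinesisCheckpointExperiment {
  // Placeholder values (assumptions for illustration, not our real settings).
  val checkpointDir = "hdfs:///tmp/kinesis-checkpoint-experiment"
  val appName       = "kinesis-checkpoint-experiment"
  val streamName    = "example-stream"
  val endpointUrl   = "https://kinesis.us-east-1.amazonaws.com"
  val regionName    = "us-east-1"

  // Builds a fresh context; only called when no DFS checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName(appName)
    // The WAL question refers to spark.streaming.receiver.writeAheadLog.enable,
    // which is left at its default in this sketch.
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second batch interval
    ssc.checkpoint(checkpointDir)                    // Spark (DFS) checkpointing

    val stream = KinesisUtils.createStream(
      ssc, appName, streamName, endpointUrl, regionName,
      InitialPositionInStream.TRIM_HORIZON, // used only when no Kinesis checkpoint exists
      Seconds(10),                          // Kinesis receiver checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)

    // Placeholder processing: print the number of records in each batch.
    stream.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On relaunch, the context is restored from the checkpoint directory if present.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that two separate checkpoints are involved here: the Spark checkpoint written to the DFS directory, and the Kinesis sequence-number checkpoint that the receiver records every 10 seconds. My question is about how these two interact on restart.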
