We are prototyping an application with Spark Streaming and Kinesis. We use Kinesis to accept incoming transaction data and then process it with Spark Streaming. So far we have really liked both technologies and have seen both maturing rapidly. We have almost settled on using the two, but we were a little scared by this paragraph in the programming guide:
"For network-based data sources like Kafka and Flume, the received input data is replicated in memory between nodes of the cluster (default replication factor is 2). So if a worker node fails, then the system can recompute the lost from the the left over copy of the input data. However, if the worker node where a network receiver was running fails, then a tiny bit of data may be lost, that is, the data received by the system but not yet replicated to other node(s). The receiver will be started on a different node and it will continue to receive data." Since our application cannot tolerate losing customer data, I am wondering what is the best way for us to address this issue. 1) We are thinking writing application specific logic to address the data loss. To us, the problem seems to be caused by that Kinesis receivers advanced their checkpoint before we know for sure the data is replicated. For example, we can do another checkpoint ourselves to remember the kinesis sequence number for data that has been processed by spark streaming. When Kinesis receiver is restarted due to worker failures, we restarted it from the checkpoint we tracked. We also worry about our driver program (or the whole cluster) dies because of a bug in the application, the above logic will allow us to resume from our last checkpoint. Is there any best practices out there for this issue? I suppose many folks are using spark streaming with network receivers, any suggestion is welcomed. 2) Write kinesis data to s3 first, then either use it as a backup or read from s3 in spark streaming. This is the safest approach but with a performance/latency penalty. On the other hand, we may have to write data to s3 anyway since Kinesis only stores up to 24 hours data just in case we had a bad day in our server infrastructure. 3) Wait for this issue to be addressed in spark streaming. I found this ticket https://issues.apache.org/jira/browse/SPARK-1647, but it is not resolved yet. Thanks, Wei