We are prototyping an application with Spark Streaming and Kinesis. We use
Kinesis to accept incoming transaction data, and then process it with Spark
Streaming. So far we have really liked both technologies, and both are
maturing rapidly. We have almost settled on using them, but we are a little
worried by this paragraph in the programming guide:

"For network-based data sources like Kafka and Flume, the received input
data is replicated in memory between nodes of the cluster (default
replication factor is 2). So if a worker node fails, then the system can
recompute the lost data from the left over copy of the input data. However,
if the worker node where a network receiver was running fails, then a tiny
bit of data may be lost, that is, the data received by the system but not
yet replicated to other node(s). The receiver will be started on a
different node and it will continue to receive data."

Since our application cannot tolerate losing customer data, I am wondering
what the best way is for us to address this issue.
1) We are thinking of writing application-specific logic to address the data
loss. To us, the problem seems to be that Kinesis receivers advance their
checkpoint before we know for sure the data has been replicated. For
example, we could keep a checkpoint of our own that remembers the Kinesis
sequence number of the data that has actually been processed by Spark
Streaming. When the Kinesis receiver is restarted after a worker failure, we
would restart it from the checkpoint we tracked. We also worry that our
driver program (or the whole cluster) could die because of a bug in the
application; the above logic would also let us resume from our last
checkpoint in that case.
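To make the idea in 1) concrete, here is a minimal sketch of the kind of
sequence-number checkpoint store we have in mind. All names here are
hypothetical, and a local JSON file stands in for whatever shared storage
(e.g. DynamoDB or S3) a real deployment would use so that a restarted
receiver on another node can read it:

```python
import json
import os
import tempfile


class SequenceCheckpointStore:
    """Tracks the last Kinesis sequence number fully processed by Spark
    Streaming, per shard, so a restarted receiver can resume from it
    instead of trusting the receiver's own (too-eager) checkpoint.

    Hypothetical sketch: a local JSON file stands in for shared storage.
    """

    def __init__(self, path):
        self.path = path

    def save(self, shard_id, sequence_number):
        # Write to a temp file and rename, so a crash mid-write never
        # leaves a corrupted checkpoint behind.
        checkpoints = self._load_all()
        checkpoints[shard_id] = sequence_number
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoints, f)
        os.replace(tmp, self.path)

    def resume_position(self, shard_id):
        # Sequence number to restart the receiver from, or None if this
        # shard has never been checkpointed (start from scratch).
        return self._load_all().get(shard_id)

    def _load_all(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
```

The idea would be to call save() only after a Spark Streaming batch has been
fully processed, and on restart feed resume_position() into the Kinesis
GetShardIterator call (with an AFTER_SEQUENCE_NUMBER iterator type) instead
of relying on the receiver's built-in checkpoint.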

Are there any best practices out there for this issue? I suppose many folks
are using Spark Streaming with network receivers; any suggestion is
welcome.
2) Write Kinesis data to S3 first, then either use it as a backup or read
from S3 in Spark Streaming. This is the safest approach, but it comes with a
performance/latency penalty. On the other hand, we may have to write the
data to S3 anyway, since Kinesis only retains data for up to 24 hours, in
case we have a bad day in our server infrastructure.
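For 2), the backup path could be as simple as buffering records and flushing
them to S3 in batches. Below is a hedged sketch of that shape; the class and
key names are made up, and the upload hook is pluggable (in production it
would wrap an S3 PUT via an AWS client library, here it is injected so the
batching logic is easy to test):

```python
import json


class KinesisBackupWriter:
    """Buffers incoming Kinesis records and flushes them to durable
    storage in batches, keeping a copy beyond Kinesis's 24-hour retention.

    Hypothetical sketch: `upload(key, body)` is the storage hook; a real
    deployment would have it perform an S3 PUT to the backup bucket.
    """

    def __init__(self, upload, batch_size=500):
        self.upload = upload
        self.batch_size = batch_size
        self.buffer = []
        self.batches_written = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write one newline-delimited JSON object per record, one object
        # per batch, under a sequentially numbered key.
        if not self.buffer:
            return
        key = "kinesis-backup/batch-%06d.json" % self.batches_written
        body = "\n".join(json.dumps(r) for r in self.buffer)
        self.upload(key, body)
        self.batches_written += 1
        self.buffer = []
```

Spark Streaming could then read the backup back with something like
sc.textFile() over the backup prefix, either to replay lost data or as the
primary input when we want to trade latency for safety.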
3) Wait for this issue to be addressed in Spark Streaming itself. I found
the ticket https://issues.apache.org/jira/browse/SPARK-1647, but it has not
been resolved yet.

Thanks,
Wei
