Hello all,
We have read in multiple sources that Flink has been used for use cases
with terabytes of application state:
- https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
- https://flink.apache.org/usecases.html
We are considering using Flink for a similar use case with *keyed state in
the range of 20 to 30 TB*. We have a few questions about this.
- *Is Flink a good option for this scale of data?* We are considering
using RocksDB as the state backend.
- *What happens when we want to add a node to the cluster?*
   - As per our understanding, if we have 10 nodes in our cluster with
     20 TB of state, adding a node would require the entire 20 TB of data
     to be shipped again from the external checkpoint remote storage to
     the taskmanager nodes.
   - Assuming a 1 GB/s network, and assuming all nodes can read their
     respective 2 TB of state in parallel, this would mean a *minimum
     downtime of roughly half an hour* (2 TB / 1 GB/s ≈ 2,000 s). And
     this assumes the throughput of the remote storage does not become
     the bottleneck.
- Is there any way to reduce this estimated downtime?
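For concreteness, our back-of-envelope estimate can be sketched as follows
(the function name, the decimal TB/GB units, and the perfectly-even state
split are our own assumptions, not anything Flink guarantees):

```python
# Back-of-envelope estimate of state-restore time when rescaling.
# All figures come from our scenario above; none are measurements.

def restore_time_seconds(total_state_tb: float,
                         nodes: int,
                         per_node_gbytes_per_s: float) -> float:
    """Seconds for each node to pull its shard of state, assuming an
    even split across nodes and that remote-storage throughput is not
    the bottleneck."""
    per_node_gb = (total_state_tb / nodes) * 1000  # 1 TB = 1000 GB
    return per_node_gb / per_node_gbytes_per_s

# 20 TB across 10 nodes at 1 GB/s per node -> 2000 s (~33 minutes)
t = restore_time_seconds(20, 10, 1.0)
print(f"{t:.0f} s (~{t / 60:.0f} minutes)")
```

Doubling per-node throughput or the node count halves the estimate, which
is why we are asking whether the full-state transfer can be avoided at all.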
Thank you!