Hi Congxian, Thank you so much for your response, that really helps! >From your experience, how long does it take for Flink to redistribute terabytes of state data on node addition / node failure.
Thanks! On Sun, May 3, 2020 at 6:56 PM Congxian Qiu <qcx978132...@gmail.com> wrote: > Hi > > 1. From my experience, Flink can support such big state, you can set > appropriate parallelism for the stateful operator. for RocksDB you may need > to care about the disk performance. > 2. Inside Flink, the state is separated by key-group, each > task/parallelism contains multiple key-groups. Flink does not need to > restart when you add a node to the cluster, but every time restart from > savepoint/checkpoint(or failover), Flink needs to redistribute the > checkpoint data, this can be omitted if it's failover and local recovery[1] > is enabled > 3. for upload/download state, you can ref to the multiple thread > upload/download[2][3] for speed up them > > [1] > https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/large_state_tuning.html#task-local-recovery > [2] https://issues.apache.org/jira/browse/FLINK-10461 > [3] https://issues.apache.org/jira/browse/FLINK-11008 > > Best, > Congxian > > > Gowri Sundaram <gowripsunda...@gmail.com> 于2020年5月1日周五 下午6:29写道: > >> Hello all, >> We have read in multiple >> <https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html> >> sources <https://flink.apache.org/usecases.html> that Flink has been >> used for use cases with terabytes of application state. >> >> We are considering using Flink for a similar use case with* keyed state >> in the range of 20 to 30 TB*. We had a few questions regarding the same. >> >> >> - *Is Flink a good option for this kind of scale of data* ? We are >> considering using RocksDB as the state backend. >> - *What happens when we want to add a node to the cluster *? >> - As per our understanding, if we have 10 nodes in our cluster, >> with 20TB of state, this means that adding a node would require the >> entire >> 20TB of data to be shipped again from the external checkpoint remote >> storage to the taskmanager nodes. >> - Assuming 1Gb/s network speed, and assuming all nodes can read >> their respective 2TB state parallely, this would mean a *minimum >> downtime of half an hour*. And this is assuming the throughput of >> the remote storage does not become the bottleneck. >> - Is there any way to reduce this estimated downtime ? >> >> >> Thank you! >> >