Hi Josh, Checkpoints that take longer than the checkpoint interval should not be an issue (if you use an up-to-date version of Flink). The checkpoint coordinator will not issue another checkpoint while another one is still ongoing. Is there maybe some additional data for the crashes? A log perhaps?
Regarding upload speed, yes, each instance of an operator is responsible for uploading its state so if state is equally distributed between operators on TaskManagers that would mean that each TaskManager would upload roughly the same amount of state. It might be interesting to see what the raw upload speed is when you have those to VMs upload to S3, if it is a lot larger than the speed you're seeing something would be wrong and we should investigate. One last thing: are you using the "fully async" mode of RocksDB? I think I remember that you do, just checking. If it is indeed a problem of upload speed to S3 per machine then yes, using more instances should speed up checkpointing. About incremental checkpoints: they're not going to make it into 1.2 with the current planning but after that, I don't know yet. Cheers, Aljoscha On Mon, 24 Oct 2016 at 19:06 Josh <jof...@gmail.com> wrote: Hi all, I'm running Flink on EMR/YARN with 2x m3.xlarge instances and am checkpointing a fairly large RocksDB state to S3. I've found that when the state size hits 10GB, the checkpoint takes around 6 minutes, according to the Flink dashboard. Originally my checkpoint interval was 5 minutes for the job, but I've found that the YARN container crashes (I guess because the checkpoint time is greater than the checkpoint interval), so have now decreased the checkpoint frequency to every 10 minutes. I was just wondering if anyone has any tips about how to reduce the checkpoint time. Taking 6 minutes to checkpoint ~10GB state means it's uploading at ~30MB/sec. I believe the m3.xlarge instances should have around 125MB/sec network bandwidth each, so I think the bottleneck is S3. Since there are 2 instances, I'm not sure if that means each instance is uploading at 15MB/sec - do the state uploads get shared equally among the instances, assuming the state is split equally between the task managers? If the state upload is split between the instances, perhaps the only way to speed up the checkpoints is to add more instances and task managers, and split the state equally among the task managers? Also just wondering - is there any chance the incremental checkpoints work will be complete any time soon? Thanks, Josh