Re: Errors checkpointing to S3 for high-scale jobs

2018-05-16 Thread Stephan Ewen
For posterity: Here is the Jira Issue that tracks this: https://issues.apache.org/jira/browse/FLINK-9061

On Thu, Mar 22, 2018 at 11:46 PM, Jamie Grier wrote:
> I think we need to modify the way we write checkpoints to S3 for high-scale
> jobs (those with many total tasks). The issue is that we

Errors checkpointing to S3 for high-scale jobs

2018-03-22 Thread Jamie Grier
I think we need to modify the way we write checkpoints to S3 for high-scale jobs (those with many total tasks). The issue is that we are writing all the checkpoint data under a common key prefix. This is the worst-case scenario for S3 performance, since the key is used as a partition key. In the
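
To illustrate the common-prefix problem described above, here is a minimal sketch of one general mitigation: putting a short hash at the front of each object key so that writes spread across many key prefixes. This is only an illustration under stated assumptions, not necessarily the change proposed in this thread or in the Jira issue. The function name `entropy_prefixed_key`, the `chk-<id>` path layout, and the example base path are all hypothetical.

```python
import hashlib


def entropy_prefixed_key(base_prefix: str, checkpoint_id: int,
                         task_file: str, hash_len: int = 4) -> str:
    """Return an object key that starts with a short hash of its logical path.

    All names and the path layout here are hypothetical, chosen for
    illustration only.
    """
    # The logical path is what every task would otherwise share as a prefix.
    logical_path = f"{base_prefix}/chk-{checkpoint_id}/{task_file}"
    # A deterministic hash keeps the key reproducible from the logical path.
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()[:hash_len]
    # Leading with the hash makes the first characters differ between files.
    return f"{digest}/{logical_path}"


# Keys for 1000 hypothetical task files of a single checkpoint.
keys = [
    entropy_prefixed_key("flink/checkpoints/job-a", 42, f"task-{i}")
    for i in range(1000)
]
# Collect the distinct leading segments to see how widely the keys spread.
leading_segments = {key.split("/", 1)[0] for key in keys}
```

Without the hash, every key would begin with `flink/checkpoints/job-a/chk-42/`. With it, the keys begin with many different segments. The trade-off is that listing all files of one checkpoint by prefix no longer works directly, so the writer has to record the full keys somewhere.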