Hello Flink Community,
We are running jobs on Flink 1.12.7 that read from Kafka, apply some rules (stored in broadcast state), and then write back to Kafka. This is a very low-latency, high-throughput pipeline, and we have set up at-least-once semantics.

Checkpoint configuration used:
1. We cannot tolerate many duplicates during restarts, so we have set a checkpoint interval of 3s. (We cannot increase it any further, since we process tens of thousands of records per second.)
2. The checkpoint target location is AWS S3.
3. Max concurrent checkpoints is 1.
4. Minimum time between checkpoints is 500ms.

Earlier we had around 10 rule objects stored in broadcast state; recently we enabled 80 rule objects. Since the increase, we are seeing a lot of checkpoints in progress. (Earlier we had rarely seen this in the metrics dashboard.) The parallelism of the broadcast function is around 10, and the present checkpoint size is 64 KB.

Since we expect the rule objects to grow to 1000 and beyond within a year's time, we are looking at ways to improve checkpoint performance. We cannot use incremental checkpoints, since they are supported only by the RocksDB state backend and the development effort is a little higher. Looking at easier solutions first, we tried using "SnapshotCompression", but we did not see any decrease in checkpoint size.

We have a few questions on this:
1. Does SnapshotCompression work in version 1.12.7?
2. If yes, how much size reduction could we expect when it is enabled, and at what state size does the compression kick in? Is there a threshold below which compression does not apply?

Apart from the questions above, you are welcome to suggest any config changes that could bring improvements.

Thanks & Regards,
Prasanna
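P.S. For reference, here is roughly how our checkpoint settings above map onto flink-conf.yaml entries in Flink 1.12 (the S3 path is a placeholder, and I am assuming "time between checkpoints" corresponds to the min-pause option):

```yaml
# Checkpoint every 3 seconds (at-least-once is the job's delivery guarantee)
execution.checkpointing.interval: 3s
# Only one checkpoint in flight at a time
execution.checkpointing.max-concurrent-checkpoints: 1
# Minimum pause between the end of one checkpoint and the start of the next
execution.checkpointing.min-pause: 500ms
# Checkpoint target location (placeholder bucket/path)
state.checkpoints.dir: s3://<bucket>/checkpoints
```

Snapshot compression itself we enabled programmatically via the ExecutionConfig (env.getConfig().setUseSnapshotCompression(true)); as far as we can tell there is no flink-conf.yaml key for it in 1.12.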