[ https://issues.apache.org/jira/browse/FLINK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Tang closed FLINK-19293.
----------------------------
    Resolution: Information Provided

> RocksDB last_checkpoint.state_size grows endlessly until savepoint/restore
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19293
>                 URL: https://issues.apache.org/jira/browse/FLINK-19293
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / CEP, Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.10.1
>            Reporter: Thomas Wozniakowski
>            Priority: Major
>         Attachments: Screenshot 2020-09-18 at 13.58.30.png
>
>
> Hi Guys,
>
> I am seeing some strange behaviour that may be a bug, or may just be intended.
>
> We are running a Flink job on a 1.10.1 cluster running with 1 JobManager and
> 2 TaskManagers, parallelism 4. The job itself is simple:
>
> # Source: kinesis connector reading from a single shard stream
> # CEP: ~25 CEP Keyed Pattern operators watching the event stream for
>   different kinds of behaviour. They all have ".withinSeconds(xxxx)" applied.
>   Nothing is set up to grow endlessly.
> # Sink: Single operator writing messages to SQS (custom code)
>
> We are seeing the checkpoint size grow constantly until the job is restarted
> using a savepoint/restore. The size continues to grow past the point that the
> ".withinSeconds(xxxx)" limits should cause old data to be discarded. The
> growth is also out of proportion to the general platform growth (which is
> actually trending down at the moment due to COVID).
>
> I've attached a snapshot from our monitoring dashboard below. You can see the
> huge drops in state_size on a savepoint/restore.
>
> Our state configuration is as follows:
>
> Backend: RocksDB
> Mode: EXACTLY_ONCE
> Max Concurrent: 1
> Externalised Checkpoints: RETAIN_ON_CANCELLATION
> Async: TRUE
> Incremental: TRUE
> TTL Compaction Filter enabled: TRUE
>
> We are worried that the CEP library may be leaking state somewhere, leaving
> some objects not cleaned up. Unfortunately I can't share one of these
> checkpoints with the community due to the sensitive nature of the data
> contained within, but if anyone has any suggestions for how I could analyse
> the checkpoints to look for leaks, please let me know.
>
> Thanks in advance for the help

--
This message was sent by Atlassian Jira
(v8.3.4#803005)