Hi,

We have a long-running job in production and we are trying to understand its metrics (see attached screenshot).

We have enabled incremental checkpoints for this job and we use RocksDB as the state backend. When deployed from fresh state, the initial checkpoint size is about 2.41 GB. I guess most of the contents come from the Table API and from reading a bunch of topics from earliest.

A few data points regarding the graph:

- Full checkpoint size grows from 2.41 GB on September 21st to 11.6 GB on November 14th.
- Last checkpoint size (incremental checkpoint) grows from 232 MB on September 21st to 2.35 GB on November 14th.
- We take incremental checkpoints every 30 seconds.
- The time it takes to take a checkpoint grows from 1.66 seconds on September 21st to 10.85 seconds on November 14th.

A few things we don't understand:

1. Why does the incremental checkpoint size keep increasing? My assumption would be that the deltas are linear, so the incremental checkpoint size would remain around 230 MB, but it keeps increasing over time until it reaches 2.35 GB!

2. The full checkpoint size does not completely make sense. If each incremental checkpoint keeps increasing linearly, I would expect the full checkpoint size to increase much faster, as the full checkpoint size is the sum of all incremental checkpoints and we take an incremental checkpoint every 30 seconds.

3. The time it takes to take a checkpoint correlates with the incremental checkpoint size: the bigger the incremental checkpoint, the longer it takes to store it. What we don't understand is why the incremental checkpoint keeps growing.

Is this something related to Table API internals?

Thanks for any help that could be given!

Regards,
Oscar
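
P.S. To make the second point concrete, here is a rough back-of-the-envelope sketch of what the full checkpoint size would be if it really were the plain sum of every incremental checkpoint. The year in the dates is an assumption (the day count is the same for any year), 1 GB is treated as 1000 MB, and the function names are made up for this illustration.

```python
from datetime import date

INTERVAL_S = 30             # we checkpoint every 30 seconds
FIRST_INCREMENT_MB = 232    # incremental size on September 21st
LAST_INCREMENT_MB = 2350    # incremental size on November 14th (2.35 GB)


def checkpoints_between(start, end, interval_s=INTERVAL_S):
    """Number of checkpoints taken between two dates at a fixed interval."""
    return (end - start).days * 24 * 3600 // interval_s


def naive_sum_gb(n, first_mb, last_mb):
    """Total size in GB of n increments growing linearly from first_mb to last_mb."""
    return n * (first_mb + last_mb) / 2 / 1000


# Assumed year; only the number of days between the two dates matters.
n = checkpoints_between(date(2023, 9, 21), date(2023, 11, 14))

print("checkpoints taken:", n)
print("sum if every increment stayed at 232 MB (GB):",
      naive_sum_gb(n, FIRST_INCREMENT_MB, FIRST_INCREMENT_MB))
print("sum if increments grew linearly to 2.35 GB (GB):",
      naive_sum_gb(n, FIRST_INCREMENT_MB, LAST_INCREMENT_MB))
```

Under these assumptions that is about 155,000 checkpoints, and the naive sum lands somewhere between roughly 36 TB and 200 TB, while the graph shows 11.6 GB. So the "sum of all incremental checkpoints" model clearly does not match what we observe, which is part of what we would like to understand.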