Hi Oscar,

> but we don't understand why this incremental checkpoint keeps increasing

AFAIK, when performing an incremental checkpoint, the RocksDBStateBackend
uploads the newly created SST files to remote storage. The total size of
these files is the incremental checkpoint size. However, the SST files newly
generated by RocksDB's compaction are not determined solely by the new data
ingested into the state. RocksDB stores data as an LSM tree, which has space
amplification; the size of the files generated by compaction depends on the
compaction strategy and might be proportional to the overall size of the
LSM tree.

Hope this resolves your doubts.

Xiangyu Feng

On Mon, Nov 27, 2023 at 19:55, Oscar Perez via user <user@flink.apache.org> wrote:

> Hi,
>
> We have a long running job in production and we are trying to understand
> the metrics for this job, see attached screenshot.
>
> We have enabled incremental checkpoint for this job and we use RocksDB as
> a state backend.
>
> When deployed from fresh state, the initial checkpoint size is about
> *2.41G*. I guess most of the contents come from table API and reading a
> bunch of topics from earliest.
>
> Few data regarding the graph:
>
> Full checkpoint size spans from *2.41G* in September 21st until *11.6G*
> in November 14th
> The last checkpoint size (incremental checkpoint) goes from *232Mb* in
> September 21st until *2.35Gb* on November 14th. We take incremental
> checkpoints every 30 seconds
> Time that it takes to take a checkpoint goes from *1.66 seconds* on
> September 21st until *10.85 seconds* on November 14th
>
> Few things we don't understand
>
> Why incremental checkpoint size keeps increasing? My assumption would be
> that deltas are linear and the incremental checkpoint size would remain
> around 230Mb but it keeps increasing over time until it reaches to 2.35Gb!
>
> Full checkpoint size does not completely make sense. If each incremental
> checkpoint size keeps increasing linearly, I would expect the full
> checkpoint size to increase way way faster as the full checkpoint size is
> the sum of all incremental checkpoints and we take an incremental
> checkpoint every 30 seconds.
>
> Time that takes to take a checkpoint correlates with the incremental
> checkpoint sizes. The bigger the incremental checkpoint size, the longer it
> takes to store it but we don't understand why this incremental checkpoint
> keeps increasing. Is this something related to table API internals?
>
> Thanks for any help that could be given!
> Regards,
> Oscar
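
To make the compaction effect described in the reply more concrete, here is a rough toy model in Python. It is only a sketch under strong assumptions and is not how Flink or RocksDB are implemented: the function `simulate` and its parameters `delta_mb` and `l0_trigger` are hypothetical, every checkpoint interval flushes one fixed-size file, no keys are ever overwritten or deleted, and each compaction rewrites the whole tree into a single file (real leveled compaction usually rewrites only the overlapping part). The incremental size is counted as the size of the live files that have not been uploaded yet, and the full size as the size of all live files.

```python
def simulate(num_checkpoints, delta_mb=100, l0_trigger=4):
    """Toy model: returns a list of (checkpoint, incremental_mb, full_mb)."""
    live = {}         # file id -> size in MB; SST files currently in the tree
    uploaded = set()  # ids of files already copied to remote storage
    l0 = []           # ids of small flushed files not yet compacted
    base_id = None    # id of the single big compacted file, if any
    next_id = 0
    rows = []

    for cp in range(1, num_checkpoints + 1):
        # New data since the last checkpoint is flushed as one small file.
        live[next_id] = delta_mb
        l0.append(next_id)
        next_id += 1

        # Assumption: once enough small files pile up, compaction merges them
        # with the big file into a brand-new file and drops the inputs.
        if len(l0) == l0_trigger:
            inputs = l0 + ([base_id] if base_id is not None else [])
            merged_size = sum(live.pop(f) for f in inputs)
            live[next_id] = merged_size
            base_id = next_id
            next_id += 1
            l0 = []

        # Incremental checkpoint: upload only files not seen before.
        new_files = [f for f in live if f not in uploaded]
        incremental = sum(live[f] for f in new_files)
        uploaded.update(new_files)

        # Full checkpoint size: everything the checkpoint references.
        full = sum(live.values())
        rows.append((cp, incremental, full))

    return rows


if __name__ == "__main__":
    for cp, incremental, full in simulate(12):
        print(f"checkpoint {cp:2d}: incremental={incremental:5d} MB, full={full:5d} MB")
```

With the default parameters, the checkpoints that follow a compaction (4, 8 and 12) upload 400, 800 and 1200 MB, while the others upload 100 MB, so in this model the largest increments grow with the total state size even though the ingest rate is constant. The full size after 12 checkpoints is 1200 MB, not the 3300 MB sum of all increments, because files replaced by compaction are no longer referenced by later checkpoints.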