Hi,

We have a long-running job in production and we are trying to understand
its checkpoint metrics (see the attached screenshot).

We have enabled incremental checkpoints for this job and we use RocksDB as
the state backend.

When deployed from a fresh state, the initial checkpoint size is about *2.41 GB*.
I guess most of the contents come from the Table API and from reading a bunch
of topics from the earliest offset.

A few data points from the graph:

- The full checkpoint size grows from *2.41 GB* on September 21st to
  *11.6 GB* on November 14th.
- The last checkpoint size (incremental checkpoint) grows from *232 MB* on
  September 21st to *2.35 GB* on November 14th. We take incremental
  checkpoints every 30 seconds.
- The time it takes to complete a checkpoint grows from *1.66 seconds* on
  September 21st to *10.85 seconds* on November 14th.

A few things we don't understand:

Why does the incremental checkpoint size keep increasing? My assumption would
be that the state grows linearly, so each delta would stay around 230 MB, but
the incremental checkpoint size keeps increasing over time until it reaches
2.35 GB!

The full checkpoint size does not completely make sense. If each incremental
checkpoint keeps growing linearly, I would expect the full checkpoint size to
increase much faster, as the full checkpoint size is the sum of all
incremental checkpoints and we take an incremental checkpoint every 30
seconds.

The time it takes to complete a checkpoint correlates with the incremental
checkpoint size: the bigger the incremental checkpoint, the longer it takes
to store it. What we don't understand is why the incremental checkpoint size
keeps increasing. Is this something related to Table API internals?
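
For reference, this is the small Python sketch I used to compare how much the
incremental size and the checkpoint duration grew. It only uses the two end
points from the graph, assumes decimal units (2.35 GB = 2350 MB), and treats
the whole checkpoint duration as time spent writing the delta, which is a
simplification.

```python
# Figures taken from the graph described above (decimal units assumed).
first_size_mb, last_size_mb = 232.0, 2350.0      # incremental checkpoint size
first_duration_s, last_duration_s = 1.66, 10.85  # checkpoint duration

# Growth factors between September 21st and November 14th.
size_growth = last_size_mb / first_size_mb
duration_growth = last_duration_s / first_duration_s

# Effective write rate if the whole duration were spent storing the delta.
first_rate_mb_s = first_size_mb / first_duration_s
last_rate_mb_s = last_size_mb / last_duration_s

print(f"incremental size grew {size_growth:.1f}x")
print(f"checkpoint duration grew {duration_growth:.1f}x")
print(f"effective rate: {first_rate_mb_s:.0f} MB/s -> {last_rate_mb_s:.0f} MB/s")
```

The size grew about 10x while the duration grew about 6.5x, so the two move
together but are not strictly proportional.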

Thanks for any help that could be given!
Regards,
Oscar
