Hi,

> We wonder if this is expected behavior or not?

I think it's expected. You can find more information in document [1]:

Checkpoints and savepoints differ in their implementation. Checkpoints are designed to be lightweight and fast. They might (but don't necessarily have to) make use of features of the underlying state backend and try to restore data as fast as possible. For example, incremental checkpoints with the RocksDB state backend use RocksDB's internal format instead of Flink's native format, which speeds up the checkpointing process and makes them the first instance of a more lightweight checkpointing mechanism. Savepoints, in contrast, are designed to focus on the portability of data and to support any changes made to the job, which makes them slightly more expensive to produce and restore.

Besides, the savepoint binary format is different from the checkpoint format. Flink's savepoint binary format is unified across all state backends [2], which means you can take a savepoint with one state backend and then restore it using another. So when restoring from a savepoint file, the job needs to read the unified binary format and write the data back into the format of the underlying state backend. When restoring from a checkpoint file, this step can be much cheaper; for example, the files may be loaded directly into the underlying state backend.

[1] https://www.ververica.com/blog/differences-between-savepoints-and-checkpoints-in-flink
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/savepoints/#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint
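In case it helps, the difference also shows up in how each is triggered and restored from the CLI. This is a rough sketch, not a definitive recipe: `:jobId`, the bucket names, and the `savepoint-xxxx`/`chk-42` paths are placeholders, and `flink run -s` accepts both savepoint paths and retained-checkpoint paths:

```
# Trigger a savepoint for a running job (unified, portable format):
flink savepoint :jobId gs://my-savepoint-bucket/savepoints

# Restore from the savepoint. Flink has to translate the unified format
# into the state backend's native format, which is the expensive step:
flink run -s gs://my-savepoint-bucket/savepoints/savepoint-xxxx ...

# Restore from a retained checkpoint. With RocksDB, the backend's own
# files can largely be loaded as-is, so this is typically much faster:
flink run -s gs://my-checkpoint-bucket/checkpoints/:jobId/chk-42 ...
```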
Best,
JING ZHANG

Roman Khachatryan <ro...@apache.org> wrote on Mon, Oct 25, 2021 at 4:53 PM:
> Hi ChangZhuo,
>
> Yes, restoring from a savepoint is expected to be significantly slower
> than from a checkpoint.
>
> Regards,
> Roman
>
> On Mon, Oct 25, 2021 at 9:45 AM ChangZhuo Chen (陳昌倬) <czc...@czchen.org>
> wrote:
> >
> > Hi,
> >
> > We found that our application's savepoint restoration time (~40 mins) is
> > much slower than its checkpoint restoration time (~4 mins). We wonder if
> > this is expected behavior or not?
> >
> >
> > Some details about the environment:
> >
> > * Flink version: 1.14.0
> > * Persistent storage is GCS, via the following jars:
> >   * flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar
> >   * gcs-connector-hadoop3-2.2.2-shaded.jar
> > * Unaligned checkpoint is enabled.
> > * The network ingress for checkpoint restoration (~750 MiB/s) is much
> >   faster than for savepoint restoration (~50 MiB/s).
> > * Checkpoints and savepoints use different GCS buckets; not sure if this
> >   affects the throughput of GCS.
> >
> >
> > --
> > ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
> > http://czchen.info/
> > Key fingerprint = BA04 346D C2E1 FE63 C790 8793 CC65 B0CD EC27 5D5B