Hi,

> We wonder if this is expected behavior or not?

I think it's expected. You can find more information in document [1]:

Checkpoints and savepoints differ in their implementation. Checkpoints are designed to be lightweight and fast. They might (but don't necessarily have to) make use of features of the underlying state backend and try to restore data as fast as possible. For example, incremental checkpoints with the RocksDB state backend use RocksDB's internal format instead of Flink's native format, which speeds up the checkpointing process and makes them the first instance of a more lightweight checkpointing mechanism. Savepoints, in contrast, are designed to focus on the portability of data and to support any changes made to the job, which makes them slightly more expensive to produce and restore.

Besides, the savepoint binary format is different from the checkpoint format. Flink's savepoint binary format is unified across all state backends [2], which means you can take a savepoint with one state backend and then restore it using another. So when restoring from a savepoint file, the job needs to read the unified binary format and write the data back into the format of the underlying state backend. When restoring from a checkpoint file, this step can be much cheaper; for example, the files may be loaded directly into the underlying state backend.

[1] https://www.ververica.com/blog/differences-between-savepoints-and-checkpoints-in-flink
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/savepoints/#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint
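In case it helps, the difference also shows up in how each is triggered and restored from the CLI. This is a rough sketch, not a definitive recipe: `:jobId`, the bucket names, and the `savepoint-xxxx`/`chk-42` paths are placeholders, and `flink run -s` accepts both savepoint paths and retained-checkpoint paths:

```
# Trigger a savepoint for a running job (unified, portable format):
flink savepoint :jobId gs://my-savepoint-bucket/savepoints

# Restore from the savepoint. Flink has to translate the unified format
# into the state backend's native format, which is the expensive step:
flink run -s gs://my-savepoint-bucket/savepoints/savepoint-xxxx ...

# Restore from a retained checkpoint. With RocksDB, the backend's own
# files can largely be loaded as-is, so this is typically much faster:
flink run -s gs://my-checkpoint-bucket/checkpoints/:jobId/chk-42 ...
```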
Best,
JING ZHANG

Roman Khachatryan <ro...@apache.org> wrote on Mon, Oct 25, 2021 at 4:53 PM:
> Hi ChangZhuo,
>
> Yes, restoring from a savepoint is expected to be significantly slower
> than from a checkpoint.
>
> Regards,
> Roman
>
> On Mon, Oct 25, 2021 at 9:45 AM ChangZhuo Chen (陳昌倬) <czc...@czchen.org>
> wrote:
> >
> > Hi,
> >
> > We found that our application's savepoint restoration time (~40 mins) is
> > much slower than its checkpoint restoration time (~4 mins). We wonder if
> > this is expected behavior or not?
> >
> >
> > Some details about the environment:
> >
> > * Flink version: 1.14.0
> > * Persistent storage is GCS, via the following jars:
> >   * flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar
> >   * gcs-connector-hadoop3-2.2.2-shaded.jar
> > * Unaligned checkpoint is enabled.
> > * The network ingress for checkpoint restoration (~750 MiB/s) is much
> >   faster than for savepoint restoration (~50 MiB/s).
> > * Checkpoints and savepoints use different GCS buckets; not sure if this
> >   affects the throughput of GCS.
> >
> >
> > --
> > ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
> > http://czchen.info/
> > Key fingerprint = BA04 346D C2E1 FE63 C790 8793 CC65 B0CD EC27 5D5B