I don't quite understand your question. You use the Savepoint API to create a
savepoint with a batch job (that's why it currently uses DataSet
transformations). That savepoint can only be restored through a DataStream
application; DataSet applications cannot start from a savepoint.

So I don't understand why you see a difference between "restoring a
savepoint into a DataStream" and "creating a NewSavepoint for a DataStream".
To me it's ultimately the same thing. Just to be very clear: the main
purpose of the Savepoint API is to create the initial state of a DataStream
application.
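
To make that concrete, here is a minimal sketch of the pattern I mean, based
on the 1.13 State Processor API docs. The Account POJO, the TotalBootstrapper,
the "accounts-uid" operator uid, and the S3 path are made-up placeholders:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class BootstrapSavepointJob {

    // Placeholder POJO for the data we want to turn into state.
    public static class Account {
        public String id;
        public long balance;
        public Account() {}
        public Account(String id, long balance) { this.id = id; this.balance = balance; }
    }

    // Writes one keyed ValueState entry per account; the state name "total"
    // must match the descriptor used later in the DataStream job.
    public static class TotalBootstrapper extends KeyedStateBootstrapFunction<String, Account> {
        private transient ValueState<Long> total;

        @Override
        public void open(Configuration parameters) {
            total = getRuntimeContext().getState(new ValueStateDescriptor<>("total", Long.class));
        }

        @Override
        public void processElement(Account account, Context ctx) throws Exception {
            total.update(account.balance);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The DataSet part only prepares the initial state.
        DataSet<Account> accounts = env.fromElements(new Account("a", 100L), new Account("b", 200L));

        BootstrapTransformation<Account> transformation = OperatorTransformation
                .bootstrapWith(accounts)
                .keyBy(acc -> acc.id)
                .transform(new TotalBootstrapper());

        // "accounts-uid" must match the uid() of the keyed operator in the streaming job.
        Savepoint
                .create(new MemoryStateBackend(), 128)
                .withOperator("accounts-uid", transformation)
                .write("s3://my-bucket/savepoints/bootstrap");

        env.execute("bootstrap-savepoint");
    }
}

The streaming job is then started from that savepoint (e.g. flink run -s
s3://my-bucket/savepoints/bootstrap ...), and the KeyedProcessFunction that
reads the "total" state gets .uid("accounts-uid") so the bootstrapped state
is mapped onto it.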

For your second question: yes, retained checkpoints outlive the job in all
regards. It's the user's responsibility to eventually clean them up.
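
For reference, this is roughly how retained checkpoints are switched on;
just a sketch, the interval and the S3 directory are placeholders:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60 s (placeholder interval)
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // ... define sources/operators/sinks here, then call env.execute("my-job")
    }
}

A job can then be resumed from a retained checkpoint the same way as from a
savepoint, e.g. flink run -s s3://my-bucket/checkpoints/<job-id>/chk-<n> ...,
and you delete the checkpoint directory yourself once you no longer need it.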



On Wed, Jul 7, 2021 at 6:56 PM Rakshit Ramesh <rakshit.ram...@datakaveri.org>
wrote:

> Yes, I could understand restoring a savepoint into a DataStream.
> What I couldn't figure out is how to create a NewSavepoint for a DataStream.
> My understanding is that a NewSavepoint only takes a BootstrapTransformation
> built from DataSet transform functions.
>
>
> About the checkpoints, does
>  CheckpointConfig.ExternalizedCheckpointCleanup = RETAIN_ON_CANCELLATION
> offer the same behaviour when the job is "FINISHED" and not "CANCELLED"?
>
> What I'm looking for is a way to retain the state for a bounded job so
> that the state is reloaded on the next job run (through the API).
>
> On Wed, 7 Jul 2021 at 14:18, Arvid Heise <ar...@apache.org> wrote:
>
>> Hi Rakshit,
>>
>> The example is valid. The State Processor API kind of works like a
>> DataSet application, but the state is meant to be read in a DataStream
>> application. Please check out the SavepointWriterITCase [1] for a full
>> example. There is no checkpoint/savepoint in DataSet applications.
>>
>> Checkpoints can be stored on different checkpoint storages, such as S3 or
>> HDFS. If you use the RocksDB state backend, Flink pretty much just copies
>> the SST files of RocksDB to S3. Checkpoints are usually bound to the life
>> of an application, so they are created by the application and deleted on
>> termination.
>> However, you can resume an application from both savepoints and
>> checkpoints. Checkpoints can be retained [2] to avoid them being deleted
>> by the application during termination, but that's considered an advanced
>> feature and you should first try it with savepoints.
>>
>> [1]
>> https://github.com/apache/flink/blob/release-1.13.0/flink-libraries/flink-state-processing-api/src/test/java/org/apache/flink/state/api/SavepointWriterITCase.java#L141-L141
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/checkpoints/#retained-checkpoints
>>
>> On Mon, Jul 5, 2021 at 5:56 PM Rakshit Ramesh <
>> rakshit.ram...@datakaveri.org> wrote:
>>
>>> I'm trying to bootstrap state into a KeyedProcessFunction equivalent
>>> that takes in
>>> a DataStream, but I'm unable to find a reference for it.
>>> I found this gist
>>> https://gist.github.com/alpinegizmo/ff3d2e748287853c88f21259830b29cf
>>> but it seems to apply only to DataSet.
>>> My use case is to manually trigger a savepoint into S3 for later reuse.
>>> I'm also guessing that checkpoints can't be stored in RocksDB or S3 for
>>> later reuse.
>>>
>>
