Hey Jacob,

If you understand how the Kafka offset is managed in the checkpoint,
then you can map that notion to other Flink sources.

I would suggest reading the Data Sources[1] documentation and FLIP-27[5].
Each source should define a `Split`; it is then the `SourceReaderBase`[2]
class's responsibility to maintain the state of unfinished splits for
checkpointing.

A split could then be almost anything, for example:
- Filename and offset in that file[3]
- Kafka topic partition and offset[4]
- JDBC query with start and end range
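
To make the idea concrete, here is a minimal sketch of what a file-based
split might carry. The class and field names (`MiniFileSplit`, `position`)
are invented for this sketch and are not part of the Flink API; the real
thing is `FileSourceSplit`[3].

```java
// Hypothetical illustration, not Flink API: a minimal "split" for a
// file-based source. The split pins down *what* to read (the file) and
// *where* to resume (the byte offset).
final class MiniFileSplit {
    final String path;   // which file this split covers
    final long position; // byte offset to resume from after recovery

    MiniFileSplit(String path, long position) {
        this.path = path;
        this.position = position;
    }

    // The reader advances the position as records are emitted; at
    // checkpoint time, the current position is what gets snapshotted.
    MiniFileSplit advanceTo(long newPosition) {
        return new MiniFileSplit(path, newPosition);
    }
}
```

The Kafka and JDBC cases are the same shape: swap the filename for a
topic partition or a query, and the byte offset for a Kafka offset or a
key range.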

In general, each source should define such a split so that, on failure
recovery, the source reader knows where to resume processing data.
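
To sketch how that recovery plays out (all names here are invented for
illustration; in Flink the mechanics live in `SourceReaderBase` and the
restored-split handoff): the reader's position is snapshotted at each
checkpoint, and after a crash a fresh reader starts from the last
snapshot, re-reading anything consumed after it.

```java
// Hypothetical sketch, not Flink API: checkpoint/restore of split state.
final class MiniSourceReader {
    private final String splitId; // identifies the split being read
    private long offset;          // current read position within the split

    MiniSourceReader(String splitId, long startOffset) {
        this.splitId = splitId;
        this.offset = startOffset;
    }

    // Pretend to read one record, advancing the position.
    void pollNext() {
        offset++;
    }

    // Called at checkpoint time: this is the state that gets persisted.
    long snapshotState() {
        return offset;
    }

    long currentOffset() {
        return offset;
    }

    // On recovery, a fresh reader is built from the last snapshot, so
    // records read after that checkpoint are simply re-read.
    static MiniSourceReader restoreFrom(String splitId, long snapshot) {
        return new MiniSourceReader(splitId, snapshot);
    }
}
```

Note this gives at-least-once delivery from the source's point of view;
exactly-once end-to-end additionally needs transactional or idempotent
sinks.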

Hope these are helpful, and hopefully I didn't mix anything up :)

Best,
Muhammet

[1]: https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/sources/
[2]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/source/reader/SourceReaderBase.java#L300-L304
[3]: https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/FileSourceSplit.java
[4]: https://github.com/apache/flink-connector-kafka/blob/main/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/split/KafkaPartitionSplit.java
[5]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface

On 2024-05-08 10:56, Jacob Rollings wrote:
Hello,

I'm curious about how Flink checkpointing would aid in recovering data
if the data source is not Kafka but another system. I understand that
checkpoint snapshots are taken at regular time intervals.

What happens to the data that were read after the previous successful
checkpoint if the system crashes before the next checkpointing
interval, especially for data sources that are not Kafka and don't
support offset management?

-
JB