Hi Roman,

Thanks for the reply.
By "internal restarts", I meant the automatic restore of a Flink job from a
failure - for example, the pipeline restarting because a Fixed Delay
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html#fixed-delay-restart-strategy>
or Failure Rate
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html#failure-rate-restart-strategy>
restart strategy is configured.
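
For reference, a minimal sketch of how such a strategy is configured (the
values are illustrative; the equivalent settings can also be provided in
flink-conf.yaml):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Fixed Delay: restart up to 3 times, waiting 10 seconds between attempts.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

// Failure Rate (alternative): tolerate at most 3 failures per 5-minute
// interval, with a 10-second delay between restart attempts.
// env.setRestartStrategy(RestartStrategies.failureRateRestart(
//         3, Time.minutes(5), Time.seconds(10)));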

Quoting the documentation on Kafka Consumers Start Position Configuration
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-start-position-configuration>:

> Note that these start position configuration methods do not affect the
> start position when the job is automatically restored from a failure.


It seems that there will be data loss even when offsets are managed
externally if the pipeline restarts due to a failure, say, an exception,
since the configured start position is not applied on an automatic restore.
On the other hand, when the pipeline is stopped and resubmitted (say, for an
upgrade), there won't be any data loss, as offsets are retrieved from the
external store and configured while starting the Kafka consumer.
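
For illustration, a minimal sketch of what we do on resubmission
(fetchOffsetsFromExternalStore() is a hypothetical helper standing in for
our Cassandra lookup; the topic name and properties are placeholders):

import java.util.Map;
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
props.setProperty("group.id", "my-group");            // placeholder

// Offsets persisted by the previous run; the helper is hypothetical.
Map<KafkaTopicPartition, Long> offsets = fetchOffsetsFromExternalStore();

FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props);
consumer.setStartFromSpecificOffsets(offsets);

// Or, when tracking a processed-up-to timestamp instead of offsets:
// consumer.setStartFromTimestamp(lastFullyProcessedTimestampMillis);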

We do not want to enable checkpointing, as the pipeline is stateless. We
have deduplication logic in the pipeline, and the processing is idempotent.

Other than managing offsets externally, are there any other ways to
guarantee "at least once" processing without enabling checkpointing?

Thanks,
Rahul

On Mon, Apr 12, 2021 at 6:00 PM Roman Khachatryan <ro...@apache.org> wrote:

> Hi,
>
> Could you please explain what you mean by internal restarts?
>
> If you commit offsets or timestamps from the sink after emitting records
> to the external system, then there should be no data loss.
> Otherwise (if you commit offsets earlier), you have to persist
> in-flight records to avoid data loss (i.e. enable checkpointing).
>
> Regards,
> Roman
>
> On Mon, Apr 12, 2021 at 12:11 PM Rahul Patwari
> <rahulpatwari8...@gmail.com> wrote:
> >
> > Hello,
> >
> > Context:
> >
> > We have a stateless Flink pipeline which reads from Kafka topics.
> > The pipeline has a Windowing operator (used only for introducing a delay
> > in processing records) and Async I/O operators (used for lookup/enrichment).
> >
> > "At least Once" Processing semantics is needed for the pipeline to avoid
> data loss.
> >
> > Checkpointing is disabled, and we currently depend on the Kafka consumer's
> > auto offset commit for fault tolerance.
> >
> > As an auto-committed offset indicates that "the record was successfully
> > read" rather than "the record was successfully processed", there will be
> > data loss if a restart happens after an offset is committed to Kafka but
> > before the record is successfully processed by the Flink pipeline, as the
> > record is NOT replayed when the pipeline is restarted.
> >
> > Checkpointing can solve this problem. But, since the pipeline is
> > stateless, we do not want to use checkpointing, which would persist all
> > the records in the Windowing operator and the in-flight Async I/O calls.
> >
> > Question:
> >
> > We are looking for other ways to guarantee "at least once" processing
> > without checkpointing. One such way is to manage Kafka offsets externally.
> >
> > We can maintain the offsets of each partition of each topic in Cassandra
> > (or maintain a timestamp such that all records with timestamps less than
> > it are successfully processed) and configure the Kafka consumer's start
> > position via setStartFromTimestamp() or setStartFromSpecificOffsets().
> >
> > This will be helpful if the pipeline is manually restarted (say, if the
> > JobManager pod is restarted). But how can we avoid data loss in case of
> > internal restarts?
> >
> > Has anyone used this approach?
> > What are other ways to guarantee "at least once" processing without
> > checkpointing for a stateless Flink pipeline?
> >
> > Thanks,
> > Rahul
>
