[
https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754714#comment-16754714
]
Jozef Vilcek commented on BEAM-2185:
------------------------------------
With KafkaIO supporting the settings `withMaxNumRecords()` and
`commitOffsetsInFinalize()`, one can be under the impression that a batch
workload is possible. BEAM-6466 shows that the commit of offsets still
misbehaves in some cases.
What I also noticed are the weak guarantees around committing offsets. The
commit is invoked by taking a checkpoint and finalizing it. This is done in
ReadFn.processElement after reading enough elements from the source and/or
reading for long enough, here:
[https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedReadFromUnboundedSource.java#L203]
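The stopping condition in the linked ReadFn can be sketched with plain Java (no Beam dependency); `readBounded` and its parameters are hypothetical stand-ins for the real `maxNumRecords`/`maxReadTime` bounds, not Beam's actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the stopping logic: read from the unbounded
// reader until either the record cap or the time cap is hit, then
// finalize the checkpoint (which is what commits offsets when
// commitOffsetsInFinalize() is set).
public class BoundedReadSketch {
    static List<Integer> readBounded(Iterator<Integer> source,
                                     long maxNumRecords,
                                     Duration maxReadTime) {
        List<Integer> out = new ArrayList<>();
        Instant deadline = Instant.now().plus(maxReadTime);
        while (out.size() < maxNumRecords
                && Instant.now().isBefore(deadline)
                && source.hasNext()) {
            out.add(source.next());
        }
        // At this point the real ReadFn finalizes the checkpoint,
        // i.e. offsets are committed before any downstream stage runs.
        return out;
    }
}
```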
I guess that the checkpoint is made before all elements are successfully
processed downstream. If this is used in batch, a failure and restart can
lose some data. Right? Is there a way in the Beam model to make sure offsets
are committed only after elements are successfully processed? In batch mode
it probably means the pipeline finished in the DONE state?
> KafkaIO bounded source
> ----------------------
>
> Key: BEAM-2185
> URL: https://issues.apache.org/jira/browse/BEAM-2185
> Project: Beam
> Issue Type: New Feature
> Components: io-java-kafka
> Reporter: Raghu Angadi
> Priority: Major
>
> KafkaIO could be a useful source for batch applications as well. It could
> implement a bounded source. The primary question is how the bounds are
> specified.
> One option : Source specifies a time period (say 9am-10am), and KafkaIO
> fetches appropriate start and end offsets based on time-index in Kafka. This
> would suit many batch applications that are launched on a schedule.
> Another option is to always read till the end and commit the offsets to
> Kafka. Handling failures and multiple runs of a task might be complicated.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)