[
https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756354#comment-16756354
]
Raghu Angadi commented on BEAM-2185:
------------------------------------
[~JozoVilcek], actually, I don't think 'commitOffsetsInFinalize()' makes sense
for batch processing. Batch processing expects that rerunning a task does not
affect the end result (e.g. speculative execution in Hadoop MapReduce). If the
user commits offsets, a parallel run of the same task might read a different
set of records.
This jira is open precisely to design this aspect of the bounded source: how
to set the start and end points of each split so that it is repeatable.
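A minimal sketch of what "repeatable" could mean here, assuming the start and
end offsets of a partition are pinned once, before any task runs. The class and
method names (OffsetRangeSplitter, OffsetRange, split) are hypothetical and are
not part of KafkaIO. Each split depends only on the pinned range and the chunk
size, never on committed consumer offsets, so a retried or speculative task
gets the same bounds:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not KafkaIO's API. */
public class OffsetRangeSplitter {

    /** Half-open offset range [start, end) for a single partition. */
    public static final class OffsetRange {
        public final int partition;
        public final long start;
        public final long end;

        public OffsetRange(int partition, long start, long end) {
            if (start < 0 || start > end) {
                throw new IllegalArgumentException("need 0 <= start <= end");
            }
            this.partition = partition;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "p" + partition + "[" + start + "," + end + ")";
        }
    }

    /**
     * Cuts a pinned range into chunks of at most maxRecords offsets.
     * The result is a pure function of the arguments, so every run of
     * the same task computes identical split boundaries.
     */
    public static List<OffsetRange> split(OffsetRange range, long maxRecords) {
        if (maxRecords <= 0) {
            throw new IllegalArgumentException("maxRecords must be positive");
        }
        List<OffsetRange> splits = new ArrayList<>();
        long s = range.start;
        while (s < range.end) {
            // Clamp the last chunk to the pinned end offset.
            long e = s + Math.min(maxRecords, range.end - s);
            splits.add(new OffsetRange(range.partition, s, e));
            s = e;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed bounds, e.g. resolved once when the job is constructed.
        OffsetRange pinned = new OffsetRange(0, 100L, 350L);
        System.out.println(split(pinned, 100L));
    }
}
```

With the assumed bounds above, partition 0 is cut into [100,200), [200,300)
and [300,350) on every run. How the pinned bounds themselves are chosen is the
open design question in this issue.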
Hope this helps.
> KafkaIO bounded source
> ----------------------
>
> Key: BEAM-2185
> URL: https://issues.apache.org/jira/browse/BEAM-2185
> Project: Beam
> Issue Type: New Feature
> Components: io-java-kafka
> Reporter: Raghu Angadi
> Priority: Major
>
> KafkaIO could be a useful source for batch applications as well. It could
> implement a bounded source. The primary question is how the bounds are
> specified.
> One option: Source specifies a time period (say 9am-10am), and KafkaIO
> fetches appropriate start and end offsets based on time-index in Kafka. This
> would suit many batch applications that are launched on a schedule.
> Another option is to always read till the end and commit the offsets to
> Kafka. Handling failures and multiple runs of a task might be complicated.