[ 
https://issues.apache.org/jira/browse/BEAM-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756354#comment-16756354
 ] 

Raghu Angadi commented on BEAM-2185:
------------------------------------

[~JozoVilcek], actually, I don't think 'commitOffsetsInFinalize()' makes sense 
for batch processing. Batch processing expects that rerunning a task does not 
affect the end result (e.g. speculative execution in Hadoop MapReduce). If the 
user commits offsets, a parallel run of the same task might read a different 
set of records. 

This JIRA is open precisely to design this aspect of the bounded source: how 
to set the start and end points of each split so that reading is repeatable. 
Hope this helps.
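
To illustrate what repeatable splits could look like, here is a minimal sketch 
in plain Java. It is not KafkaIO code and not a proposed design. It assumes the 
start and end offsets of a partition have already been fixed before the job 
starts (for example, resolved once from a time range), and the class and method 
names (FixedOffsetSplits, OffsetRange, split) are hypothetical. Because each 
split is a pure function of those fixed bounds, a rerun or a speculative copy 
of a task would read exactly the same offsets, with nothing depending on 
committed consumer offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedOffsetSplits {

    /** Half-open offset range [start, end) within one partition (hypothetical type). */
    static final class OffsetRange {
        final int partition;
        final long start;
        final long end;

        OffsetRange(int partition, long start, long end) {
            this.partition = partition;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "partition " + partition + " [" + start + ", " + end + ")";
        }
    }

    /**
     * Divides [start, end) into at most numSplits contiguous ranges.
     * The result depends only on the arguments, so every run of the
     * same task sees the same ranges.
     */
    static List<OffsetRange> split(int partition, long start, long end, int numSplits) {
        if (end < start || numSplits < 1) {
            throw new IllegalArgumentException("invalid bounds or split count");
        }
        List<OffsetRange> ranges = new ArrayList<>();
        long total = end - start;
        long base = total / numSplits;       // minimum size of each range
        long remainder = total % numSplits;  // the first 'remainder' ranges get one extra offset
        long cursor = start;
        for (int i = 0; i < numSplits && cursor < end; i++) {
            long size = base + (i < remainder ? 1 : 0);
            ranges.add(new OffsetRange(partition, cursor, cursor + size));
            cursor += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Offsets 100..109 of partition 0, divided among three tasks.
        for (OffsetRange r : split(0, 100L, 110L, 3)) {
            System.out.println(r);
        }
    }
}
```

How the fixed bounds are chosen in the first place (time range, explicit 
offsets, or something else) is the open question of this issue; the sketch 
only shows that once they are fixed, rerunning a split is safe.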


> KafkaIO bounded source
> ----------------------
>
>                 Key: BEAM-2185
>                 URL: https://issues.apache.org/jira/browse/BEAM-2185
>             Project: Beam
>          Issue Type: New Feature
>          Components: io-java-kafka
>            Reporter: Raghu Angadi
>            Priority: Major
>
> KafkaIO could be a useful source for batch applications as well. It could 
> implement a bounded source. The primary question is how the bounds are 
> specified.
> One option: the source specifies a time period (say 9am-10am), and KafkaIO 
> fetches appropriate start and end offsets based on the time index in Kafka. 
> This would suit many batch applications that are launched on a schedule.
> Another option is to always read till the end and commit the offsets to 
> Kafka. Handling failures and multiple runs of a task might be complicated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
