I understand Flink does streaming, but I feel my requirement is more batch oriented.
Read from a Kafka cluster, do a little data massaging, and bucket the data into Hadoop files that are at least one HDFS block in size. Our environment is YARN and Kerberized (both Kafka and Hadoop; I am currently allowed to pass the keytab to the containers). Signal downstream processing based on timestamps in the data and on Kafka offsets. The process has to run forever.

I used a Flink streaming job with checkpoints and windowing (a custom trigger that fires on either a count of events or on time, inspired by https://gist.github.com/shikhar/2cb9f1b792be31b7c16e/9f08f8964b3f177fe48f6f8fc3916932fdfdcc71 ).

Observations with streaming:
1) Long-running Kerberos fails after 7 days (the data held in the window buffer is lost, and a restart results in event loss)
2) I hold on to the resources/containers in the cluster for all time, irrespective of the volume of events
3) The committed offsets in Kafka do not reflect the last offsets written to HDFS (Kafka offsets may be committed/checkpointed while the window has yet to trigger)
4) Windowing is similar to batch in a way: the files are not available/rolled until the file is closed

Is there a way the Kafka connector can take start and stop values for offsets? That would be ideal for my scenario. The design in this scenario would be to:
1) run a scheduled job that calculates the offset ranges to be consumed
2) choose the number and size of containers based on the number of messages to be consumed
3) commit the offsets after the job is successful

Please let me know if there is a better way.

Thanks,
Prabhu
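For context, the firing condition of the count-or-time trigger I used can be sketched independently of the Flink Trigger API (the class and method names below are illustrative, not Flink's): fire when either a count threshold or an elapsed-time threshold is reached.

```java
// Minimal sketch of the count-or-time firing decision behind the
// custom window trigger: fire once maxCount events have been buffered
// OR maxDelayMs has elapsed since the first buffered event.
// CountOrTimePolicy / onEvent / shouldFire are hypothetical names.
public class CountOrTimePolicy {
    private final long maxCount;
    private final long maxDelayMs;
    private long count = 0;
    private long firstEventTs = -1; // -1 means the buffer is empty

    public CountOrTimePolicy(long maxCount, long maxDelayMs) {
        this.maxCount = maxCount;
        this.maxDelayMs = maxDelayMs;
    }

    /** Record one event arriving at nowMs; return true if the window should fire. */
    public boolean onEvent(long nowMs) {
        if (firstEventTs < 0) firstEventTs = nowMs;
        count++;
        return shouldFire(nowMs);
    }

    /** Fire when either the count threshold or the time threshold is reached. */
    public boolean shouldFire(long nowMs) {
        return count >= maxCount
                || (firstEventTs >= 0 && nowMs - firstEventTs >= maxDelayMs);
    }

    /** Reset the buffered state after the window fires. */
    public void reset() {
        count = 0;
        firstEventTs = -1;
    }
}
```

In the real Flink job this logic lives inside a custom `Trigger`, which is exactly where observation 3) comes from: the trigger state (and the buffered events) is independent of the offsets Kafka considers committed.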
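Steps 1) and 2) of the proposed batch design could be sketched as below. All names here are hypothetical; in practice the start offsets would come from the last offsets committed after a successful run, and the end offsets from the brokers' latest available offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scheduled planning step: compute
// per-partition [start, end) offset ranges for one batch run, capped
// so a single run stays bounded, and derive a container count from
// the total message count. Not a real Kafka or Flink API.
public class OffsetRangePlanner {
    /** One partition's range for this run: consume offsets [start, end). */
    public static class Range {
        public final int partition;
        public final long start, end;
        public Range(int partition, long start, long end) {
            this.partition = partition; this.start = start; this.end = end;
        }
    }

    /**
     * committed[p] = last committed offset for partition p,
     * latest[p]    = latest available offset for partition p;
     * each range is capped at maxPerPartition messages.
     */
    public static List<Range> plan(long[] committed, long[] latest, long maxPerPartition) {
        List<Range> ranges = new ArrayList<>();
        for (int p = 0; p < committed.length; p++) {
            long end = Math.min(latest[p], committed[p] + maxPerPartition);
            if (end > committed[p]) {
                ranges.add(new Range(p, committed[p], end));
            }
        }
        return ranges;
    }

    /** Size the job: one container per messagesPerContainer messages, at least one. */
    public static int containers(List<Range> ranges, long messagesPerContainer) {
        long total = 0;
        for (Range r : ranges) total += r.end - r.start;
        return (int) Math.max(1, (total + messagesPerContainer - 1) / messagesPerContainer);
    }
}
```

Step 3) would then commit the end offsets back to Kafka only after the HDFS files for the run are closed, so the committed offsets always match what is durable on HDFS.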