I understand Flink does streaming, but I feel my requirement is more
batch oriented.

Read from a Kafka cluster,
Do a little data massaging,
Bucket the data into Hadoop files that are at least one HDFS block in
size (a rough sketch of this pipeline follows the list).
Our environment is YARN and Kerberized (both Kafka and Hadoop; I am
currently allowed to pass the keytab to the containers).
Signal downstream processing based on timestamps in the data and Kafka
offsets.
The process has to run forever.
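
The pipeline is roughly the following skeleton (a minimal sketch,
assuming Flink 1.2+ APIs such as FlinkKafkaConsumer09 and BucketingSink;
the broker, topic, group id and HDFS path are placeholders, and the
Kerberos client settings are omitted):

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaToHdfsJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);   // checkpoint every minute

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");   // placeholder
        props.setProperty("group.id", "kafka-to-hdfs");          // placeholder
        // Kerberos/SASL settings for the Kafka client are omitted here

        // roll files once they reach roughly one HDFS block (128 MB assumed)
        BucketingSink<String> sink = new BucketingSink<>("/data/landing");   // placeholder path
        sink.setBatchSize(128L * 1024 * 1024);

        env.addSource(new FlinkKafkaConsumer09<String>("events", new SimpleStringSchema(), props))
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value;   // the "data massaging" step would go here
               }
           })
           .addSink(sink);

        env.execute("kafka-to-hdfs");
    }
}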

I used a Flink streaming job with checkpointing and windowing (a custom
trigger that fires on either a count of events or a timeout, inspired by
https://gist.github.com/shikhar/2cb9f1b792be31b7c16e/9f08f8964b3f177fe48f6f8fc3916932fdfdcc71
).
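
The trigger is along these lines (a minimal sketch against the Flink
1.2+ Trigger API; the class name, the count threshold and the use of a
processing-time timer are my own choices, not necessarily identical to
what I run):

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

// Fires the window when either maxCount elements have been buffered or
// the window's processing-time timer expires, whichever happens first.
public class CountOrTimeTrigger<T> extends Trigger<T, TimeWindow> {

    private final long maxCount;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

    public CountOrTimeTrigger(long maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window,
                                   TriggerContext ctx) throws Exception {
        // time-based path: fire when the window's processing time runs out
        ctx.registerProcessingTimeTimer(window.maxTimestamp());

        // count-based path: fire early once maxCount elements have arrived
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window,
                                          TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
        ctx.getPartitionedState(countDesc).clear();
    }

    private static class Sum implements ReduceFunction<Long> {
        @Override
        public Long reduce(Long a, Long b) {
            return a + b;
        }
    }
}

It is plugged into a keyed window via
.window(...).trigger(new CountOrTimeTrigger<>(100000)).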


Observations with streaming:

1) Long-running Kerberos authentication fails after 7 days (the data held
in the window buffer is lost, and a restart results in event loss).
2) I hold on to the resources/containers in the cluster at all times,
irrespective of the volume of events.
3) The committed offsets in Kafka do not reflect the last offsets written
to HDFS (Kafka offsets may be committed/checkpointed while the window is
yet to trigger).
4) Windowing is similar to batch in a way: the files are not
available/rolled until the file is closed.

Is there a way the Kafka connector can take start and stop values for
offsets? That would be ideal for my scenario. The design in that case
would be (a rough sketch follows the list):

1) Run a scheduled job that calculates the offset ranges to be consumed.
2) Base the number and size of the containers on the number of messages
to be consumed.
3) Commit the offsets only after the job is successful.
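
To make this concrete, a minimal sketch of such a run using the plain
Kafka consumer API (assuming a kafka-clients 2.x consumer for
poll(Duration) and endOffsets; the broker, topic, group id and the
writeToHdfs step are placeholders, and the Kerberos/JAAS settings are
omitted):

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetRangeJob {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");    // placeholder
        props.put("group.id", "hdfs-batch-loader");       // placeholder
        props.put("enable.auto.commit", "false");         // commit only after a successful run
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // Kerberos/JAAS settings omitted

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor("events").forEach(p ->          // placeholder topic
                    partitions.add(new TopicPartition(p.topic(), p.partition())));
            consumer.assign(partitions);

            // 1) calculate the offset range for this run: last committed -> current end;
            //    the size of the range could also drive how many containers to request
            Map<TopicPartition, Long> stopOffsets = consumer.endOffsets(partitions);
            Set<TopicPartition> done = new HashSet<>();
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long start = committed != null ? committed.offset() : 0L;
                consumer.seek(tp, start);
                if (start >= stopOffsets.get(tp)) {
                    done.add(tp);                                  // nothing new on this partition
                    consumer.pause(Collections.singleton(tp));
                }
            }

            // 2) consume up to the stop offsets and hand records to the HDFS writer
            while (done.size() < partitions.size()) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    TopicPartition tp = new TopicPartition(record.topic(), record.partition());
                    if (record.offset() < stopOffsets.get(tp)) {
                        // writeToHdfs(record);   // placeholder for massaging + bucketing into HDFS
                    }
                    if (record.offset() + 1 >= stopOffsets.get(tp) && done.add(tp)) {
                        consumer.pause(Collections.singleton(tp));
                    }
                }
            }

            // 3) commit only after the HDFS files have been closed/rolled, so the
            //    committed offsets always match what is durably in HDFS
            Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
            stopOffsets.forEach((tp, off) -> toCommit.put(tp, new OffsetAndMetadata(off)));
            consumer.commitSync(toCommit);
        }
    }
}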

Please let me know if there is a better way.

Thanks,
Prabhu
