Hi, I've been doing some prototyping on Kafka for a few months now and I like what I see. It's a good fit for several of my use cases, both for data distribution and for processing - I'm liking a lot of what I see in Samza too. I'm now working through some of the operational issues and have a question for the community.
I have several data sources that I want to push into Kafka, but some of the most important arrive as a stream of files dropped into either an SFTP location or S3. Conceptually the data really is a stream, but it's being chunked into batches by the deployment model of the operational servers, so pulling it into Kafka and treating it as a stream again is a big plus. However, I really don't want duplicate messages. I know Kafka provides at-least-once semantics and that's fine; I'm happy for the de-dupe logic to live outside Kafka.

On the producer side I can build a protocol around adding record metadata and using Zookeeper, which should give me pretty high confidence that my consumers can tell whether the file they are reading from was fully published into Kafka or not. (I've put a rough sketch of what I mean at the bottom of this mail.)

I had assumed this wouldn't be a unique use case, but after a fair bit of searching I haven't found much in the way of tools, or even best-practice patterns, for supporting this kind of exactly-once message processing. So now I'm thinking that either I just need better web search skills, or this isn't something many others are doing - and if so, there's probably a reason for that. Any thoughts?

Thanks,
Garry
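P.S. Just to make the question concrete, here's roughly what I have in mind on the producer side: key each record by a file id, carry a per-file sequence number in the value, and finish with an end-of-file marker so consumers can de-dupe on (file id, sequence) and confirm the whole file made it in. This is only a sketch - the topic name, the pipe-delimited encoding, and the settings are placeholders, and the Zookeeper bookkeeping (recording which files are fully published) isn't shown.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class FileToKafkaPublisher {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // Assumes file names are unique per drop; a hash or upload id would also work.
        String fileId = file.getFileName().toString();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all so a record only counts as published once the broker has acknowledged it.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {

            long seq = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // Key = fileId, value carries the per-file sequence number so a
                // consumer can de-dupe on (fileId, seq) if the file is ever replayed.
                String value = fileId + "|" + seq + "|" + line;
                producer.send(new ProducerRecord<>("file-ingest", fileId, value));
                seq++;
            }

            // End-of-file marker: tells consumers the file was fully published and
            // how many records to expect. The same fact could be recorded in Zookeeper.
            String marker = fileId + "|EOF|" + seq;
            producer.send(new ProducerRecord<>("file-ingest", fileId, marker));
            producer.flush();
        }
    }
}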