It sounds like you want to use Spark / Spark Streaming to do that kind of batching output.
From: Milind Vaidya <[email protected]>
Reply-To: [email protected]
Date: Wednesday, May 11, 2016 at 4:24 PM
To: [email protected]
Subject: Re: Getting Kafka Offset in Storm Bolt

Yeah. We have some micro-batching in place for other topologies. This one is a little ambitious, in the sense that each message is 1-2 KB in size, so grouping messages into a reasonably sized chunk is necessary, say 500 KB to 1 GB (this is just my guess; I am not sure what S3 supports or what is optimal). Once that chunk is uploaded, all of the messages in it can be acked. But isn't that overkill? I guess Storm is not even meant to support that kind of use case.

On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <[email protected]> wrote:

You can micro-batch the Kafka contents into a file that is replicated (e.g. on HDFS) and then ack all of the input tuples after the file has been closed.

On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <[email protected]> wrote:

In case of a failure to upload a file, or disk corruption leading to loss of a file, we have only the current offset in the Kafka spout but no record of which offsets were lost with the file and need to be replayed. These offsets could be stored externally in ZooKeeper and used to account for lost data. For them to be saved in ZooKeeper, they need to be available in a bolt.

On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <[email protected]> wrote:

Why not just ack the tuple once it has been written to a file? If your topology fails, the data will be re-read from Kafka; the Kafka spout already does this for you. Uploading files to S3 is then the responsibility of another job, for example a Storm topology that monitors the output folder. Tracking the data from Kafka all the way out to S3 seems unnecessary.
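The micro-batch-then-ack pattern Nathan describes can be sketched roughly as follows. This is plain Python modeling the bolt's logic, not storm-kafka API: the class name `FileBatchBolt`, the size threshold, and the `(offset, payload)` tuple shape are all illustrative assumptions, with `ack_fn` standing in for `collector.ack(tuple)`.

```python
import os

class FileBatchBolt:
    """Illustrative sketch: buffer messages, write them to a file once the
    batch reaches a size threshold, and only then ack every buffered tuple.
    Until the file is safely closed, nothing is acked, so a crash makes the
    Kafka spout replay the unacked messages."""

    def __init__(self, out_dir, max_batch_bytes, ack_fn):
        self.out_dir = out_dir
        self.max_batch_bytes = max_batch_bytes
        self.ack_fn = ack_fn          # stand-in for collector.ack(tuple)
        self.buffer = []              # list of (offset, payload) pairs
        self.buffered_bytes = 0

    def execute(self, offset, payload):
        self.buffer.append((offset, payload))
        self.buffered_bytes += len(payload)
        if self.buffered_bytes >= self.max_batch_bytes:
            self._rotate()

    def _rotate(self):
        # Name the file by the offset range it contains, so a later upload
        # failure can be mapped straight back to the offsets to replay.
        first, last = self.buffer[0][0], self.buffer[-1][0]
        path = os.path.join(self.out_dir, "batch-%d-%d.log" % (first, last))
        with open(path, "w") as f:
            for _, payload in self.buffer:
                f.write(payload + "\n")
        # The file is closed (durable) -- only now is it safe to ack.
        for offset, _ in self.buffer:
            self.ack_fn(offset)
        self.buffer = []
        self.buffered_bytes = 0
```

Encoding the offset range in the file name (or recording it in ZooKeeper at rotation time) is what makes the "which offsets were lost with the file" question answerable after the fact.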
On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <[email protected]> wrote:

It does not matter, in the sense that I am ready to upgrade if this is on the roadmap. Nonetheless:

kafka_2.9.2-0.8.1.1
apache-storm-0.9.4

On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <[email protected]> wrote:

Which version of storm-kafka are you using?

On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <[email protected]> wrote:

Anybody? Anything about this?

On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <[email protected]> wrote:

Is there any way I can know which Kafka offset corresponds to the current tuple I am processing in a bolt?

Use case: I need to batch events from Kafka, persist them to a local file, and eventually upload the file to S3. To manage failure cases, I need to know the Kafka offset for each message, so that it can be persisted to ZooKeeper and used when writing / uploading the file.

--
Regards,
Abhishek Agarwal
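The original question, exposing the offset inside a bolt, is the prerequisite for this kind of recovery. In the storm-kafka releases of that era the offset lives in the spout's internal message id (used for acking) rather than in the tuple itself, so a common workaround is to extend the spout or its deserialization scheme to emit the offset as an extra tuple field. Once the offset is a tuple field, the ZooKeeper idea from the thread reduces to recording which offset range went into which file. A minimal in-memory sketch of that bookkeeping follows; `OffsetRangeTracker` and its method names are assumptions for illustration, and in a real topology the records would go to ZooKeeper (e.g. via Curator) rather than a dict.

```python
class OffsetRangeTracker:
    """Illustrative bookkeeping: remember which contiguous Kafka offset
    range went into which output file, so a failed upload or corrupted
    file can be mapped back to the offsets that must be replayed."""

    def __init__(self):
        self.ranges = {}   # file name -> (first_offset, last_offset)

    def record(self, filename, first_offset, last_offset):
        # In a real topology this write would go to ZooKeeper, keyed by
        # file name, at the moment the file is rotated.
        self.ranges[filename] = (first_offset, last_offset)

    def to_replay(self, failed_files):
        """Offsets that must be re-fetched from Kafka for the given files."""
        offsets = []
        for name in failed_files:
            first, last = self.ranges[name]
            offsets.extend(range(first, last + 1))
        return offsets
```

This keeps the Storm-side contract simple: the spout replays anything unacked, while this external record covers the window after acking but before a successful S3 upload.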
