Yeah. We have some micro-batching in place for other topologies. This one is a little ambitious, in the sense that each message is 1~2 KB in size, so grouping them into a reasonably sized chunk, say 500 KB ~ 1 GB, is necessary (this is just my guess; I am not sure what S3 supports or what is optimal). Once that chunk is uploaded, all of them can be acked. But isn't that overkill? I guess Storm is not even meant to support that kind of use case.
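For what it's worth, a single S3 PUT accepts objects up to 5 GB, and multipart upload parts must be between 5 MB and 5 GB, so anything in the 500 KB ~ 1 GB range is well within limits. A minimal sketch of the size-based batching itself (plain Java, no Storm API — `MicroBatcher` and its names are illustrative, not from any library) might look like:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative size-based micro-batcher: accumulates small messages and
 * hands back the whole batch once it crosses a byte threshold, so the
 * caller can upload the batch and then ack every buffered tuple at once.
 */
class MicroBatcher {
    private final long flushBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long bufferedBytes = 0;

    MicroBatcher(long flushBytes) {
        this.flushBytes = flushBytes;
    }

    /**
     * Buffer one message. Returns null while still accumulating; once the
     * threshold is reached, returns the complete batch and resets.
     */
    List<byte[]> add(byte[] message) {
        pending.add(message);
        bufferedBytes += message.length;
        if (bufferedBytes < flushBytes) {
            return null;  // keep accumulating; caller must not ack yet
        }
        List<byte[]> batch = new ArrayList<>(pending);
        pending.clear();
        bufferedBytes = 0;
        return batch;     // caller uploads this, then acks all its tuples
    }
}
```

The key point is that tuples buffered in `pending` stay un-acked (within the topology's message timeout) until their batch is safely uploaded, so a crash before upload just means Kafka replays them.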
On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <[email protected]> wrote:

> You can micro-batch the Kafka contents into a file that's replicated (e.g.
> HDFS) and then ack all of the input tuples after the file has been closed.
>
> On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <[email protected]> wrote:
>
>> In case of failure to upload a file, or disk corruption leading to loss
>> of the file, we have only the current offset in the Kafka spout but no
>> record of which offsets were lost with the file and need to be replayed.
>> These can be stored externally in ZooKeeper and used to account for lost
>> data. For them to be saved in ZK, they need to be available in a bolt.
>>
>> On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <[email protected]> wrote:
>>
>>> Why not just ack the tuple once it's been written to a file? If your
>>> topology fails then the data will be re-read from Kafka; the Kafka
>>> spout already does this for you. Uploading files to S3 is then the
>>> responsibility of another job, for example a Storm topology that
>>> monitors the output folder.
>>>
>>> Tracking the data from Kafka all the way out to S3 seems unnecessary.
>>>
>>> On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <[email protected]>
>>> wrote:
>>>
>>>> It does not matter, in the sense that I am ready to upgrade if this
>>>> feature is on the roadmap.
>>>>
>>>> Nonetheless:
>>>>
>>>> kafka_2.9.2-0.8.1.1
>>>> apache-storm-0.9.4
>>>>
>>>> On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <[email protected]>
>>>> wrote:
>>>>
>>>>> Which version of storm-kafka are you using?
>>>>>
>>>>> On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Anybody? Anything about this?
>>>>>>
>>>>>> On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Is there any way I can know what Kafka offset corresponds to the
>>>>>>> current tuple I am processing in a bolt?
>>>>>>>
>>>>>>> Use case: I need to batch events from Kafka, persist them to a
>>>>>>> local file and eventually upload it to S3. To manage failure
>>>>>>> cases, I need to know the Kafka offset for a message, so that it
>>>>>>> can be persisted to ZooKeeper and used to write / upload the file.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Abhishek Agarwal
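On the original question in the thread: storm-kafka's stock schemes emit only the message payload, so to see the offset inside a bolt the spout has to be configured with a custom `Scheme` that emits the partition and offset alongside the payload. Once that is done, the bookkeeping a bolt needs for the "which offsets were lost with the file" failure case is just a per-partition first/last offset range per batch file. A sketch of that bookkeeping, in plain Java with illustrative names (no Storm or ZooKeeper dependency):

```java
import java.util.HashMap;
import java.util.Map;

/** First and last Kafka offsets covered by the current batch file. */
class OffsetRange {
    long first = Long.MAX_VALUE;
    long last = Long.MIN_VALUE;
}

/**
 * Illustrative per-partition offset tracking for one batch file. If the
 * upload fails, the recorded ranges say exactly which offsets to replay;
 * if it succeeds, the `last` offsets can be persisted (e.g. to ZooKeeper)
 * as the committed high-water marks.
 */
class BatchOffsetTracker {
    private final Map<Integer, OffsetRange> ranges = new HashMap<>();

    /** Record one message's partition and offset as it is written. */
    void record(int partition, long offset) {
        OffsetRange r = ranges.computeIfAbsent(partition, p -> new OffsetRange());
        r.first = Math.min(r.first, offset);
        r.last = Math.max(r.last, offset);
    }

    /** Offsets [first, last] covered by the current file for a partition. */
    OffsetRange rangeFor(int partition) {
        return ranges.get(partition);
    }
}
```

That said, per Nathan's point above, if the tuples are simply left un-acked until the file is closed (or uploaded), a failure makes the Kafka spout replay them automatically, and none of this external offset bookkeeping is needed.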
