Yeah. We have some microbatching in place for other topologies. This one is
a little ambitious, in the sense that each message is 1~2 KB in size, so
grouping them into a reasonably sized chunk is necessary, say 500 KB ~ 1 GB
(this is just my guess; I am not sure what S3 supports or what is optimal).
Once that chunk is uploaded, all of the messages in it can be acked. But
isn't that overkill? I guess Storm is not even meant to support that kind
of use case.
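
For what it's worth, a minimal sketch of what that batching bolt could look
like, assuming the AWS SDK v1 client (AmazonS3#putObject) and the
backtype.storm packages from the 0.9.x line (newer releases use
org.apache.storm); the bucket name, key scheme, and 64 MB threshold are
placeholders, not recommendations:

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class S3BatchBolt extends BaseRichBolt {
    private static final long FLUSH_BYTES = 64L * 1024 * 1024; // placeholder size
    private transient AmazonS3 s3;
    private transient OutputCollector collector;
    private transient List<Tuple> pending;          // tuples held un-acked
    private transient ByteArrayOutputStream buffer; // their raw bytes

    public void prepare(Map conf, TopologyContext ctx, OutputCollector oc) {
        collector = oc;
        s3 = AmazonS3ClientBuilder.defaultClient();
        pending = new ArrayList<Tuple>();
        buffer = new ByteArrayOutputStream();
    }

    public void execute(Tuple tuple) {
        byte[] msg = tuple.getBinary(0); // assumes the spout emits raw byte[] payloads
        buffer.write(msg, 0, msg.length);
        pending.add(tuple);              // do not ack until the chunk is durable
        if (buffer.size() >= FLUSH_BYTES) {
            flush();
        }
    }

    private void flush() {
        byte[] chunk = buffer.toByteArray();
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(chunk.length);
        s3.putObject("my-bucket",        // made-up bucket and key scheme
                "chunks/" + System.currentTimeMillis(),
                new ByteArrayInputStream(chunk), meta);
        for (Tuple t : pending) {
            collector.ack(t);            // ack only after the upload succeeds
        }
        pending.clear();
        buffer.reset();
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}

On the "overkill" question: every buffered tuple stays un-acked until the
flush, so topology.message.timeout.secs has to comfortably exceed the time
to fill and upload a chunk. That argues for chunks nearer the 500 KB end
than the 1 GB end.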

On Wed, May 11, 2016 at 12:59 PM, Nathan Leung <[email protected]> wrote:

> You can micro-batch Kafka contents into a file that's replicated (e.g.
> HDFS) and then ack all of the input tuples after the file has been closed.
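>
> A hedged sketch of that with the storm-hdfs external module's HdfsBolt,
> which does the batching, syncing, and file rotation for you (class names
> are from the org.apache.storm.hdfs.bolt.* packages; the NameNode URL,
> path, delimiter, and sizes are placeholders):
>
> import org.apache.storm.hdfs.bolt.HdfsBolt;
> import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
> import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
> import org.apache.storm.hdfs.bolt.format.FileNameFormat;
> import org.apache.storm.hdfs.bolt.format.RecordFormat;
> import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
> import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
> import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
> import org.apache.storm.hdfs.bolt.sync.SyncPolicy;
>
> public class HdfsBatchWiring {
>     public static HdfsBolt build() {
>         RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
>         SyncPolicy sync = new CountSyncPolicy(1000);   // sync every 1000 tuples
>         FileRotationPolicy rotation =
>                 new FileSizeRotationPolicy(128.0f, FileSizeRotationPolicy.Units.MB);
>         FileNameFormat name = new DefaultFileNameFormat().withPath("/kafka-batches/");
>         return new HdfsBolt()
>                 .withFsUrl("hdfs://namenode:8020")     // made-up NameNode URL
>                 .withFileNameFormat(name)
>                 .withRecordFormat(format)
>                 .withRotationPolicy(rotation)
>                 .withSyncPolicy(sync);
>     }
> }
>
> (Worth checking exactly when your storm-hdfs version acks; recent ones
> hold tuples and ack on sync, which gives the ack-after-durable behaviour
> described above.)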
>
> On Wed, May 11, 2016 at 3:43 PM, Milind Vaidya <[email protected]> wrote:
>
>> In case of a failure to upload a file, or disk corruption leading to loss
>> of a file, we have only the current offset in the Kafka spout but no
>> record of which offsets were lost with the file and need to be replayed.
>> So these offsets can be stored externally in ZooKeeper and used to account
>> for lost data. But to save them in ZooKeeper, they need to be available in
>> a bolt.
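>>
>> If it helps, a sketch of that bookkeeping with Apache Curator; the znode
>> layout, connect string, and the example topic/partition/offsets are made
>> up for illustration:
>>
>> import java.nio.charset.StandardCharsets;
>> import org.apache.curator.framework.CuratorFramework;
>> import org.apache.curator.framework.CuratorFrameworkFactory;
>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>
>> public class OffsetBookkeeping {
>>     public static void main(String[] args) throws Exception {
>>         CuratorFramework zk = CuratorFrameworkFactory.newClient(
>>                 "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
>>         zk.start();
>>
>>         String topic = "events";       // placeholders
>>         int partition = 0;
>>         long firstOffset = 1000L, lastOffset = 1999L;
>>
>>         // before the upload starts, record the offset range the file covers
>>         String path = "/s3-uploads/" + topic + "/" + partition + "/" + firstOffset;
>>         byte[] range = (firstOffset + "-" + lastOffset)
>>                 .getBytes(StandardCharsets.UTF_8);
>>         zk.create().creatingParentsIfNeeded().forPath(path, range);
>>
>>         // ... upload the file ...
>>
>>         // on success, remove the marker; whatever survives a crash names
>>         // exactly the offset ranges that still need replaying from Kafka
>>         zk.delete().forPath(path);
>>         zk.close();
>>     }
>> }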
>>
>> On Wed, May 11, 2016 at 11:10 AM, Nathan Leung <[email protected]> wrote:
>>
>>> Why not just ack the tuple once it's been written to a file?  If your
>>> topology fails, then the data will be re-read from Kafka; the Kafka spout
>>> already does this for you.  Then uploading files to S3 is the
>>> responsibility of another job, for example a Storm topology that monitors
>>> the output folder.
>>>
>>> Monitoring the data from Kafka all the way out to S3 seems unnecessary.
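>>>
>>> Roughly like this, assuming per-tuple acks after a flushed local write
>>> (file naming/rotation omitted; the path is made up, and flush() only
>>> pushes to the OS cache, so use an fsync/FileChannel.force() if you need
>>> hard durability):
>>>
>>> import backtype.storm.task.OutputCollector;
>>> import backtype.storm.task.TopologyContext;
>>> import backtype.storm.topology.OutputFieldsDeclarer;
>>> import backtype.storm.topology.base.BaseRichBolt;
>>> import backtype.storm.tuple.Tuple;
>>> import java.io.BufferedWriter;
>>> import java.io.FileWriter;
>>> import java.io.IOException;
>>> import java.util.Map;
>>>
>>> public class LocalFileBolt extends BaseRichBolt {
>>>     private transient BufferedWriter out;
>>>     private transient OutputCollector collector;
>>>
>>>     public void prepare(Map conf, TopologyContext ctx, OutputCollector oc) {
>>>         collector = oc;
>>>         try {
>>>             out = new BufferedWriter(new FileWriter("/data/out/current.log", true));
>>>         } catch (IOException e) {
>>>             throw new RuntimeException(e);
>>>         }
>>>     }
>>>
>>>     public void execute(Tuple tuple) {
>>>         try {
>>>             out.write(tuple.getString(0));
>>>             out.newLine();
>>>             out.flush();           // written out before acking
>>>             collector.ack(tuple);  // un-acked tuples replay from Kafka on crash
>>>         } catch (IOException e) {
>>>             collector.fail(tuple); // explicit fail -> the spout re-emits
>>>         }
>>>     }
>>>
>>>     public void declareOutputFields(OutputFieldsDeclarer d) {}
>>> }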
>>>
>>> On Wed, May 11, 2016 at 1:50 PM, Milind Vaidya <[email protected]>
>>> wrote:
>>>
>>>> It does not matter, in the sense that I am ready to upgrade if this is
>>>> on the roadmap.
>>>>
>>>> Nonetheless:
>>>>
>>>> kafka_2.9.2-0.8.1.1, apache-storm-0.9.4
>>>>
>>>> On Wed, May 11, 2016 at 5:53 AM, Abhishek Agarwal <[email protected]> wrote:
>>>>
>>>>> Which version of storm-kafka are you using?
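>>>>>
>>>>> If you can move to a newer storm-kafka, it ships (if I remember right)
>>>>> a MessageMetadataScheme that hands your Scheme the partition and
>>>>> offset, so the spout emits them as ordinary tuple fields. A rough
>>>>> sketch with the 1.x-style signatures (older lines take byte[] instead
>>>>> of ByteBuffer, so adjust accordingly):
>>>>>
>>>>> import java.nio.ByteBuffer;
>>>>> import java.nio.charset.StandardCharsets;
>>>>> import java.util.Collections;
>>>>> import java.util.List;
>>>>> import org.apache.storm.kafka.MessageMetadataScheme;
>>>>> import org.apache.storm.kafka.Partition;
>>>>> import org.apache.storm.tuple.Fields;
>>>>> import org.apache.storm.tuple.Values;
>>>>>
>>>>> public class OffsetAwareScheme implements MessageMetadataScheme {
>>>>>     public Iterable<List<Object>> deserializeMessageWithMetadata(
>>>>>             ByteBuffer message, Partition partition, long offset) {
>>>>>         String payload = StandardCharsets.UTF_8.decode(message).toString();
>>>>>         // partition and offset travel with the payload as plain fields
>>>>>         return Collections.<List<Object>>singletonList(
>>>>>                 new Values(payload, partition.partition, offset));
>>>>>     }
>>>>>
>>>>>     public List<Object> deserialize(ByteBuffer message) {
>>>>>         // not called when wrapped in MessageMetadataSchemeAsMultiScheme
>>>>>         throw new UnsupportedOperationException();
>>>>>     }
>>>>>
>>>>>     public Fields getOutputFields() {
>>>>>         return new Fields("msg", "partition", "offset");
>>>>>     }
>>>>> }
>>>>>
>>>>> Wired in via:
>>>>>
>>>>> spoutConfig.scheme = new MessageMetadataSchemeAsMultiScheme(new OffsetAwareScheme());
>>>>>
>>>>> and then any bolt can read tuple.getLongByField("offset").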
>>>>>
>>>>> On Wed, May 11, 2016 at 12:29 AM, Milind Vaidya <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Anybody? Anything about this?
>>>>>>
>>>>>> On Wed, May 4, 2016 at 11:31 AM, Milind Vaidya <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Is there any way I can know which Kafka offset corresponds to the
>>>>>>> current tuple I am processing in a bolt?
>>>>>>>
>>>>>>> Use case: I need to batch events from Kafka, persist them to a local
>>>>>>> file, and eventually upload it to S3. To manage failure cases, I need
>>>>>>> to know the Kafka offset for each message, so that it can be
>>>>>>> persisted to ZooKeeper and used when writing / uploading the file.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Abhishek Agarwal
>>>>>
>>>>>
>>>>
>>>
>>
>
