Would you please contribute this to open source? What you've written
has been asked for many times. FWIW, I would immediately incorporate
it into my book, Agile Data.

Russell Jurney http://datasyndrome.com

On Dec 28, 2012, at 8:06 AM, Liam Stewart <liam.stew...@gmail.com> wrote:

> We have a tool that reads data continuously from brokers and then writes
> files to S3. An MR job didn't make sense for us given our current size and
> volume. We have one instance running right now and could add more if
> needed, adjusting which instance reads from which brokers/topics/...
> Unfortunately, part of the implementation is tied to our internals so I
> can't open source it at this point. The idea is roughly:
>
> - one simple consumer per broker / topic / partition; reads data in batches
> and writes to local temp files; reads are done in parallel
> - files are finalized after reaching a given size or age (to better handle
> low-volume topics) and then written to S3. All uploads are done in a
> separate thread pool (using the AWS transfer manager) and don't block reads
> from Kafka unless too many files get backed up, either because of problems
> with S3 or because of upload speed
> - after a file is written, the next offset to read is written to ZooKeeper
>
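A rough, untested sketch of the flow described above, in Java: the Kafka fetch
itself is elided (append() would be fed by a SimpleConsumer loop), and the
bucket name, key scheme, ZooKeeper path, and thresholds are made-up
placeholders rather than anything from the actual tool.

// Sketch only, not the tool described above. One instance per
// broker/topic/partition; append() would be fed by a SimpleConsumer fetch
// loop. Bucket, key scheme, ZK path, and thresholds are placeholders.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.transfer.TransferManager;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PartitionArchiver {
  private static final long MAX_BYTES = 64L * 1024 * 1024;  // roll at 64 MB...
  private static final long MAX_AGE_MS = 10 * 60 * 1000;    // ...or 10 minutes

  private final TransferManager transfers =
      new TransferManager(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
  private final ZooKeeper zk;
  private final String bucket = "my-kafka-archive";          // placeholder
  private final String offsetPath;  // e.g. /archiver/<topic>/<partition>/offset

  private File current;
  private FileOutputStream out;
  private long bytesWritten;
  private long openedAt;

  public PartitionArchiver(ZooKeeper zk, String offsetPath) throws IOException {
    this.zk = zk;
    this.offsetPath = offsetPath;
    roll();
  }

  /** Append one fetched batch; finalize the file on size or age. */
  public void append(List<byte[]> messages, long nextOffset) throws Exception {
    for (byte[] m : messages) {
      out.write(m);
      out.write('\n');
      bytesWritten += m.length + 1;
    }
    boolean tooBig = bytesWritten >= MAX_BYTES;
    boolean tooOld = System.currentTimeMillis() - openedAt >= MAX_AGE_MS;
    if (tooBig || tooOld) {
      finalizeAndUpload(nextOffset);
      roll();
    }
  }

  private void finalizeAndUpload(long nextOffset) throws Exception {
    out.close();
    String key = "topic/partition-0/" + openedAt + ".log";   // placeholder key
    // TransferManager uploads on its own thread pool, so this returns quickly
    // and the fetch loop is not blocked. A real implementation would commit
    // the offset only after the upload completes (e.g. waitForCompletion()
    // on the upload thread); here it is committed immediately to keep the
    // sketch short.
    transfers.upload(bucket, key, current);
    byte[] data = Long.toString(nextOffset).getBytes("UTF-8");
    if (zk.exists(offsetPath, false) == null) {
      zk.create(offsetPath, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT);
    } else {
      zk.setData(offsetPath, data, -1);
    }
  }

  private void roll() throws IOException {
    current = File.createTempFile("kafka-archive-", ".log");
    out = new FileOutputStream(current);
    bytesWritten = 0;
    openedAt = System.currentTimeMillis();
  }
}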
> For me, part of the implementation was an exercise to get experience with
> ZooKeeper, which was one reason for using the lower-level API and handling
> offset tracking etc. myself.
>
> On Fri, Dec 28, 2012 at 6:56 AM, Pratyush Chandra <
> chandra.praty...@gmail.com> wrote:
>
>> I went through the source code of the Hadoop consumer in contrib. It
>> doesn't seem to be using the previous offset at all, neither in the
>> DataGenerator nor in the map-reduce stage.
>>
>> Before I go into the implementation, I can think of 2 approaches:
>> 1. A ConsumerConnector receiving all the messages continuously and then
>> writing them to HDFS (in this case S3). The problem is that autocommit is
>> handled internally, and there is no hook invoked when the offset is
>> committed that could be used to upload the file.
>> 2. Wake up every minute, pull all the data using a SimpleConsumer into a
>> local file, and put it to HDFS.
>>
>> So, what is the better approach?
>> - Listen continuously vs. in batches
>> - Use a ConsumerConnector (where auto commit/offsets are handled
>> internally) vs. a SimpleConsumer (which does not use ZK, so I need to
>> connect to each broker individually)
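For what it's worth, option 1 is not completely blocked by autocommit: you can
disable autocommit in the ConsumerConfig and call commitOffsets() on the
ConsumerConnector yourself once the upload has finished, which gives you the
missing hook. A rough, untested sketch of that shape (the property names and
stream plumbing differ between Kafka versions, so the message iterator is
passed in rather than spelled out, and the bucket/key names are placeholders):

// Sketch only: option 1 with autocommit disabled and offsets committed
// manually after each upload. `messages` stands in for whatever iterator you
// build from createMessageStreams(); bucket and key names are placeholders.
import java.io.File;
import java.io.FileOutputStream;
import java.util.Iterator;

import com.amazonaws.services.s3.AmazonS3;
import kafka.javaapi.consumer.ConsumerConnector;

public class MinuteBatchUploader {

  public void run(ConsumerConnector connector, Iterator<byte[]> messages,
                  AmazonS3 s3, String bucket) throws Exception {
    final long windowMs = 60 * 1000;
    while (true) {
      File batch = File.createTempFile("kafka-batch-", ".log");
      FileOutputStream out = new FileOutputStream(batch);
      long deadline = System.currentTimeMillis() + windowMs;

      // Drain messages into a local file for roughly one minute. Note that a
      // real Kafka consumer iterator blocks in hasNext() when there is no
      // data, so a consumer timeout would be needed to honor the deadline.
      while (System.currentTimeMillis() < deadline && messages.hasNext()) {
        out.write(messages.next());
        out.write('\n');
      }
      out.close();

      // Upload the finished file, then commit. Because autocommit is off, a
      // crash before commitOffsets() just means the batch is re-consumed on
      // restart (at-least-once delivery).
      s3.putObject(bucket, "batches/" + deadline + ".log", batch);
      connector.commitOffsets();
      batch.delete();
    }
  }
}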
>>
>> Pratyush
>>
>> On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <mum...@gmail.com> wrote:
>>
>>> I don't think anything like this exists in Kafka (or contrib), but it
>>> would be a useful addition! Personally, I have written this exact thing at
>>> previous jobs.
>>>
>>> As for the Hadoop consumer, since there is a FileSystem implementation for
>>> S3 in Hadoop, it should be possible. The Hadoop consumer works by writing
>>> out data files containing the Kafka messages alongside offset files which
>>> contain the last offset read for each partition. If it is re-consuming
>>> from zero each time you run it, it means it's not finding the offset files
>>> from the previous run.
>>>
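One quick way to check both points: the same Hadoop FileSystem API resolves an
s3n:// URI once S3 credentials are configured, so it should let you point the
consumer's output at S3, and listing the output directory shows whether the
offset files from the previous run are actually there. A small, untested
illustration (the bucket and directory are placeholders, not what the contrib
code writes):

// Illustration only: lists whatever is under an output directory, whether it
// lives on HDFS or on S3 via the s3n:// FileSystem. Paths are placeholders.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListConsumerOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials for the s3n:// scheme; not needed for hdfs:// paths.
    conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");

    String dir = args.length > 0 ? args[0]
        : "s3n://my-bucket/kafka-consumer/offsets/";   // placeholder
    FileSystem fs = FileSystem.get(URI.create(dir), conf);
    for (FileStatus status : fs.listStatus(new Path(dir))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}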
>>> Having used it a bit, the Hadoop consumer is certainly an area that could
>>> use improvement.
>>>
>>> HTH,
>>> David
>>>
>>>
>>> On 12/27/12 4:41 AM, Pratyush Chandra wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for an S3-based consumer which can write all the received
>>>> events to an S3 bucket (say every minute), something similar to the Flume
>>>> HDFS sink: http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
>>>> I have tried evaluating the hadoop-consumer in the contrib folder, but it
>>>> seems to be more for offline processing: it will fetch everything from
>>>> offset 0 at once and replace it in the S3 bucket.
>>>> Any help would be appreciated.
>>>>
>>>>
>>>
>>
>>
>> --
>> Pratyush Chandra
>>
>
>
>
> --
> Liam Stewart :: liam.stew...@gmail.com
