Would you please contribute this to open source? What you've written has been asked for many times. FWIW, I would immediately incorporate it into my book, Agile Data.
Russell Jurney
http://datasyndrome.com

On Dec 28, 2012, at 8:06 AM, Liam Stewart <liam.stew...@gmail.com> wrote:

> We have a tool that reads data continuously from brokers and then writes
> files to S3. An MR job didn't make sense for us given our current size
> and volume. We have one instance running right now and could add more if
> needed, adjusting which instance reads from which brokers/topics/...
> Unfortunately, part of the implementation is tied to our internals, so I
> can't open source it at this point. The idea is roughly:
>
> - one simple consumer per broker / topic / partition; reads data in
>   batches and writes to local temp files; reads are done in parallel
> - files are finalized after a given size or age (to better handle low
>   volume topics) and then written to S3. All uploads are done in a
>   separate thread pool (using the AWS transfer manager) and don't block
>   reads from Kafka unless too many files get backed up due to either
>   problems with S3 or upload speed
> - after a file has been written to S3, the next offset to read is
>   written to ZooKeeper
>
> For me, part of the implementation was an exercise to get experience
> with ZooKeeper, so that was one reason for using the lower-level API and
> handling offset tracking etc. myself.
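A minimal sketch of the finalize step Liam describes above: upload the
rolled temp file with the AWS SDK's TransferManager, then record the next
offset in ZooKeeper only once the upload has succeeded. This assumes the
Java AWS SDK and the plain ZooKeeper client; the class name, bucket, key
layout, and znode path are made up for illustration and aren't taken from
his implementation.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class S3Finalizer {
    private final TransferManager transfers =
        new TransferManager(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
    private final ZooKeeper zk;
    private final String bucket;

    public S3Finalizer(ZooKeeper zk, String bucket) {
        this.zk = zk;
        this.bucket = bucket;
    }

    // Called from an upload worker thread once a temp file hits the size
    // or age threshold; the fetch loop keeps reading while this runs.
    public void finalizeAndUpload(File file, String topic, int partition,
                                  long nextOffset) throws Exception {
        String key = "kafka/" + topic + "/" + partition + "/" + file.getName();
        Upload upload = transfers.upload(bucket, key, file);
        upload.waitForCompletion(); // blocks only this worker thread

        // Advance the offset only after the upload succeeds; a crash in
        // between means the file is re-read and re-uploaded, not lost.
        String path = "/consumers/s3-archiver/" + topic + "/" + partition;
        byte[] data = Long.toString(nextOffset).getBytes("UTF-8");
        if (zk.exists(path, false) == null) {
            // parent znodes are assumed to already exist
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        } else {
            zk.setData(path, data, -1); // -1 skips the version check
        }

        file.delete();
    }
}

Committing the offset only after the upload gives at-least-once behaviour:
a crash between the two steps re-uploads a file rather than dropping data.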
> On Fri, Dec 28, 2012 at 6:56 AM, Pratyush Chandra
> <chandra.praty...@gmail.com> wrote:
>
>> I went through the source code of the Hadoop consumer in contrib. It
>> doesn't seem to be using the previous offset at all, neither in the
>> DataGenerator nor in the map-reduce stage.
>>
>> Before I go into the implementation, I can think of 2 ways:
>> 1. A ConsumerConnector receiving all the messages continuously and then
>>    writing them to HDFS (in this case S3). The problem is that autocommit
>>    is handled internally, and there is no handler function invoked while
>>    committing the offset that could be used to upload the file.
>> 2. Wake up every minute, pull all the data using a simple consumer into
>>    a local file, and put it to HDFS.
>>
>> So, which is the better approach?
>> - Listen continuously vs. in batches
>> - Use a ConsumerConnector (where auto commit/offsets are handled
>>   internally) vs. a simple consumer (which does not use ZK, so I need to
>>   connect to each broker individually)
>>
>> Pratyush
>>
>> On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <mum...@gmail.com> wrote:
>>
>>> I don't think anything like this exists in Kafka (or contrib), but it
>>> would be a useful addition! Personally, I have written this exact thing
>>> at previous jobs.
>>>
>>> As for the Hadoop consumer, since there is a FileSystem implementation
>>> for S3 in Hadoop, it should be possible. The Hadoop consumer works by
>>> writing out data files containing the Kafka messages alongside offset
>>> files which contain the last offset read for each partition. If it is
>>> re-consuming from zero each time you run it, it means it's not finding
>>> the offset files from the previous run.
>>>
>>> Having used it a bit, the Hadoop consumer is certainly an area that
>>> could use improvement.
>>>
>>> HTH,
>>> David
>>>
>>> On 12/27/12 4:41 AM, Pratyush Chandra wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for an S3-based consumer which can write all the received
>>>> events to an S3 bucket (say every minute), something similar to the
>>>> Flume HDFS sink:
>>>> http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
>>>> I have tried evaluating the hadoop-consumer in the contrib folder, but
>>>> it seems to be more for offline processing: it fetches everything from
>>>> offset 0 at once and replaces it in the S3 bucket.
>>>> Any help would be appreciated.
>>
>> --
>> Pratyush Chandra
>
> --
> Liam Stewart :: liam.stew...@gmail.com
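On David's point about the Hadoop consumer and S3: Hadoop ships a native S3
FileSystem (the s3n:// scheme), so the same FileSystem calls that write data
and offset files to HDFS can target a bucket directly. Below is a minimal,
self-contained sketch assuming Hadoop 1.x property names; the bucket, paths,
and class name are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials for the native S3 FileSystem (Hadoop 1.x property names).
        conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");

        FileSystem fs = FileSystem.get(URI.create("s3n://my-archive-bucket/"), conf);

        // Same API as HDFS; s3n buffers to a local file and uploads when the
        // stream is closed, so data only shows up in the bucket after close().
        Path out = new Path("s3n://my-archive-bucket/kafka/mytopic/0/events-00000001");
        FSDataOutputStream stream = fs.create(out);
        try {
            stream.write("one event per line\n".getBytes("UTF-8"));
        } finally {
            stream.close();
        }
        fs.close();
    }
}

The Flume-HDFSSink-style behaviour of rolling every minute then comes down
to closing the stream and creating a new path on a timer or size threshold,
much like the size/age finalization Liam describes.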