David,

That's interesting. Kafka provides an infinite stream of data whereas Pig works on a finite amount of data. How did you solve the mismatch?
Thanks,

Jun

On Wed, Aug 7, 2013 at 7:41 AM, David Arthur <mum...@gmail.com> wrote:

> I've thrown together a Pig LoadFunc to read data from Kafka, so you could
> load data like:
>
> QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
> com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');
>
> The path part of the uri is the Kafka topic, and the fragment is the
> number of partitions. In the implementation I have, it makes one input
> split per partition. Offsets are not really dealt with at this point - it's
> a rough prototype.
>
> Anyone have thoughts on whether or not this is a good idea? I know usually
> the pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this
> data from Kafka once, is there any reason why I can't skip writing to HDFS?
>
> Thanks!
> -David
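For readers unfamiliar with Pig's loader API, below is a minimal sketch of what a LoadFunc along the lines David describes might look like. The class name matches his Pig Latin example, but the method bodies, the configuration keys (kafka.broker, kafka.topic, kafka.partitions), and the omitted Kafka InputFormat are illustrative assumptions, not his actual prototype; only the URI convention (path = topic, fragment = number of partitions, one split per partition) comes from his post.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;

public class KafkaAvroLoader extends LoadFunc {

    private final String avroRecordClass; // e.g. "com.mycompany.Query"
    private String topic;
    private int numPartitions;

    public KafkaAvroLoader(String avroRecordClass) {
        this.avroRecordClass = avroRecordClass;
    }

    @Override
    public String relativeToAbsolutePath(String location, Path curDir) throws IOException {
        // kafka:// locations are already absolute; skip Pig's HDFS path resolution.
        return location;
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // e.g. kafka://localhost:9092/logs.query#8
        URI uri = URI.create(location);
        topic = uri.getPath().substring(1);                  // path -> topic name
        numPartitions = Integer.parseInt(uri.getFragment()); // fragment -> partition count
        // Stash the details in the job conf so the InputFormat can build
        // one input split per partition, as described in the post.
        job.getConfiguration().set("kafka.broker", uri.getHost() + ":" + uri.getPort());
        job.getConfiguration().set("kafka.topic", topic);
        job.getConfiguration().setInt("kafka.partitions", numPartitions);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // A real loader would return a custom InputFormat that emits one
        // split per Kafka partition; omitted from this sketch.
        return null;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // Attach the record reader for the current partition's split.
    }

    @Override
    public Tuple getNext() throws IOException {
        // Decode the next Kafka message as an Avro record of avroRecordClass
        // and convert it to a Pig tuple; omitted from this sketch.
        return null;
    }
}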