Hey All,

There has been interest in getting something a little more sophisticated than the Input- and OutputFormat we include in contrib for reading Kafka data into HDFS.
Internally at LinkedIn we have had a pretty sophisticated system that we use for Kafka ETL. It automatically discovers topics, does date partitioning, balances load across many topics, etc. We have wanted to open source this for a while but haven't really had time to spend on it.

This code is now open source:
https://github.com/linkedin/camus

Ken Goodhope is the lead for this system. If you have any questions, there is a mailing list here:
camus_...@googlegroups.com

We haven't done much packaging work on this yet, so there isn't much documentation and it takes a bit of work to get set up. It is probably most appropriate for people who would be taking a "white box" approach to the code.

We have had interest from a few groups in contributing, and we are definitely interested in recruiting this kind of help. All our own development going forward will be done off the public GitHub repo, as usual with LinkedIn open source projects.

Until we get better docs up, you can get a pretty good high-level overview of our setup from this paper:
http://sites.computer.org/debull/A12june/pipeline.pdf

-Jay