Hello all, I haven't been reading the list for the past couple of weeks (I've been quite busy...), but I searched and didn't find any discussions related to my current issue, so I thought I'd ask while I continue investigating on my own!
We've been running a Kafka 0.7.0 cluster without problems for a while now. A while ago I played around <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/> with importing data from our Kafka cluster into Hadoop, using the simple Kafka consumer located in the contrib directory of the Kafka source, and that worked properly. At the time, the Hadoop cluster I was running was CDH3u3, IIRC.

I'm now revisiting that project with a brand new CDH4.1.2 Hadoop cluster (using MR1, not YARN), and I'm having difficulty getting it to work. At first, the run-class.sh script in kafka/contrib/hadoop-consumer wasn't using the proper Hadoop jars to connect to my cluster, so I tweaked it so that it includes the output of the `hadoop classpath` command in its classpath (I've pasted the tweak at the bottom of this email). It's now able to connect to my Hadoop cluster, but it's telling me that the versions don't match:

    Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 3
        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        ...

(I can paste the whole stack trace if you want, but I don't think it's really relevant.) This makes sense, I suppose: the contrib build pulls in an old 0.20.x-era hadoop-core jar, and that client can't speak the RPC protocol of a CDH4 (Hadoop 2.x) namenode.

So anyway, I've messed around with the kafka/project/build/KafkaProject.scala file so that it uses the "2.0.0-mr1-cdh4.1.2" version of hadoop-core and fetches it from the Cloudera repo. I added the repo by putting this line at the beginning of the HadoopConsumerProject class section:

    val clouderaRepo = "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

(The full set of changes is also at the bottom of this email.) When I run ./sbt update, it fetches the new jars correctly, but then, when I run ./sbt package, it can't find a bunch of Hadoop-related classes and packages referenced by the hadoop-consumer code, which I guess means that a few APIs have changed between the two versions of CDH.

I've tried this on the 0.7.0 branch of Kafka (from the Apache git repo) as well as on the 0.7.2 branch, and I get the same result on both (I can't successfully run ./sbt package). The easiest thing for me would be to get it working on Kafka 0.7.0, but I guess I could persuade my people to upgrade to 0.7.2 if necessary (I'd like us to upgrade, but I guess you all know how it is... getting a working system to change is a political hassle). I don't think we'd be willing to move to Kafka 0.8 just yet, so hopefully that won't be necessary.

*TL;DR: Is anyone pumping data from Kafka 0.7.x into CDH4.x? If so, how? Using the example consumer from Kafka's contrib, or another one?* Perhaps this one: <https://github.com/miniway/kafka-hadoop-consumer>? (I'll probably give it a try soon, BTW, so I'll keep you all posted.) I may also try porting the hadoop-consumer contrib to CDH4 myself; I've sketched the fallback I have in mind at the bottom of this email.

Finally, I haven't seen anything mentioned about the LinkedIn Kafka/Avro/Hadoop ETL stuff we've been hearing about for a while. I saw the new LinkedIn DataFu stuff, but it seems unrelated. Are there any updates on whether or when that ETL code will be open sourced? As far as we're concerned, we use Avro quite a bit, so in our case the Avro coupling would definitely not be an issue. I don't know what version(s) of Hadoop LinkedIn is running, though, so perhaps their stuff wouldn't work out of the box with CDH4 either.

Any advice would be appreciated! Thanks! :)

--
Felix
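
P.S. For reference, here's roughly the tweak I made to run-class.sh so that it picks up the CDH4 client jars. This is a sketch from memory, assuming the script builds up a CLASSPATH variable before invoking java (check your copy for the exact variable name):

    # Prepend the jars reported by the locally-installed Hadoop client,
    # so the consumer talks to the cluster with matching versions:
    CLASSPATH="$(hadoop classpath):$CLASSPATH"

Just make sure the `hadoop` binary on your PATH is the CDH4 one.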
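
P.P.S. And here are the KafkaProject.scala changes, inside the HadoopConsumerProject class section (the name of the hadoop-core val is from memory and may not match the file exactly):

    // Resolver for the Cloudera artifacts:
    val clouderaRepo = "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

    // Point hadoop-core at the CDH4 MR1 build instead of the stock 0.20.x one:
    val hadoop = "org.apache.hadoop" % "hadoop-core" % "2.0.0-mr1-cdh4.1.2"

This is what gets ./sbt update working; ./sbt package is where it still falls over.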
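
P.P.P.S. In case it clarifies what I mean by "porting it myself", here's the bare-bones fallback I have in mind: skip the contrib job entirely and write a small standalone consumer that fetches from Kafka 0.7 with the SimpleConsumer and writes the raw payloads to HDFS through the CDH4 client jars. This is just a sketch (broker/topic names are placeholders, and there's no error handling or offset checkpointing):

    import kafka.api.FetchRequest
    import kafka.consumer.SimpleConsumer
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object Kafka07ToHdfs {
      def main(args: Array[String]) {
        // Kafka 0.7 SimpleConsumer: host, port, socket timeout (ms), buffer size
        val consumer = new SimpleConsumer("kafka-broker", 9092, 30000, 1024 * 1024)
        // Picks up the CDH4 configs (core-site.xml, etc.) from the classpath
        val fs = FileSystem.get(new Configuration())
        val out = fs.create(new Path("/tmp/kafka-import/part-00000"))

        // Single fetch from partition 0, starting at offset 0
        val messages = consumer.fetch(new FetchRequest("my-topic", 0, 0L, 1024 * 1024))
        for (msgAndOffset <- messages) {
          val payload = msgAndOffset.message.payload
          val bytes = new Array[Byte](payload.remaining())
          payload.get(bytes)
          out.write(bytes)
        }

        out.close()
        consumer.close()
      }
    }

If porting the contrib job turns out to be too painful, something like this in a loop might be good enough for us.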