Pig, Hive, and Brisk are certainly great ways of doing MapReduce with Cassandra. I wrote the patch for CASSANDRA-1497 last fall, and it didn't quite work then. I meant to get back to it, but I've since changed jobs and have been really busy there.
I do like how the patch abstracts the CFIF/CFRR (ColumnFamilyInputFormat/ColumnFamilyRecordReader) so that it could be pluggable with different formats, such as Avro. That would make it easier to plug into dumbo (https://github.com/klbostee/dumbo/wiki/), for instance. It has an abstract parent that does all of the Cassandra heavy lifting, and each child does only the things specific to Avro or another format.

If anyone wants to get the patch working against the current 0.7 branch, I wouldn't mind answering questions about it. I just don't currently have time to rebase it and get it working. I would take only the idea and structure from the patch and use as much of the 0.7-branch code as possible. Done that way, it should be fairly straightforward, as it just adds support for the older mapred package methods and abstracts out the Cassandra-specific bits.

On May 9, 2011, at 3:14 PM, Jonathan Ellis wrote:

> You'll have a lot more luck w/ pig or hive as a high-level hadoop
> client, than python. Certainly until 1470 is done for real.
>
> Brisk does the hadoop-on-cassandra integration for you:
> http://www.datastax.com/docs/0.8/brisk/about_brisk#key-features-of-brisk
>
> On Mon, May 9, 2011 at 2:37 AM, Danhang Tang <da...@zugoservices.com> wrote:
>> Hi all,
>>
>> I've been trying to apply this patch to Cassandra but ran into some errors.
>> https://issues.apache.org/jira/browse/CASSANDRA-1497
>>
>> The comments said it's fixed for version 0.7.1. But I can't directly apply
>> it to this version. So I apply it manually to the java files in hadoop
>> package. Compiling was successful. But then when executing the
>> hadoop_streaming_input
>> I encountered a runtime error:
>>
>> 11/05/06 17:27:21 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
>>
>> packageJobJar: [./bin/../../../interface/avro/cassandra.avpr,
>> ./bin/mapper.py, ./bin/reducer.py,
>> /tmp/hadoop-radfactory/hadoop-unjar8363580286439315517/] []
>> /tmp/streamjob4200946905356051819.jar tmpDir=null
>>
>> 11/05/06 17:27:23 INFO mapreduce.JobSubmitter: Cleaning up the staging area
>> hdfs://client1:9001/tmp/hadoop-root/mapred/staging/radfactory/.staging/job_201105051628_0015
>>
>> Exception in thread "main" java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
>>     at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:138)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:428)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:420)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:534)
>>     at org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:924)
>>     at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:123)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
>>     at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
>>
>> Any ideas?
>>
>> Thanks,
>>
>> Danny
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
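
A minimal sketch of the parent/child structure described in the reply: an abstract parent owns the shared work and each child handles only its own format. This is not the code from the CASSANDRA-1497 patch, and it deliberately uses no Hadoop or Cassandra classes. Every name below (AbstractRowReader, TsvRowReader, BytesRowReader, SketchDemo) is hypothetical, and the in-memory map is only a stand-in for rows read from Cassandra.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parent: owns the work that is identical for every format.
// In the real patch this role would cover the Cassandra-specific logic;
// here it only walks an in-memory map of row key -> columns.
abstract class AbstractRowReader<T> {
    private final Map<String, Map<String, String>> rows;

    AbstractRowReader(Map<String, Map<String, String>> rows) {
        this.rows = rows;
    }

    // Format-specific step: each child decides how a single row is encoded.
    protected abstract T encode(String key, Map<String, String> columns);

    // Shared step: iterate the rows and hand each one to the child's encoder.
    final List<T> readAll() {
        List<T> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : rows.entrySet()) {
            out.add(encode(e.getKey(), e.getValue()));
        }
        return out;
    }
}

// Child for a tab-separated text format, the kind a streaming mapper could read.
class TsvRowReader extends AbstractRowReader<String> {
    TsvRowReader(Map<String, Map<String, String>> rows) {
        super(rows);
    }

    @Override
    protected String encode(String key, Map<String, String> columns) {
        List<String> parts = new ArrayList<>();
        parts.add(key);
        for (Map.Entry<String, String> c : columns.entrySet()) {
            parts.add(c.getKey() + "=" + c.getValue());
        }
        return String.join("\t", parts);
    }
}

// Child producing raw bytes: a placeholder for a binary format such as Avro.
// It encodes only "key:columnCount" to keep the sketch short.
class BytesRowReader extends AbstractRowReader<byte[]> {
    BytesRowReader(Map<String, Map<String, String>> rows) {
        super(rows);
    }

    @Override
    protected byte[] encode(String key, Map<String, String> columns) {
        return (key + ":" + columns.size()).getBytes(StandardCharsets.UTF_8);
    }
}

class SketchDemo {
    // Made-up sample data: one row with two columns, in insertion order.
    static Map<String, Map<String, String>> sampleRows() {
        Map<String, String> cols = new LinkedHashMap<>();
        cols.put("name", "alice");
        cols.put("city", "paris");
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        rows.put("user1", cols);
        return rows;
    }

    public static void main(String[] args) {
        for (String line : new TsvRowReader(sampleRows()).readAll()) {
            System.out.println(line);
        }
    }
}
```

Adding another format under this layout means writing one more small child with its own encode method; the shared iteration in the parent stays untouched.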