Re: About python streaming using Cassandra as input

Jeremy Hanna Mon, 09 May 2011 13:40:43 -0700

Pig, Hive, and Brisk are certainly great ways of doing MapReduce with Cassandra.

I wrote the patch for CASSANDRA-1497 last fall and it didn't quite work then.
I meant to get back to it, but I've since changed jobs and have been really
busy there.

I do like how the patch abstracts the CFIF/CFRR (ColumnFamilyInputFormat and
ColumnFamilyRecordReader) so that it could be pluggable with different formats,
such as Avro.  That would make it easier to plug into dumbo
(https://github.com/klbostee/dumbo/wiki/), for instance.  It has an abstract
parent that does all of the Cassandra heavy lifting, and each child only does
the things specific to its own format, Avro or otherwise.
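
To make that structure concrete, here is a minimal sketch in Python rather than
the patch's Java.  Every name in it (AbstractRowReader, JsonRowReader,
TsvRowReader, the in-memory rows list) is hypothetical and only stands in for
the real classes; nothing here is taken from the patch or from Cassandra's API.
It just illustrates the shape: the parent owns the shared iteration, and each
child supplies one format-specific step.

```python
import json
from abc import ABC, abstractmethod


class AbstractRowReader(ABC):
    """Hypothetical parent: owns the shared "heavy lifting" of walking rows."""

    def __init__(self, rows):
        # `rows` stands in for data fetched from Cassandra:
        # an iterable of (key, {column_name: value}) pairs.
        self._rows = list(rows)

    def records(self):
        # The shared iteration logic lives here once; children never repeat it.
        for key, columns in self._rows:
            yield self.encode(key, columns)

    @abstractmethod
    def encode(self, key, columns):
        """Format-specific step supplied by each child."""


class JsonRowReader(AbstractRowReader):
    """Hypothetical child that emits one JSON document per row."""

    def encode(self, key, columns):
        return json.dumps({"key": key, "columns": columns}, sort_keys=True)


class TsvRowReader(AbstractRowReader):
    """Hypothetical child that emits one tab-separated line per row."""

    def encode(self, key, columns):
        cells = ["%s=%s" % (name, columns[name]) for name in sorted(columns)]
        return "\t".join([key] + cells)


# Made-up sample data, used only to exercise the two children.
rows = [("user1", {"name": "ann", "age": "30"})]
json_records = list(JsonRowReader(rows).records())
tsv_records = list(TsvRowReader(rows).records())
```

With this layout, supporting another format means adding one more small child
class, while the row-walking code in the parent stays untouched.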

If anyone wants to get the patch working against the current 0.7 branch, I
wouldn't mind answering questions about it.  I just don't currently have time
to rebase it and get it working myself.  I would take only the idea and
structure from the patch and use as much of the 0.7-branch code as possible.
Done that way, it should be fairly straightforward, since the patch just adds
support for the older mapred package methods and abstracts out the
Cassandra-specific bits.

On May 9, 2011, at 3:14 PM, Jonathan Ellis wrote:

> You'll have a lot more luck w/ pig or hive as a high-level hadoop
> client, than python.  Certainly until 1470 is done for real.
> 
> Brisk does the hadoop-on-cassandra integration for you:
> http://www.datastax.com/docs/0.8/brisk/about_brisk#key-features-of-brisk
> 
> On Mon, May 9, 2011 at 2:37 AM, Danhang Tang <da...@zugoservices.com> wrote:
>> Hi all,
>> 
>> I've been trying to apply this patch to Cassandra but ran into some errors:
>> https://issues.apache.org/jira/browse/CASSANDRA-1497
>>
>> The comments say it was fixed for version 0.7.1, but I can't apply it
>> directly to that version, so I applied it manually to the Java files in the
>> hadoop package. Compiling was successful, but when executing
>> hadoop_streaming_input I encountered a runtime error:
>> 
>> 11/05/06 17:27:21 WARN conf.Configuration: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
>>
>> packageJobJar: [./bin/../../../interface/avro/cassandra.avpr, ./bin/mapper.py, ./bin/reducer.py, /tmp/hadoop-radfactory/hadoop-unjar8363580286439315517/] [] /tmp/streamjob4200946905356051819.jar tmpDir=null
>>
>> 11/05/06 17:27:23 INFO mapreduce.JobSubmitter: Cleaning up the staging area hdfs://client1:9001/tmp/hadoop-root/mapred/staging/radfactory/.staging/job_201105051628_0015
>>
>> Exception in thread "main" java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
>>     at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:138)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:428)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:420)
>>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
>>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:534)
>>     at org.apache.hadoop.streaming.StreamJob.submitAndMonitorJob(StreamJob.java:924)
>>     at org.apache.hadoop.streaming.StreamJob.run(StreamJob.java:123)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
>>     at org.apache.hadoop.streaming.HadoopStreaming.main(HadoopStreaming.java:50)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
>> 
>> 
>> 
>> Any ideas?
>> 
>> Thanks,
>> 
>> Danny
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
