Anyone ever try to read or write directly between EMR <-> Cassandra?

I'm running various Cassandra resources in EC2, so the "physical
connection" part is pretty easy using security groups.  But I'm having
some configuration issues.  I have managed to get Cassandra + Hadoop
working in the past using a DIY Hadoop cluster, and looking at the
configurations in the two environments (EMR vs. DIY), I'm not sure what's
different that is causing my failures.  I should probably note I'm using
the Pig integration of Cassandra.

Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.

I'm 99% sure I have the classpaths working (I didn't at first, and now
EMR can find and instantiate CassandraStorage on both master and slaves).
What isn't working are the environment variables.  In my DIY cluster, all
I needed to do was:
-------
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
----------
And the task trackers somehow magically picked up the values (I never
questioned how or why).  But in EMR, they do not.  Instead, I get an error
from CassandraStorage on the slaves that the initial address isn't set (the
master is fine).
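For reference, one thing I plan to try (untested so far; the address below
is a placeholder, and this assumes Pig forwards -D properties into the job
configuration) is pushing the variables into the task JVMs' environment via
Hadoop 1.x's mapred.child.env property:

```shell
# Hedged sketch: mapred.child.env (Hadoop 1.x) adds comma-separated
# KEY=VALUE entries to the environment of every task tracker child JVM.
pig \
  -Dmapred.child.env="PIG_INITIAL_ADDRESS=10.0.0.1,PIG_RPC_PORT=9160,PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner" \
  myscript.pig
```

An EMR bootstrap action that appends the three exports to each node's
hadoop-user-env.sh might be another route, though I haven't verified that
task JVMs inherit it.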

My DIY cluster used CDH3, which was Hadoop 0.20.something.  So maybe the
problem is a different version of Hadoop?

Looking at the CassandraStorage class, I realize I have no idea how it used
to work, since it only seems to read environment variables (System.getenv)
and then set the corresponding properties on the Job.getConfiguration
object.  I don't know how that part of Hadoop works, though... do properties
that get set on the Job on the master get propagated to the task threads?
I do know that on my DIY cluster, I do NOT set those environment variables
on the slaves...
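To show what I mean about the environment variables not traveling: a child
process only inherits the environment of the process that forks it, so
exports in my submit shell never reach JVMs forked by a TaskTracker on
another box.  A quick local illustration:

```shell
# Exported variables reach children of *this* shell...
export PIG_INITIAL_ADDRESS=10.0.0.1
bash -c 'echo $PIG_INITIAL_ADDRESS'                    # prints 10.0.0.1

# ...but a process started with a clean environment (roughly what a task
# JVM forked by a TaskTracker on another machine sees) gets nothing:
env -i bash -c 'echo ${PIG_INITIAL_ADDRESS:-unset}'    # prints unset
```

Job Configuration properties, by contrast, are serialized with the job
itself, which would explain why setting them on the master could work where
exports don't.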

Thanks!

will
