Has anyone ever tried to read or write directly between EMR <-> Cassandra? I'm running various Cassandra resources in EC2, so the "physical connection" part is pretty easy using security groups. But I'm having some configuration issues. I have managed to get Cassandra + Hadoop working in the past using a DIY Hadoop cluster, and looking at the configurations in the two environments (EMR vs DIY), I'm not sure what difference is causing my failures... I should probably note I'm using the Pig integration of Cassandra.
Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.

I'm 99% sure I have classpaths working (because I didn't at first, and now EMR can find and instantiate CassandraStorage on master and slaves). What isn't working are the system (environment) variables. In my DIY cluster, all I needed to do was:

-------
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
----------

And the task trackers somehow magically picked up the values (I never questioned how/why). But in EMR, they do not. Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave; the master is OK).

My DIY cluster used CDH3, which was Hadoop 0.20.something. So maybe the problem is a different version of Hadoop?

Looking at the CassandraStorage class, I realize I have no idea how it used to work, since it only seems to look at system variables. Those variables are set on the Job.getConfiguration object. I don't know how that part of Hadoop works though... do variables that get set on the Job on the master get propagated to the task threads? I do know that on my DIY cluster, I do NOT set those system variables on the slaves...

Thanks!
will
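
For reference, here is a minimal sketch of the environment-variable behaviour the question turns on. It only illustrates plain shell inheritance and says nothing about what Hadoop, EMR or CassandraStorage actually do. The variable names and values are the ones from the export lines above. Using `env -i` to stand in for a task process started elsewhere with a fresh environment is an assumption made purely for illustration.

```shell
#!/bin/sh
# Same exports as in the DIY cluster setup (XXX is the placeholder from the post).
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160

# A child process started from this shell inherits the exported variables.
inherited=$(/bin/sh -c 'echo "${PIG_RPC_PORT:-unset}"')

# A process started with a clean environment does not see them.
# (Assumption: env -i is only a stand-in for a task process launched
# by a daemon that never saw these exports.)
clean=$(env -i /bin/sh -c 'echo "${PIG_RPC_PORT:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

If the slaves behave like the second case, the values would have to reach the tasks some other way, for example through the job configuration rather than the environment.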