So I've made it work, but I don't "get it" yet. I have no idea why my DIY cluster works when I set the environment variables on the machine that kicks off Pig ("master"), while in EMR it doesn't. I recompiled ConfigHelper and CassandraStorage with extra debug logging, and in EMR I can see the Hadoop Configuration object receive the proper values on the master node, but I can also see that they do NOT propagate to the task JVMs.
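To make the failure concrete: this is roughly what CassandraStorage's setStoreLocation appears to do (my reconstruction from reading the 1.1.7 source while debugging, not the exact code). It consults the *task* JVM's environment, so a variable exported only on the submitting "master" box is invisible to it unless something carries the value across:

```java
import java.io.IOException;
import java.util.Map;

public class InitialAddressCheck {
    // Sketch of the env-var lookup: falls back from the output-specific
    // variable to the generic one, and fails with the exact error I see on
    // the EMR task trackers when neither is set in that JVM's environment.
    static String initialAddress(Map<String, String> env) throws IOException {
        String addr = env.get("PIG_OUTPUT_INITIAL_ADDRESS");
        if (addr == null) {
            addr = env.get("PIG_INITIAL_ADDRESS");
        }
        if (addr == null) {
            throw new IOException(
                "PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set");
        }
        return addr;
    }
}
```

Passing System.getenv() into that on a task tracker where only the master exported PIG_INITIAL_ADDRESS reproduces the IOException in the stack trace below.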
The other part that was driving me nuts could be made more user friendly. I started by trying to set cassandra.thrift.address, cassandra.thrift.port, and cassandra.partitioner.class in mapred-site.xml, and it didn't work. After even more painful debugging, I noticed that the only time Cassandra sets the input/output versions of those settings (and those input/output-specific versions are the only ones actually used!) is when it maps the system environment variables. So having cassandra.thrift.address in mapred-site.xml does NOTHING; what I actually needed was cassandra.output.thrift.address. It would be much nicer if each get{Input/Output}XYZ method fell back to getXYZ when the input/output-specific setting is empty/null. For example, if the setting behind getOutputThriftAddress() is null, it would have been nice if that method returned getThriftAddress() instead. My problem went away when I put the full cross product in the XML: cassandra.input.thrift.address and cassandra.output.thrift.address (and likewise for the port and partitioner). I still want to know why the old easy way (setting the 3 environment variables on the box that starts Pig, and having the config flow into the task trackers) doesn't work!
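For anyone else who hits this, the mapred-site.xml "full cross product" that made it work for me looked like the following. The address keys are the ones I confirmed in the debug output; the port and partitioner key names are my reading of ConfigHelper (they follow the same input/output pattern), and XXX stands in for my cluster's address:

```xml
<!-- Only the input/output-specific keys are read; the generic
     cassandra.thrift.address et al. are silently ignored. -->
<property>
  <name>cassandra.input.thrift.address</name>
  <value>XXX</value>
</property>
<property>
  <name>cassandra.output.thrift.address</name>
  <value>XXX</value>
</property>
<property>
  <name>cassandra.input.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.output.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.input.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
<property>
  <name>cassandra.output.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
```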
will

On Fri, Jan 4, 2013 at 9:04 AM, William Oberman <ober...@civicscience.com> wrote:

> On all tasktrackers, I see:
>
> java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set
>         at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
>         at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> On Thu, Jan 3, 2013 at 10:45 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> Instead, I get an error from CassandraStorage that the initial address
>> isn't set (on the slave, the master is ok).
>>
>> Can you post the full error?
>>
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 4/01/2013, at 11:15 AM, William Oberman <ober...@civicscience.com> wrote:
>>
>> Anyone ever try to read or write directly between EMR <-> Cassandra?
>>
>> I'm running various Cassandra resources in EC2, so the "physical
>> connection" part is pretty easy using security groups. But I'm having
>> some configuration issues. I have managed to get Cassandra + Hadoop
>> working in the past using a DIY Hadoop cluster, and looking at the
>> configurations in the two environments (EMR vs DIY), I'm not sure what's
>> different that is causing my failures... I should probably note I'm using
>> the Pig integration of Cassandra.
>>
>> Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.
>>
>> I'm 99% sure I have classpaths working (because I didn't at first, and
>> now EMR can find and instantiate CassandraStorage on master and slaves).
>> What isn't working are the system variables. In my DIY cluster, all I
>> needed to do was:
>> -------
>> export PIG_INITIAL_ADDRESS=XXX
>> export PIG_RPC_PORT=9160
>> export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
>> -------
>> And the task trackers somehow magically picked up the values (I never
>> questioned how/why). But in EMR, they do not. Instead, I get an error
>> from CassandraStorage that the initial address isn't set (on the slave;
>> the master is ok).
>>
>> My DIY cluster used CDH3, which was Hadoop 0.20.something. So maybe the
>> problem is a different version of Hadoop?
>>
>> Looking at the CassandraStorage class, I realize I have no idea how it
>> used to work, since it only seems to look at system variables. Those
>> variables are set on the Job.getConfiguration object. I don't know how
>> that part of Hadoop works, though... do variables that get set on the Job
>> on the master get propagated to the task threads? I do know that on my
>> DIY cluster, I do NOT set those system variables on the slaves...
>>
>> Thanks!
>>
>> will