William, I just saw your message today. I am using Cassandra + Amazon EMR (Hadoop 1.0.3), but I am not using Pig as you are. I set my configuration vars in Java, as I have a custom jar file and I am using ColumnFamilyInputFormat. However, if I understood your problem correctly, the only thing you have to do is set the environment vars when running cluster tasks, right?

Take a look at this link: http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/

As it shows, you can start EMR with command line arguments that specify a script to be executed on each machine in the cluster before the job starts. That way, you would be able to correctly set the vars you need on every node.

Out of curiosity, could you share what you are using for Cassandra storage? I am currently using EC2 local disks, but I am looking for an alternative.
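In case it is useful, here is roughly what that Java setup looks like. Treat it as a minimal sketch against the Cassandra 1.1 ConfigHelper API: the host, keyspace, and column family names are placeholders, and a real job would still need a mapper, a slice predicate, and so on. Note the input/output cross product; these setters write the cassandra.input.* and cassandra.output.* properties that come up below:
-------
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobSetup {
    public static Job configure() throws Exception {
        Job job = new Job(new Configuration(), "cassandra-emr-job");
        Configuration conf = job.getConfiguration();

        // Input side: writes the cassandra.input.* properties
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1"); // placeholder host
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily"); // placeholders

        // Output side: must be set separately, the input values do NOT carry over
        ConfigHelper.setOutputInitialAddress(conf, "10.0.0.1"); // placeholder host
        ConfigHelper.setOutputRpcPort(conf, "9160");
        ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");

        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        return job;
    }
}
-------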
Best regards,
Marcelo.

2013/1/4 William Oberman <ober...@civicscience.com>

> So I've made it work, but I don't "get it" yet.
>
> I have no idea why my DIY server works when I set the environment variables on the machine that kicks off Pig ("master"), while in EMR it doesn't. I recompiled ConfigHelper and CassandraStorage with tons of debugging, and in EMR I can see the Hadoop Configuration object get the proper values on the master node, and I can see that they do NOT propagate to the task threads.
>
> The other part that was driving me nuts could be made more user friendly. The issue is this: I started by trying to set cassandra.thrift.address, cassandra.thrift.port, and cassandra.partitioner.class in mapred-site.xml, and it didn't work. After even more painful debugging, I noticed that the only time Cassandra sets the input/output versions of those settings (and these input/output-specific versions are the only versions really used!) is when it maps the system environment variables. So having cassandra.thrift.address in mapred-site.xml does NOTHING; I needed to have cassandra.output.thrift.address set. It would be much nicer if get{Input/Output}XYZ checked for getXYZ when the input/output-specific setting is empty/null. E.g., if its setting is null, it would have been nice if getOutputThriftAddress() returned getThriftAddress(). My problem went away when I put the full cross product in the XML: cassandra.input.thrift.address AND cassandra.output.thrift.address (and likewise for port and partitioner).
>
> I still want to know why the old easy way (setting the 3 system variables on the Pig starter box and having the config flow into the task trackers) doesn't work!
>
> will
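For illustration, the fallback William suggests above could look roughly like the following. This is a hypothetical sketch against the Hadoop Configuration API, not the actual CassandraStorage/ConfigHelper code; the property names are the ones discussed above:
-------
// Hypothetical: fall back to the generic setting when the
// output-specific one is absent.
public static String getOutputThriftAddress(org.apache.hadoop.conf.Configuration conf) {
    String address = conf.get("cassandra.output.thrift.address");
    if (address == null || address.isEmpty())
        return conf.get("cassandra.thrift.address");
    return address;
}
-------
With a fallback like that, setting only cassandra.thrift.address in mapred-site.xml would have been enough.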
> On Fri, Jan 4, 2013 at 9:04 AM, William Oberman <ober...@civicscience.com> wrote:
>
>> On all tasktrackers, I see:
>> java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set
>> at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67)
>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
>> at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
>> at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:396)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
>> at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> On Thu, Jan 3, 2013 at 10:45 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave, the master is ok).
>>>
>>> Can you post the full error?
>>>
>>> Cheers
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> New Zealand
>>>
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 4/01/2013, at 11:15 AM, William Oberman <ober...@civicscience.com> wrote:
>>>
>>> Anyone ever try to read or write directly between EMR <-> Cassandra?
>>>
>>> I'm running various Cassandra resources in EC2, so the "physical connection" part is pretty easy using security groups. But I'm having some configuration issues. I have managed to get Cassandra + Hadoop working in the past using a DIY Hadoop cluster, and looking at the configurations in the two environments (EMR vs DIY), I'm not sure what's different that is causing my failures... I should probably note I'm using the Pig integration of Cassandra.
>>>
>>> Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.
>>>
>>> I'm 99% sure I have classpaths working (because I didn't at first, and now EMR can find and instantiate CassandraStorage on master and slaves). What isn't working are the system variables. In my DIY cluster, all I needed to do was:
>>> -------
>>> export PIG_INITIAL_ADDRESS=XXX
>>> export PIG_RPC_PORT=9160
>>> export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
>>> ----------
>>> And the task trackers somehow magically picked up the values (I never questioned how/why). But in EMR, they do not. Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave; the master is ok).
>>>
>>> My DIY cluster used CDH3, which was Hadoop 0.20.something. So maybe the problem is a different version of Hadoop?
>>>
>>> Looking at the CassandraStorage class, I realize I have no idea how it used to work, since it only seems to look at system variables. Those variables are set on the Job.getConfiguration object. I don't know how that part of Hadoop works, though... do variables that get set on the Job on the master get propagated to the task threads? I do know that on my DIY cluster I do NOT set those system variables on the slaves...
>>>
>>> Thanks!
>>>
>>> will

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr