DataStax recommended (I forget the reference) using the ephemeral disks in
RAID0, which is what I've been running for well over a year now in
production.
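The RAID0 part is nothing exotic, just mdadm across the instance stores,
roughly like this (device names vary by instance type, so xvdb/xvdc and the
mount point here are only examples):

-------
# Stripe the two ephemeral disks into a single RAID0 volume
# (check what devices your instance type actually exposes)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0
# Mount wherever cassandra.yaml points its data directories
mount /dev/md0 /var/lib/cassandra
-------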
In terms of how I'm doing Cassandra/AWS/Hadoop: I started by doing the
split data center thing (one DC for low latency queries, one DC for
Hadoop). But that's a lot of system management, compute is the most
expensive part of AWS, and you need a LOT of compute to run this setup.

I then tried Cassandra EC2 cluster -> snapshot -> clone cluster with
Hadoop overlay -> ETL to S3 using Hadoop -> EMR for the real work. But
that's kind of a pain too (and the ETL to S3 wasn't very fast).

Now I'm going after the SSTables directly(*), which sounds like how
Netflix does it. You can do incremental updates, if you're careful.

(*) Cassandra EC2 -> backup to "local" EBS -> remap EBS to another box ->
sstable2json over the "new" sstables -> S3 (splitting into ~100MB parts),
then use EMR to consume the JSON part files.
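The dump/split/upload step is a small script, roughly like this (the
paths, keyspace/CF names, bucket, and the s3cmd upload are all just
examples of whatever you'd actually use, and you need a little care so
that each part file stays independently parseable):

-------
# On the box with the remapped EBS volume: dump the "new" sstables
# to JSON (sstable2json writes to stdout)
for sst in /mnt/backup/MyKeyspace/MyCF/*-Data.db; do
    sstable2json "$sst" >> /mnt/json/mycf.json
done

# Split at line boundaries into ~100MB parts, then push to S3
split -C 100m /mnt/json/mycf.json /mnt/json/part-
for p in /mnt/json/part-*; do
    s3cmd put "$p" "s3://my-bucket/cassandra-json/$(basename "$p")"
done
-------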
will

On Wed, Jan 16, 2013 at 3:30 PM, Marcelo Elias Del Valle
<mvall...@gmail.com> wrote:

> William,
>
> I just saw your message today. I am using Cassandra + Amazon EMR (hadoop
> 1.0.3), but I am not using Pig as you are. I set my configuration vars
> in Java, as I have a custom jar file and I am using
> ColumnFamilyInputFormat.
> However, if I understood your problem correctly, the only thing you have
> to do is set the environment vars when running cluster tasks, right?
> Take a look at this link:
> http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
> As it shows, you can run EMR with command line arguments that specify a
> script to be executed before the job starts, on each machine in the
> cluster. This way, you would be able to correctly set the vars you need.
> Out of curiosity, could you share what you are using for Cassandra
> storage? I am currently using EC2 local disks, but I am looking for an
> alternative.
>
> Best regards,
> Marcelo.
>
> 2013/1/4 William Oberman <ober...@civicscience.com>
>
>> So I've made it work, but I don't "get it" yet.
>>
>> I have no idea why my DIY server works when I set the environment
>> variables on the machine that kicks off pig ("master"), while in EMR it
>> doesn't. I recompiled ConfigHelper and CassandraStorage with tons of
>> debugging, and in EMR I can see the hadoop Configuration object get the
>> proper values on the master node, and I can see it does NOT propagate
>> to the task threads.
>>
>> The other part that was driving me nuts could be made more user
>> friendly. The issue is this: I started by trying to set
>> cassandra.thrift.address, cassandra.thrift.port, and
>> cassandra.partitioner.class in mapred-site.xml, and it didn't work.
>> After even more painful debugging, I noticed that the only time
>> Cassandra sets the input/output versions of those settings (and these
>> input/output specific versions are the only versions really used!) is
>> when Cassandra maps the system environment variables. So, having
>> cassandra.thrift.address in mapred-site.xml does NOTHING; I needed to
>> have cassandra.output.thrift.address set. It would be much nicer if
>> get{Input/Output}XYZ fell back to getXYZ when the {Input/Output}
>> version is empty/null. E.g. in getOutputThriftAddress(), if that
>> setting is null, it would have been nice if that method returned
>> getThriftAddress(). My problem went away when I put the full cross
>> product in the XML, e.g. cassandra.input.thrift.address and
>> cassandra.output.thrift.address (and port, and partitioner).
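>>
>> For reference, the relevant chunk of my mapred-site.xml ended up
>> looking roughly like this (the address is made up, and I'm showing the
>> pattern rather than my exact file):
>>
>> -------
>> <!-- input side (10.0.0.10 is a made-up address) -->
>> <property>
>>   <name>cassandra.input.thrift.address</name>
>>   <value>10.0.0.10</value>
>> </property>
>> <property>
>>   <name>cassandra.input.thrift.port</name>
>>   <value>9160</value>
>> </property>
>> <property>
>>   <name>cassandra.input.partitioner.class</name>
>>   <value>org.apache.cassandra.dht.RandomPartitioner</value>
>> </property>
>> <!-- output side: the same three again -->
>> <property>
>>   <name>cassandra.output.thrift.address</name>
>>   <value>10.0.0.10</value>
>> </property>
>> <property>
>>   <name>cassandra.output.thrift.port</name>
>>   <value>9160</value>
>> </property>
>> <property>
>>   <name>cassandra.output.partitioner.class</name>
>>   <value>org.apache.cassandra.dht.RandomPartitioner</value>
>> </property>
>> -------
>>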
>> I still want to know why the old easy way (setting the 3 system
>> variables on the pig starter box, and having the config flow into the
>> task trackers) doesn't work!
>>
>> will
>>
>> On Fri, Jan 4, 2013 at 9:04 AM, William Oberman
>> <ober...@civicscience.com> wrote:
>>
>>> On all tasktrackers, I see:
>>> java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS
>>> environment variable not set
>>>     at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
>>>     at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>
>>> On Thu, Jan 3, 2013 at 10:45 PM, aaron morton
>>> <aa...@thelastpickle.com> wrote:
>>>
>>>> Instead, I get an error from CassandraStorage that the initial
>>>> address isn't set (on the slave, the master is ok).
>>>>
>>>> Can you post the full error?
>>>>
>>>> Cheers
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> New Zealand
>>>>
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 4/01/2013, at 11:15 AM, William Oberman
>>>> <ober...@civicscience.com> wrote:
>>>>
>>>> Anyone ever try to read or write directly between EMR <-> Cassandra?
>>>>
>>>> I'm running various Cassandra resources in EC2, so the "physical
>>>> connection" part is pretty easy using security groups. But I'm having
>>>> some configuration issues. I have managed to get Cassandra + Hadoop
>>>> working in the past using a DIY hadoop cluster, and looking at the
>>>> configurations in the two environments (EMR vs DIY), I'm not sure
>>>> what's different that is causing my failures... I should probably
>>>> note I'm using the Pig integration of Cassandra.
>>>>
>>>> Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.
>>>>
>>>> I'm 99% sure I have classpaths working (because I didn't at first,
>>>> and now EMR can find and instantiate CassandraStorage on master and
>>>> slaves). What isn't working are the system variables. In my DIY
>>>> cluster, all I needed to do was:
>>>> -------
>>>> export PIG_INITIAL_ADDRESS=XXX
>>>> export PIG_RPC_PORT=9160
>>>> export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
>>>> -------
>>>> And the task trackers somehow magically picked up the values (I never
>>>> questioned how/why). But in EMR, they do not. Instead, I get an error
>>>> from CassandraStorage that the initial address isn't set (on the
>>>> slave; the master is ok).
>>>>
>>>> My DIY cluster used CDH3, which was hadoop 0.20.something. So maybe
>>>> the problem is a different version of hadoop?
>>>>
>>>> Looking at the CassandraStorage class, I realize I have no idea how
>>>> it used to work, since it only seems to look at System variables.
>>>> Those variables are set on the Job.getConfiguration object. I don't
>>>> know how that part of hadoop works, though... do variables that get
>>>> set on Job on the master get propagated to the task threads? I do
>>>> know that on my DIY cluster, I do NOT set those system variables on
>>>> the slaves...
>>>>
>>>> Thanks!
>>>>
>>>> will
>>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr