That's good info! Thanks!

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


2013/1/16 William Oberman <ober...@civicscience.com>

DataStax recommended (I forget the reference) using the ephemeral disks
in RAID0, which is what I've been running for well over a year now in
production.

In terms of how I'm doing Cassandra/AWS/Hadoop, I started by doing the
split data center thing (one DC for low latency queries, one DC for
Hadoop). But that's a lot of system management, and compute is the most
expensive part of AWS; you need a LOT of compute to run this setup. I
tried doing Cassandra EC2 cluster -> snapshot -> clone cluster with
Hadoop overlay -> ETL to S3 using Hadoop -> EMR for the real work. But
that's kind of a pain too (and the ETL to S3 wasn't very fast).

Now I'm going after the SSTables directly(*), which sounds like how
Netflix does it. You can do incremental updates, if you're careful.

(*) Cassandra EC2 -> backup to "local" EBS -> remap EBS to another box ->
sstable2json over "new" sstables -> S3 (splitting into ~100MB parts),
then use EMR to consume the JSON part files.
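A rough sketch of that (*) step; the keyspace, column family, paths, and
bucket below are placeholders, and s3cmd stands in for whatever S3 client
you use (sstable2json itself ships with Cassandra):

-------
#!/bin/bash
# Dump sstables to JSON, split into ~100MB parts, and push to S3.
KEYSPACE=MyKeyspace                      # placeholder
CF=MyColumnFamily                        # placeholder
SRC=/mnt/ebs/cassandra/data/$KEYSPACE    # the remapped EBS volume
OUT=/mnt/work/json

mkdir -p "$OUT"
find "$SRC" -name "$CF-*-Data.db" | while read -r sstable; do
  base=$(basename "$sstable" -Data.db)
  # sstable2json writes the JSON for one sstable to stdout
  sstable2json "$sstable" > "$OUT/$base.json"
  # split into ~100MB parts so EMR mappers get evenly sized inputs
  split -b 100M "$OUT/$base.json" "$OUT/$base.part-"
  rm "$OUT/$base.json"
done

# upload the part files (s3cmd is just one option)
s3cmd put "$OUT"/*.part-* s3://my-bucket/cassandra-json/
-------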
will


On Wed, Jan 16, 2013 at 3:30 PM, Marcelo Elias Del Valle
<mvall...@gmail.com> wrote:

William,

I just saw your message today. I am using Cassandra + Amazon EMR (Hadoop
1.0.3), but I am not using Pig as you are. I set my configuration vars in
Java, as I have a custom jar file and I am using ColumnFamilyInputFormat.

However, if I understood your problem well, the only thing you have to do
is set environment vars when running cluster tasks, right? Take a look at
this link:
http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
As it shows, you can start EMR with command line arguments that specify a
script to be executed on each machine in the cluster before the job
starts. This way, you would be able to correctly set the vars you need.
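For example, a bootstrap action along these lines would export the three
variables on every node before the task trackers start (a sketch; the
hadoop-user-env.sh path is the per-node env hook on the standard EMR
Hadoop AMIs, and XXX is a placeholder for the Cassandra host):

-------
#!/bin/bash
# EMR bootstrap action: runs on every node (master and slaves) at startup.
# Appending to hadoop-user-env.sh lets the task trackers inherit the vars.
cat >> /home/hadoop/conf/hadoop-user-env.sh <<'EOF'
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
EOF
-------

You would upload the script to S3 and pass its location with
--bootstrap-action when creating the job flow.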
Out of curiosity, could you share what you are using for Cassandra
storage? I am currently using EC2 local disks, but I am looking for an
alternative.

Best regards,
Marcelo.


2013/1/4 William Oberman <ober...@civicscience.com>

So I've made it work, but I don't "get it" yet.

I have no idea why my DIY cluster works when I set the environment
variables on the machine that kicks off pig ("master"), and in EMR it
doesn't. I recompiled ConfigHelper and CassandraStorage with tons of
debugging, and in EMR I can see the Hadoop Configuration object get the
proper values on the master node, and I can see that it does NOT
propagate to the task threads.

The other part that was driving me nuts could be made more user friendly.
The issue is this: I started by trying to set cassandra.thrift.address,
cassandra.thrift.port, and cassandra.partitioner.class in mapred-site.xml,
and it didn't work. After even more painful debugging, I noticed that the
only time Cassandra sets the input/output versions of those settings (and
the input/output-specific versions are the only versions really used!) is
when it maps the system environment variables. So having
cassandra.thrift.address in mapred-site.xml does NOTHING; I needed to have
cassandra.output.thrift.address set. It would be much nicer if the
get{Input/Output}XYZ methods fell back to getXYZ when the
input/output-specific setting is empty/null. E.g. in
getOutputThriftAddress(), if that setting is null, it would have been
nice if the method returned getThriftAddress().

My problem went away when I put the full cross product in the XML, e.g.
cassandra.input.thrift.address and cassandra.output.thrift.address (and
port, and partitioner).
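Concretely, that full cross product in mapred-site.xml looks something
like this (10.0.0.1 stands in for the Cassandra host; the property names
are the ones Cassandra's ConfigHelper reads):

-------
<!-- 10.0.0.1 is a placeholder for the Cassandra host -->
<property>
  <name>cassandra.input.thrift.address</name>
  <value>10.0.0.1</value>
</property>
<property>
  <name>cassandra.output.thrift.address</name>
  <value>10.0.0.1</value>
</property>
<property>
  <name>cassandra.input.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.output.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.input.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
<property>
  <name>cassandra.output.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
-------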
I still want to know why the old easy way (setting the 3 system variables
on the pig starter box and having the config flow into the task trackers)
doesn't work!

will


On Fri, Jan 4, 2013 at 9:04 AM, William Oberman
<ober...@civicscience.com> wrote:

On all tasktrackers, I see:

java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set
    at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


On Thu, Jan 3, 2013 at 10:45 PM, aaron morton <aa...@thelastpickle.com>
wrote:

> Instead, I get an error from CassandraStorage that the initial address
> isn't set (on the slave, the master is ok).

Can you post the full error?

Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/01/2013, at 11:15 AM, William Oberman <ober...@civicscience.com>
wrote:

Anyone ever try to read or write directly between EMR <-> Cassandra?

I'm running various Cassandra resources in EC2, so the "physical
connection" part is pretty easy using security groups. But I'm having
some configuration issues. I have managed to get Cassandra + Hadoop
working in the past using a DIY Hadoop cluster, and looking at the
configurations in the two environments (EMR vs DIY), I'm not sure what's
different that is causing my failures... I should probably note I'm using
the Pig integration of Cassandra.

Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.

I'm 99% sure I have the classpaths working (because I didn't at first,
and now EMR can find and instantiate CassandraStorage on master and
slaves). What isn't working are the system variables. In my DIY cluster,
all I needed to do was:
-------
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
-------
And the task trackers somehow magically picked up the values (I never
questioned how/why). But in EMR, they do not. Instead, I get an error
from CassandraStorage that the initial address isn't set (on the slave;
the master is ok).

My DIY cluster used CDH3, which was Hadoop 0.20.something. So maybe the
problem is a different version of Hadoop?

Looking at the CassandraStorage class, I realize I have no idea how it
used to work, since it only seems to look at system variables. Those
variables are set on the Job.getConfiguration object. I don't know how
that part of Hadoop works, though... do variables that get set on the Job
on the master get propagated to the task threads? I do know that on my
DIY cluster I do NOT set those system variables on the slaves...

Thanks!

will