DataStax recommended (I forget the reference) using the ephemeral disks in
RAID0, which is what I've been running for well over a year now in
production.
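The RAID0 part is nothing exotic, just mdadm across the instance stores,
roughly like this (device names vary by instance type, so xvdb/xvdc and the
mount point here are only examples):

-------
# Stripe the two ephemeral disks into a single RAID0 volume
# (check what devices your instance type actually exposes)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0
# Mount wherever cassandra.yaml points its data directories
mount /dev/md0 /var/lib/cassandra
-------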
In terms of how I'm doing Cassandra/AWS/Hadoop: I started by doing the
split data center thing (one DC for low latency queries, one DC for
Hadoop). But that's a lot of system management, compute is the most
expensive part of AWS, and you need a LOT of compute to run this setup.

I then tried Cassandra EC2 cluster -> snapshot -> clone cluster with
Hadoop overlay -> ETL to S3 using Hadoop -> EMR for the real work. But
that's kind of a pain too (and the ETL to S3 wasn't very fast).

Now I'm going after the SSTables directly(*), which sounds like how
Netflix does it. You can do incremental updates, if you're careful.

(*) Cassandra EC2 -> backup to "local" EBS -> remap EBS to another box ->
sstable2json over the "new" sstables -> S3 (splitting into ~100MB parts),
then use EMR to consume the JSON part files.
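The dump/split/upload step is a small script, roughly like this (the
paths, keyspace/CF names, bucket, and the s3cmd upload are all just
examples of whatever you'd actually use, and you need a little care so
that each part file stays independently parseable):

-------
# On the box with the remapped EBS volume: dump the "new" sstables
# to JSON (sstable2json writes to stdout)
for sst in /mnt/backup/MyKeyspace/MyCF/*-Data.db; do
    sstable2json "$sst" >> /mnt/json/mycf.json
done

# Split at line boundaries into ~100MB parts, then push to S3
split -C 100m /mnt/json/mycf.json /mnt/json/part-
for p in /mnt/json/part-*; do
    s3cmd put "$p" "s3://my-bucket/cassandra-json/$(basename "$p")"
done
-------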
will

On Wed, Jan 16, 2013 at 3:30 PM, Marcelo Elias Del Valle
<mvall...@gmail.com> wrote:

> William,
>
> I just saw your message today. I am using Cassandra + Amazon EMR (hadoop
> 1.0.3), but I am not using Pig as you are. I set my configuration vars
> in Java, as I have a custom jar file and I am using
> ColumnFamilyInputFormat.
> However, if I understood your problem correctly, the only thing you have
> to do is set the environment vars when running cluster tasks, right?
> Take a look at this link:
> http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
> As it shows, you can run EMR with command line arguments that specify a
> script to be executed before the job starts, on each machine in the
> cluster. This way, you would be able to correctly set the vars you need.
> Out of curiosity, could you share what you are using for Cassandra
> storage? I am currently using EC2 local disks, but I am looking for an
> alternative.
>
> Best regards,
> Marcelo.
>
> 2013/1/4 William Oberman <ober...@civicscience.com>
>
>> So I've made it work, but I don't "get it" yet.
>>
>> I have no idea why my DIY server works when I set the environment
>> variables on the machine that kicks off pig ("master"), while in EMR it
>> doesn't. I recompiled ConfigHelper and CassandraStorage with tons of
>> debugging, and in EMR I can see the hadoop Configuration object get the
>> proper values on the master node, and I can see it does NOT propagate
>> to the task threads.
>>
>> The other part that was driving me nuts could be made more user
>> friendly. The issue is this: I started by trying to set
>> cassandra.thrift.address, cassandra.thrift.port, and
>> cassandra.partitioner.class in mapred-site.xml, and it didn't work.
>> After even more painful debugging, I noticed that the only time
>> Cassandra sets the input/output versions of those settings (and these
>> input/output specific versions are the only versions really used!) is
>> when Cassandra maps the system environment variables. So, having
>> cassandra.thrift.address in mapred-site.xml does NOTHING; I needed to
>> have cassandra.output.thrift.address set. It would be much nicer if
>> get{Input/Output}XYZ fell back to getXYZ when the {Input/Output}
>> version is empty/null. E.g. in getOutputThriftAddress(), if that
>> setting is null, it would have been nice if that method returned
>> getThriftAddress(). My problem went away when I put the full cross
>> product in the XML, e.g. cassandra.input.thrift.address and
>> cassandra.output.thrift.address (and port, and partitioner).
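>>
>> For reference, the relevant chunk of my mapred-site.xml ended up
>> looking roughly like this (the address is made up, and I'm showing the
>> pattern rather than my exact file):
>>
>> -------
>> <!-- input side (10.0.0.10 is a made-up address) -->
>> <property>
>>   <name>cassandra.input.thrift.address</name>
>>   <value>10.0.0.10</value>
>> </property>
>> <property>
>>   <name>cassandra.input.thrift.port</name>
>>   <value>9160</value>
>> </property>
>> <property>
>>   <name>cassandra.input.partitioner.class</name>
>>   <value>org.apache.cassandra.dht.RandomPartitioner</value>
>> </property>
>> <!-- output side: the same three again -->
>> <property>
>>   <name>cassandra.output.thrift.address</name>
>>   <value>10.0.0.10</value>
>> </property>
>> <property>
>>   <name>cassandra.output.thrift.port</name>
>>   <value>9160</value>
>> </property>
>> <property>
>>   <name>cassandra.output.partitioner.class</name>
>>   <value>org.apache.cassandra.dht.RandomPartitioner</value>
>> </property>
>> -------
>>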
>> I still want to know why the old easy way (setting the 3 system
>> variables on the pig starter box, and having the config flow into the
>> task trackers) doesn't work!
>>
>> will
>>
>> On Fri, Jan 4, 2013 at 9:04 AM, William Oberman
>> <ober...@civicscience.com> wrote:
>>
>>> On all tasktrackers, I see:
>>> java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS
>>> environment variable not set
>>>     at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
>>>     at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
>>>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:396)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
>>>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>
>>> On Thu, Jan 3, 2013 at 10:45 PM, aaron morton
>>> <aa...@thelastpickle.com> wrote:
>>>
>>>> Instead, I get an error from CassandraStorage that the initial
>>>> address isn't set (on the slave, the master is ok).
>>>>
>>>> Can you post the full error?
>>>>
>>>> Cheers
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> New Zealand
>>>>
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 4/01/2013, at 11:15 AM, William Oberman
>>>> <ober...@civicscience.com> wrote:
>>>>
>>>> Anyone ever try to read or write directly between EMR <-> Cassandra?
>>>>
>>>> I'm running various Cassandra resources in EC2, so the "physical
>>>> connection" part is pretty easy using security groups. But I'm having
>>>> some configuration issues. I have managed to get Cassandra + Hadoop
>>>> working in the past using a DIY hadoop cluster, and looking at the
>>>> configurations in the two environments (EMR vs DIY), I'm not sure
>>>> what's different that is causing my failures... I should probably
>>>> note I'm using the Pig integration of Cassandra.
>>>>
>>>> Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.
>>>>
>>>> I'm 99% sure I have classpaths working (because I didn't at first,
>>>> and now EMR can find and instantiate CassandraStorage on master and
>>>> slaves). What isn't working are the system variables. In my DIY
>>>> cluster, all I needed to do was:
>>>> -------
>>>> export PIG_INITIAL_ADDRESS=XXX
>>>> export PIG_RPC_PORT=9160
>>>> export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
>>>> -------
>>>> And the task trackers somehow magically picked up the values (I never
>>>> questioned how/why). But in EMR, they do not. Instead, I get an error
>>>> from CassandraStorage that the initial address isn't set (on the
>>>> slave; the master is ok).
>>>>
>>>> My DIY cluster used CDH3, which was hadoop 0.20.something. So maybe
>>>> the problem is a different version of hadoop?
>>>>
>>>> Looking at the CassandraStorage class, I realize I have no idea how
>>>> it used to work, since it only seems to look at System variables.
>>>> Those variables are set on the Job.getConfiguration object. I don't
>>>> know how that part of hadoop works, though... do variables that get
>>>> set on Job on the master get propagated to the task threads? I do
>>>> know that on my DIY cluster, I do NOT set those system variables on
>>>> the slaves...
>>>>
>>>> Thanks!
>>>>
>>>> will
>>
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr