That's good info! Thanks!

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


2013/1/16 William Oberman <ober...@civicscience.com>

DataStax recommended (I forget the reference) using the ephemeral disks
in RAID0, which is what I've been running for well over a year now in
production.

In terms of how I'm doing Cassandra/AWS/Hadoop, I started by doing the
split data center thing (one DC for low latency queries, one DC for
Hadoop). But that's a lot of system management, and compute is the most
expensive part of AWS; you need a LOT of compute to run this setup. I
tried doing Cassandra EC2 cluster -> snapshot -> clone cluster with
Hadoop overlay -> ETL to S3 using Hadoop -> EMR for the real work. But
that's kind of a pain too (and the ETL to S3 wasn't very fast).

Now I'm going after the SSTables directly(*), which sounds like how
Netflix does it. You can do incremental updates, if you're careful.

(*) Cassandra EC2 -> backup to "local" EBS -> remap EBS to another box ->
sstable2json over "new" sstables -> S3 (splitting into ~100MB parts),
then use EMR to consume the JSON part files.
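A rough sketch of that (*) step; the keyspace, column family, paths, and
bucket below are placeholders, and s3cmd stands in for whatever S3 client
you use (sstable2json itself ships with Cassandra):

-------
#!/bin/bash
# Dump sstables to JSON, split into ~100MB parts, and push to S3.
KEYSPACE=MyKeyspace                      # placeholder
CF=MyColumnFamily                        # placeholder
SRC=/mnt/ebs/cassandra/data/$KEYSPACE    # the remapped EBS volume
OUT=/mnt/work/json

mkdir -p "$OUT"
find "$SRC" -name "$CF-*-Data.db" | while read -r sstable; do
  base=$(basename "$sstable" -Data.db)
  # sstable2json writes the JSON for one sstable to stdout
  sstable2json "$sstable" > "$OUT/$base.json"
  # split into ~100MB parts so EMR mappers get evenly sized inputs
  split -b 100M "$OUT/$base.json" "$OUT/$base.part-"
  rm "$OUT/$base.json"
done

# upload the part files (s3cmd is just one option)
s3cmd put "$OUT"/*.part-* s3://my-bucket/cassandra-json/
-------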
will


On Wed, Jan 16, 2013 at 3:30 PM, Marcelo Elias Del Valle
<mvall...@gmail.com> wrote:

William,

I just saw your message today. I am using Cassandra + Amazon EMR (Hadoop
1.0.3), but I am not using Pig as you are. I set my configuration vars in
Java, as I have a custom jar file and I am using ColumnFamilyInputFormat.

However, if I understood your problem well, the only thing you have to do
is set environment vars when running cluster tasks, right? Take a look at
this link:
http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
As it shows, you can start EMR with command line arguments that specify a
script to be executed on each machine in the cluster before the job
starts. This way, you would be able to correctly set the vars you need.
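For example, a bootstrap action along these lines would export the three
variables on every node before the task trackers start (a sketch; the
hadoop-user-env.sh path is the per-node env hook on the standard EMR
Hadoop AMIs, and XXX is a placeholder for the Cassandra host):

-------
#!/bin/bash
# EMR bootstrap action: runs on every node (master and slaves) at startup.
# Appending to hadoop-user-env.sh lets the task trackers inherit the vars.
cat >> /home/hadoop/conf/hadoop-user-env.sh <<'EOF'
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
EOF
-------

You would upload the script to S3 and pass its location with
--bootstrap-action when creating the job flow.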
Out of curiosity, could you share what you are using for Cassandra
storage? I am currently using EC2 local disks, but I am looking for an
alternative.

Best regards,
Marcelo.


2013/1/4 William Oberman <ober...@civicscience.com>

So I've made it work, but I don't "get it" yet.

I have no idea why my DIY cluster works when I set the environment
variables on the machine that kicks off pig ("master"), and in EMR it
doesn't. I recompiled ConfigHelper and CassandraStorage with tons of
debugging, and in EMR I can see the Hadoop Configuration object get the
proper values on the master node, and I can see that it does NOT
propagate to the task threads.

The other part that was driving me nuts could be made more user friendly.
The issue is this: I started by trying to set cassandra.thrift.address,
cassandra.thrift.port, and cassandra.partitioner.class in mapred-site.xml,
and it didn't work. After even more painful debugging, I noticed that the
only time Cassandra sets the input/output versions of those settings (and
the input/output-specific versions are the only versions really used!) is
when it maps the system environment variables. So having
cassandra.thrift.address in mapred-site.xml does NOTHING; I needed to have
cassandra.output.thrift.address set. It would be much nicer if the
get{Input/Output}XYZ methods fell back to getXYZ when the
input/output-specific setting is empty/null. E.g. in
getOutputThriftAddress(), if that setting is null, it would have been
nice if the method returned getThriftAddress().

My problem went away when I put the full cross product in the XML, e.g.
cassandra.input.thrift.address and cassandra.output.thrift.address (and
port, and partitioner).
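Concretely, that full cross product in mapred-site.xml looks something
like this (10.0.0.1 stands in for the Cassandra host; the property names
are the ones Cassandra's ConfigHelper reads):

-------
<!-- 10.0.0.1 is a placeholder for the Cassandra host -->
<property>
  <name>cassandra.input.thrift.address</name>
  <value>10.0.0.1</value>
</property>
<property>
  <name>cassandra.output.thrift.address</name>
  <value>10.0.0.1</value>
</property>
<property>
  <name>cassandra.input.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.output.thrift.port</name>
  <value>9160</value>
</property>
<property>
  <name>cassandra.input.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
<property>
  <name>cassandra.output.partitioner.class</name>
  <value>org.apache.cassandra.dht.RandomPartitioner</value>
</property>
-------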
I still want to know why the old easy way (setting the 3 system variables
on the pig starter box and having the config flow into the task trackers)
doesn't work!

will


On Fri, Jan 4, 2013 at 9:04 AM, William Oberman
<ober...@civicscience.com> wrote:

On all tasktrackers, I see:

java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set
    at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


On Thu, Jan 3, 2013 at 10:45 PM, aaron morton <aa...@thelastpickle.com>
wrote:

> Instead, I get an error from CassandraStorage that the initial address
> isn't set (on the slave, the master is ok).

Can you post the full error?

Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 4/01/2013, at 11:15 AM, William Oberman <ober...@civicscience.com>
wrote:

Anyone ever try to read or write directly between EMR <-> Cassandra?

I'm running various Cassandra resources in EC2, so the "physical
connection" part is pretty easy using security groups. But I'm having
some configuration issues. I have managed to get Cassandra + Hadoop
working in the past using a DIY Hadoop cluster, and looking at the
configurations in the two environments (EMR vs DIY), I'm not sure what's
different that is causing my failures... I should probably note I'm using
the Pig integration of Cassandra.

Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.

I'm 99% sure I have the classpaths working (because I didn't at first,
and now EMR can find and instantiate CassandraStorage on master and
slaves). What isn't working are the system variables. In my DIY cluster,
all I needed to do was:
-------
export PIG_INITIAL_ADDRESS=XXX
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
-------
And the task trackers somehow magically picked up the values (I never
questioned how/why). But in EMR, they do not. Instead, I get an error
from CassandraStorage that the initial address isn't set (on the slave;
the master is ok).

My DIY cluster used CDH3, which was Hadoop 0.20.something. So maybe the
problem is a different version of Hadoop?

Looking at the CassandraStorage class, I realize I have no idea how it
used to work, since it only seems to look at system variables. Those
variables are set on the Job.getConfiguration object. I don't know how
that part of Hadoop works, though... do variables that get set on the Job
on the master get propagated to the task threads? I do know that on my
DIY cluster I do NOT set those system variables on the slaves...

Thanks!

will