I keep a "peer" EBS disk alongside each ephemeral disk. The flow is: nodetool
snapshot -> rsync from ephemeral to EBS -> take a snapshot of the EBS volume.
Syncing the nodetool snapshot directly to S3 would involve fewer steps and be
cheaper (EBS costs more than S3), but I post-process the snapshot for EMR,
and it seemed pointless to push to and pull from S3 when I could just attach
the EBS disks wherever I need them. I snapshot the EBS disk as a failsafe, and
snapshots are cheap (they cost the same as S3).
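
For what it's worth, here is a rough sketch of that three-step flow in Python. It only builds the command lines and prints them rather than running anything. The keyspace name, snapshot tag, mount points and volume ID are all made-up placeholders, and using the AWS CLI for the EBS snapshot is my assumption here (any EC2 tooling would do). It also copies the whole keyspace directory for simplicity; narrowing rsync down to just the snapshot directories is left out.

```python
import shlex


def backup_commands(keyspace, tag, data_dir, ebs_mount, volume_id):
    """Return the three backup steps as argv lists, in the order they run."""
    return [
        # 1. Take a named snapshot of the keyspace on the ephemeral disk.
        ["nodetool", "snapshot", "-t", tag, keyspace],
        # 2. Copy the keyspace directory from the ephemeral disk to the EBS disk.
        ["rsync", "-a", f"{data_dir}/{keyspace}/", f"{ebs_mount}/{keyspace}/"],
        # 3. Snapshot the EBS volume as a failsafe.
        ["aws", "ec2", "create-snapshot", "--volume-id", volume_id,
         "--description", f"cassandra {keyspace} {tag}"],
    ]


if __name__ == "__main__":
    # All arguments below are hypothetical placeholders.
    steps = backup_commands(
        keyspace="my_keyspace",
        tag="daily",
        data_dir="/mnt/ephemeral/cassandra/data",
        ebs_mount="/mnt/ebs/cassandra",
        volume_id="vol-00000000",
    )
    for argv in steps:
        print(shlex.join(argv))
```

To actually run the steps you could pass each argv list to subprocess.run(argv, check=True), so a failed rsync stops the flow before the EBS snapshot is taken.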

I've seen/read about how other people just watch the data directories for
new SSTables and trigger a copy to S3 (there are open source projects that do
that for you, I believe).
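
A minimal sketch of the "watch the data directory" idea, in Python, might look like the following. It is not how any particular project does it. The file naming is an assumption on my part (finished data files ending in "-Data.db", in-progress ones containing "-tmp-"), and the S3 upload itself is left out.

```python
import os


def find_new_sstables(data_dir, seen):
    """Return finished SSTable data files not yet in `seen`, and record them."""
    fresh = []
    for root, _dirs, files in os.walk(data_dir):
        # Skip snapshot directories; they hold hard links to existing
        # SSTables, not new ones.
        if "snapshots" in root.split(os.sep):
            continue
        for name in files:
            # Assumed naming: data files end in "-Data.db" and files still
            # being written contain "-tmp-".
            if name.endswith("-Data.db") and "-tmp-" not in name:
                path = os.path.join(root, name)
                if path not in seen:
                    seen.add(path)
                    fresh.append(path)
    return sorted(fresh)
```

You would call this in a loop with a sleep between polls, keep the same `seen` set across calls, and hand each returned path to whatever does the S3 copy. A real version would also pick up the other component files that sit next to each data file.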

And I think a lot of people rely on the replication factor + multiple zones.


On Thu, Jan 17, 2013 at 8:44 AM, Adam Venturella <aventure...@gmail.com> wrote:

> Jared, how do you guys handle data backups for your ephemeral-based
> cluster?
> I'm trying to move to ephemeral drives myself, and that was my last
> sticking point; asking how others in the community deal with backup in case
> the VM explodes.
> On Wed, Jan 16, 2013 at 1:21 PM, Jared Biel <jared.b...@bolderthinking.com> wrote:
>> We're currently using Cassandra on EC2 at very low scale (a 2-node
>> cluster on m1.large instances in two regions). I believe EBS is not
>> recommended, for performance reasons. Also, it has proven to be
>> very unreliable in the past (most of the big/notable AWS outages were
>> due to EBS issues). We've moved 99% of our instances off of EBS.
>> As others have said, if you require more space in the future it's easy
>> to add more nodes to the cluster. I've found this page
>> (http://www.ec2instances.info/) very useful in determining the amount
>> of space each instance type has. Note that by default only one
>> ephemeral drive is attached and you must specify all ephemeral drives
>> that you want to use at launch time. Also, you can create a RAID 0 of
>> all local disks to provide maximum speed and space.
>> On 16 January 2013 20:42, Marcelo Elias Del Valle <mvall...@gmail.com>
>> wrote:
>> > Hello,
>> >
>> >    I am currently using Hadoop + Cassandra on Amazon AWS. Cassandra runs on
>> > EC2 and my Hadoop process runs on EMR. For Cassandra storage, I am using
>> > local EC2 EBS disks.
>> >    My system is running fine for my tests, but to me it's not a good setup
>> > for production. I need my system to perform well, especially for writes on
>> > Cassandra, but the amount of data could grow really big, taking several TB
>> > of total storage.
>> >    My first guess was to use S3 as storage, and I saw this can be done by
>> > using the Cloudian package, but I wouldn't like to become dependent on a
>> > pre-packaged solution, and I found it's kind of expensive for more than 100 TB:
>> > http://www.cloudian.com/pricing.html
>> >    I have also seen some discussion on the internet about using EBS or
>> > ephemeral disks for storage at Amazon.
>> >
>> >    My question is: does anyone on this list have the same problem as me?
>> > What are you using as a solution for Cassandra's storage when running it on
>> > Amazon AWS?
>> >
>> >    Any thoughts would be highly appreciated.
>> >
>> > Best regards,
>> > --
>> > Marcelo Elias Del Valle
>> > http://mvalle.com - @mvallebr
