Putting Cassandra's data directories on a SAN is like putting a bunch of
F1s on one of those big car carrier trucks and entering a race with the
truck. You know, since you have so much horsepower.
On 4/7/10 7:28 PM, Jason Alexander wrote:
Well, IANAITG (I Am Not An IT Guy), but beyond the normal benefits you get from a SAN
(which you can, of course, get from other options), I believe our IT group likes it
for the management aspects - they like to buy a BigAssSAN(tm) and provision storage to
different clusters, environments, etc... I'm sure the choice is also heavily weighted
by the fact that it's "the devil we know".
-----Original Message-----
From: Benjamin Black [mailto:b...@b3k.us]
Sent: Wednesday, April 07, 2010 9:05 PM
To: user@cassandra.apache.org
Subject: Re: Integrity of batch_insert and also what about sharding?
What benefit does a SAN give you? I've generally been confused by
that approach, so I'm assuming I am missing something.
On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander
<jason.alexan...@match.com> wrote:
FWIW, I'd love to see some guidance here too -
From our standpoint, we'll be consolidating the various Match.com sites'
(match.com, chemistry.com, etc...) data into a single data warehouse, running
Cassandra. We're looking at roughly the same amount of data (30 TB or more).
We were assuming 3-5 big servers sitting atop a SAN. But, again, just a guess
and following existing conventions we use for other systems.
________________________________________
From: banks [bankse...@gmail.com]
Sent: Wednesday, April 07, 2010 8:47 PM
To: user@cassandra.apache.org
Subject: Re: Integrity of batch_insert and also what about sharding?
What I'm trying to wrap my head around is where the break-even point is...
If I'm going to store 30 terabytes in this thing, what's optimal for
performance and scalability? Is it best to be running 3 powerful nodes, 100
smaller nodes, or nodes on each web blade with 300 GB behind each... ya know? I'm
sure there is a point where the gossip chatter becomes overwhelming, and each
approach has its ups and downs... I have not really seen a best-practices
document that gives the pros and cons of each method of scaling.

One 64-processor, 90 GB memory mega-machine running a single-node Cassandra, but on
a RAID 5 SAN: good? bad? why?

30 web blades each running a Cassandra node, each with 1 TB of local RAID 5 storage:
good? bad? why?

I get that every implementation is different; what I'm looking for is the
known, proven optimum for this software... and what's to be avoided because
it's a given that it doesn't work.
On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black <b...@b3k.us> wrote:
That depends on your goals for fault tolerance and recovery time. If
you use RAID1 (or other redundant configuration) you can tolerate disk
failure without Cassandra having to do repair. For large data sets,
that can be a significant win.
b
On Wed, Apr 7, 2010 at 6:02 PM, banks <bankse...@gmail.com> wrote:
Then from an IT standpoint, if I'm using an RF of 3, it stands to reason that
running on RAID 0 makes sense, since RAID 1 and RF achieve the same ends... it
makes sense to stripe for speed and let Cassandra deal with redundancy, eh?
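As a back-of-envelope check on that reasoning (a sketch only; it assumes
RAID 1 mirroring doubles raw disk per replica and RAID 0 striping adds no
capacity overhead):

    # Raw disk needed for a given amount of unique data (Python).
    # Assumption: RAID 1 multiplies raw storage by 2, RAID 0 by 1.
    def raw_disk_tb(unique_tb, replication_factor, raid_multiplier):
        return unique_tb * replication_factor * raid_multiplier

    print(raw_disk_tb(3, 3, 2))  # RF=3 on RAID 1: 18 TB raw
    print(raw_disk_tb(3, 3, 1))  # RF=3 on RAID 0:  9 TB raw

So stacking RAID 1 under RF=3 doubles the disk bill again, which is why
striping plus Cassandra-level redundancy is tempting; the cost, per the reply
above, is that any disk failure then forces Cassandra itself to do repair.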
On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black <b...@b3k.us> wrote:
On Wed, Apr 7, 2010 at 3:41 PM, banks <bankse...@gmail.com> wrote:
2. Each Cassandra node essentially has the same datastore as all other nodes,
correct?
No. The ReplicationFactor you set determines how many copies of a
piece of data you want. If your number of nodes is higher than your
RF, as is common, you will not have the same data on all nodes. The
exact set of nodes to which data is replicated is determined by the
row key, placement strategy, and node tokens.
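To make "node tokens" concrete: with the RandomPartitioner, each node owns
the slice of the MD5 token space between its predecessor's token and its
own. A minimal sketch of evenly spaced initial tokens (assuming the 0.6-era
0..2**127 token space; Python):

    # Evenly spaced initial tokens for RandomPartitioner, whose token
    # space runs from 0 to 2**127. One token per node.
    def initial_tokens(node_count):
        return [i * (2**127 // node_count) for i in range(node_count)]

    for token in initial_tokens(3):
        print(token)

Each node's InitialToken then pins its key range, and the placement strategy
walks the ring from there to pick the other RF-1 replicas.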
So if I've got 3 terabytes of data and 3 Cassandra nodes, am I eating 9 TB
on the SAN? Are there provisions for essentially sharding across nodes, so
that each node only handles a given key range? If so, where is the howto on
that?
Sharding is a concept from databases that don't have native
replication and so need a term to describe what they bolt on for the
functionality. Distribution amongst nodes based on key ranges is how
Cassandra always operates.
b
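For the curious, a minimal sketch of that key-range distribution, assuming
RandomPartitioner and the simple rack-unaware placement strategy (the node
addresses and tokens below are invented for illustration):

    import hashlib
    from bisect import bisect_left

    # Illustrative 4-node ring: address -> token, evenly spaced over
    # RandomPartitioner's 0..2**127 token space.
    RING = sorted(
        {"10.0.0.1": 0 * (2**127 // 4),
         "10.0.0.2": 1 * (2**127 // 4),
         "10.0.0.3": 2 * (2**127 // 4),
         "10.0.0.4": 3 * (2**127 // 4)}.items(),
        key=lambda item: item[1],
    )

    def replicas(row_key, rf):
        # RandomPartitioner hashes the row key with MD5 onto the ring.
        token = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % 2**127
        tokens = [t for _, t in RING]
        # The first node with token >= the key's token holds the primary
        # replica; the next rf-1 nodes on the ring hold the rest.
        start = bisect_left(tokens, token) % len(RING)
        return [RING[(start + i) % len(RING)][0] for i in range(rf)]

    # With 4 nodes and RF=3, each key lands on 3 of the 4 nodes:
    print(replicas("some-row-key", 3))

So with more nodes than the replication factor, no node carries the full
data set; each carries only the ranges the ring assigns it.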