Re: Integrity of batch_insert and also what about sharding?

Benjamin Black Wed, 07 Apr 2010 19:05:29 -0700

What benefit does a SAN give you?  I've generally been confused by
that approach, so I'm assuming I am missing something.


On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander
<jason.alexan...@match.com> wrote:
> FWIW, I'd love to see some guidance here too -
>
> From our standpoint, we'll be consolidating the data from the various Match.com
> sites (match.com, chemistry.com, etc.) into a single data warehouse running
> Cassandra. We're looking at roughly the same amount of data (30 TB or
> more). We were assuming 3-5 big servers sitting atop a SAN. But, again, that's
> just a guess, following the conventions we use for other systems.
>
>
> ________________________________________
> From: banks [bankse...@gmail.com]
> Sent: Wednesday, April 07, 2010 8:47 PM
> To: user@cassandra.apache.org
> Subject: Re: Integrity of batch_insert and also what about sharding?
>
> What I'm trying to wrap my head around is what is the break even point...
>
> If I'm going to store 30 terabytes in this thing... what's optimum to give me
> performance and scalability... is it best to be running 3 powerful nodes,
> 100 smaller nodes, nodes on each web blade with 300 GB behind each... ya know?
> I'm sure there is a point where the gossip chatter becomes overwhelming, and
> there are ups and downs to each... I have not really seen a best practices
> document that gives the pros and cons of each method of scaling.
>
> One 64-processor, 90 GB memory mega machine running a single Cassandra node,
> but on a RAID 5 SAN: good? bad? why?
>
> 30 web blades each running a Cassandra node, each with 1 TB of local RAID 5
> storage: good, bad, why?
>
> I get that every implementation is different; what I'm looking for is what
> the known, proven optimum is for this software... and what's to be avoided
> because it's a given that it doesn't work.
>
> On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black 
> <b...@b3k.us> wrote:
> That depends on your goals for fault tolerance and recovery time.  If
> you use RAID1 (or other redundant configuration) you can tolerate disk
> failure without Cassandra having to do repair.  For large data sets,
> that can be a significant win.
>
>
> b
>
> On Wed, Apr 7, 2010 at 6:02 PM, banks 
> <bankse...@gmail.com> wrote:
>> Then from an IT standpoint, if I'm using an RF of 3, it stands to reason that
>> running on RAID 0 makes sense, since RAID 1 and RF achieve the same ends... it
>> makes sense to stripe for speed and let Cassandra deal with redundancy, eh?
>>
>>
>> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black 
>> <b...@b3k.us> wrote:
>>>
>>> On Wed, Apr 7, 2010 at 3:41 PM, banks 
>>> <bankse...@gmail.com> wrote:
>>> >
>>> > 2. Each Cassandra node essentially has the same datastore as all nodes,
>>> > correct?
>>>
>>> No.  The ReplicationFactor you set determines how many copies of a
>>> piece of data you want.  If your number of nodes is higher than your
>>> RF, as is common, you will not have the same data on all nodes.  The
>>> exact set of nodes to which data is replicated is determined by the
>>> row key, placement strategy, and node tokens.
>>>
>>> > So if I've got 3 terabytes of data and 3 Cassandra nodes, I'm
>>> > eating 9 TB on the SAN?  Are there provisions for essentially sharding
>>> > across nodes, so that each node only handles a given key range?  If so,
>>> > where is the howto on that?
>>> >
>>>
>>> Sharding is a concept from databases that don't have native
>>> replication and so need a term to describe what they bolt on for the
>>> functionality.  Distribution amongst nodes based on key ranges is how
>>> Cassandra always operates.
>>>
>>>
>>> b
>>
>>
>
>

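The replica placement described in the thread (row key, placement strategy, and node tokens deciding which nodes hold a row) can be sketched roughly as below. This is only an illustration under stated assumptions, not Cassandra's actual code: it assumes one evenly spaced token per node, an MD5 hash of the row key as the token, and a simple "next nodes clockwise on the ring" placement. The names `token_for`, `build_ring`, `replicas_for`, and `raw_storage_tb` are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # MD5 yields 128-bit values, so tokens fall in [0, 2**128)


def token_for(row_key: str) -> int:
    """Map a row key to a position on the ring (assumed: MD5 of the key)."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


def build_ring(node_names):
    """Assign one evenly spaced token per node; the result is sorted by token."""
    step = RING_SIZE // len(node_names)
    return [(i * step, name) for i, name in enumerate(node_names)]


def replicas_for(row_key, ring, rf):
    """Return the nodes holding a row: the first node whose token is >= the
    key's token, then the following nodes clockwise until rf copies exist."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(row_key)) % len(ring)
    copies = min(rf, len(ring))  # cannot place more copies than there are nodes
    return [ring[(start + i) % len(ring)][1] for i in range(copies)]


def raw_storage_tb(data_tb, rf):
    """Total raw storage consumed across the cluster, ignoring overhead."""
    return data_tb * rf


if __name__ == "__main__":
    ring = build_ring([f"node{i}" for i in range(6)])
    for key in ("alice", "bob", "carol"):
        print(key, "->", replicas_for(key, ring, rf=3))
    print("3 TB at RF=3:", raw_storage_tb(3, 3), "TB raw in total")
    print("30 TB at RF=3 on 30 nodes:", raw_storage_tb(30, 3) / 30, "TB per node")
```

With 3 nodes and RF=3, every row lands on all three nodes, so 3 TB of data does consume 9 TB in total. With more nodes than the RF, each node holds only the key ranges it is a replica for, which is the point made in the thread.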