What benefit does a SAN give you? I've generally been confused by that approach, so I'm assuming I am missing something.
On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander <jason.alexan...@match.com> wrote:
> FWIW, I'd love to see some guidance here too -
>
> From our standpoint, we'll be consolidating the various Match.com sites'
> (match.com, chemistry.com, etc...) data into a single data warehouse, running
> Cassandra. We're looking at roughly the same amounts of data (30 TB or
> more). We were assuming 3-5 big servers sitting atop a SAN. But, again, just
> a guess and following existing conventions we use for other systems.
>
> ________________________________________
> From: banks [bankse...@gmail.com]
> Sent: Wednesday, April 07, 2010 8:47 PM
> To: user@cassandra.apache.org
> Subject: Re: Integrity of batch_insert and also what about sharding?
>
> What I'm trying to wrap my head around is what is the break even point...
>
> If I'm going to store 30 terabytes in this thing... what's optimum to give me
> performance and scalability... is it best to be running 3 powerful nodes,
> 100 smaller nodes, nodes on each web blade with 300g behind each... ya know?
> I'm sure there is a point where the gossip chatter becomes overwhelming and
> ups and downs to each... I have not really seen a best practices document
> that gives the pros and cons to each method of scaling.
>
> one 64proc 90gig memory mega machine running a single node cassandra... but
> on a raid5 SAN, good? bad? why?
>
> 30 web blades each running a cassandra node, each with 1tb local raid5
> storage, good, bad, why?
>
> I get that every implementation is different, what I'm looking for is what
> the known proven optimum is for this software... and what's to be avoided
> because it's a given that it doesn't work.
>
> On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black
> <b...@b3k.us> wrote:
> That depends on your goals for fault tolerance and recovery time. If
> you use RAID1 (or other redundant configuration) you can tolerate disk
> failure without Cassandra having to do repair. For large data sets,
> that can be a significant win.
>
>
> b
>
> On Wed, Apr 7, 2010 at 6:02 PM, banks
> <bankse...@gmail.com> wrote:
>> Then from an IT standpoint, if I'm using a RF of 3, it stands to reason that
>> running on Raid 1 makes sense, since RAID and RF achieve the same ends... it
>> makes sense to stripe for speed and let cassandra deal with redundancy, eh?
>>
>>
>> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black
>> <b...@b3k.us> wrote:
>>>
>>> On Wed, Apr 7, 2010 at 3:41 PM, banks
>>> <bankse...@gmail.com> wrote:
>>> >
>>> > 2. each cassandra node essentially has the same datastore as all nodes,
>>> > correct?
>>>
>>> No. The ReplicationFactor you set determines how many copies of a
>>> piece of data you want. If your number of nodes is higher than your
>>> RF, as is common, you will not have the same data on all nodes. The
>>> exact set of nodes to which data is replicated is determined by the
>>> row key, placement strategy, and node tokens.
>>>
>>> > So if I've got 3 terabytes of data and 3 cassandra nodes I'm
>>> > eating 9tb on the SAN? are there provisions for essentially sharding
>>> > across nodes... so that each node only handles a given keyrange, if so
>>> > where is the howto on that?
>>> >
>>>
>>> Sharding is a concept from databases that don't have native
>>> replication and so need a term to describe what they bolt on for the
>>> functionality. Distribution amongst nodes based on key ranges is how
>>> Cassandra always operates.
>>>
>>>
>>> b
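
To make the quoted point about ReplicationFactor and key ranges concrete, here is a rough Python sketch of how a row key could map to RF nodes on a token ring. It is only an illustration, not Cassandra's actual code: the MD5 hashing, the 2**127 ring size, the "walk clockwise from the first token >= the key's token" placement, and every function and node name below are assumptions made for the example. The last helper just restates the 3 TB x RF 3 = 9 TB arithmetic from the thread.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed size of the token space for this sketch


def token_for(key):
    """Hash a row key to a position on the ring (MD5, as an illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def replicas_for_token(token, node_tokens, rf):
    """Pick up to rf distinct nodes: the first node whose token is >= `token`,
    then the following nodes clockwise, wrapping past the end of the ring."""
    if not node_tokens:
        raise ValueError("need at least one node")
    ring = sorted((tok, name) for name, tok in node_tokens.items())
    tokens = [tok for tok, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    count = min(rf, len(ring))  # cannot hold more copies than there are nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def replicas_for(key, node_tokens, rf):
    """Replica nodes for a row key under the simple placement above."""
    return replicas_for_token(token_for(key), node_tokens, rf)


def raw_storage_tb(unique_tb, rf):
    """Total raw storage across the cluster: every piece of data is kept rf times."""
    return unique_tb * rf


if __name__ == "__main__":
    # Six hypothetical nodes with evenly spaced tokens.
    nodes = {"node%d" % i: i * (RING_SIZE // 6) for i in range(6)}
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", replicas_for(key, nodes, rf=3))
    print(raw_storage_tb(3, rf=3), "TB raw for 3 TB of unique data at RF=3")
```

With six nodes and RF=3, each key lands on only three of the six nodes, so no node holds the whole data set. With three nodes and RF=3, every node holds everything, which is the 9 TB case asked about above.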