This may help in determining your data storage requirements: http://btoddb-cass-storage.blogspot.com/
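For the commit log specifically, the rule of thumb quoted below (the data inserted within the longest memtable flush period) can be turned into a quick back-of-the-envelope estimate. This is only a rough sketch: the insert counts are the 8-minute sample from the original message, while the 100 bytes per write and the 60-minute flush period are assumed placeholder values, and `commitlog_bound_bytes` is a hypothetical helper, not part of Cassandra.

```python
def commitlog_bound_bytes(writes_per_sec, bytes_per_write, flush_period_min):
    """Rough upper bound on commit log growth: bytes written in one flush period."""
    return writes_per_sec * bytes_per_write * flush_period_min * 60


# Inserts observed in 8 minutes (from the thread): U32s + doubles + strings.
inserts_in_sample = 2_000_000 + 600_000 + 170_000
sample_seconds = 8 * 60
writes_per_sec = inserts_in_sample / sample_seconds

# ASSUMPTION: ~100 bytes per write once column name, value, timestamp and
# per-row overhead are included. Measure this on your own cluster.
BYTES_PER_WRITE = 100

# ASSUMPTION: longest timed flush period across all CFs, in minutes.
# Substitute the largest memtable_flush_after value you actually configure.
FLUSH_PERIOD_MIN = 60

estimate = commitlog_bound_bytes(writes_per_sec, BYTES_PER_WRITE, FLUSH_PERIOD_MIN)
print(f"~{estimate / 1e9:.2f} GB of commit log per flush period")
```

Since the sample covered only a subset of the incoming data, treat the result as a floor and scale the write rate up accordingly.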

On 10/25/11 11:22 AM, "Mohit Anchlia" <mohitanch...@gmail.com> wrote:

>On Tue, Oct 25, 2011 at 11:18 AM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:
>>> 2. ... So I am going to use rotational disk for the commit log and an SSD for data. Does this make sense?
>>
>> Yes, just keep in mind however that the primary characteristic of SSDs is lower seek times, which translates into faster random access. We have a similar Cassandra use case (time series data and comparable volumes) and decided the random read performance boost (unquantified in our case, to be fair) was not worth the price, so we went with more, larger, cheaper 7.2k HDDs.
>>
>>> 3. What's the best way to find out how big my commitlog disk and my data disk has to be? The Cassandra hardware page says the Commitlog disk shouldn't be big but still I need to choose a size!
>>
>> As of Cassandra 1.0, the commit log has an explicit size bound (defaulting to 4GB, I believe). In 0.8, I don't think I have ever seen my commit log grow beyond that point, but the limit should be the amount of data you insert within the maximum CF timed flush period (the "memtable_flush_after" parameter; to be safe, the maximum across all CFs). Any modern drive should be sufficient. As for the size of your data disks, that is largely application dependent, and you should be able to judge best based on your current cluster.
>>
>>> 4. I also noticed RAID 0 configuration is recommended for the data file directory. Can anyone explain why?
>>
>> In comparison to RAID1/RAID1+0? For any RF > 1, Cassandra already takes care of redundancy by replicating the data across multiple nodes. Your application's choice of replication factor and read/write consistencies should be specified to tolerate a node failing (for any reason: disk failure, network failure, a disgruntled employee taking a sledgehammer to the box, etc). As such, what is the point of wasting your disks duplicating data on a single machine to minimize the chances of one particular type of failure when it should not matter anyway?
>
>It all boils down to operations cost vs hardware cost. Also consider MTBF and how equipped you are to handle disk failures, which are more common than other failures.
>
>> Dan
>>
>> From: Alexandru Sicoe [mailto:adsi...@gmail.com]
>> Sent: October-25-11 8:23
>> To: user@cassandra.apache.org
>> Subject: Cassandra cluster HW spec (commit log directory vs data file directory)
>>
>> Hi everyone,
>>
>> I am currently in the process of writing a hardware proposal for a Cassandra cluster for storing a lot of monitoring time series data. My workload is write intensive and my data set is extremely varied in types of variables and insertion rate for these variables (I will have to handle on the order of 2 million variables coming in, each at very different rates: the majority of them will come at very low rates, but there are many that will come at higher constant rates and a few coming in with huge spikes in rates). These variables correspond to all basic C++ types and arrays of these types. The highest insertion rates are received for basic types, out of which U32 variables seem to be the most prevalent (e.g. I recorded that 2 million U32 vars were inserted in 8 mins of operation, while 600,000 doubles and 170,000 strings were inserted during the same time. Note this measurement was only for a subset of the total data currently taken in).
>>
>> At the moment I am partitioning the data in Cassandra in 75 CFs (each CF corresponds to a logical partitioning of the set of variables mentioned before, but this partitioning is not related to the amount of data or rates... it is somewhat random). These 75 CFs account for ~1 million of the variables I need to store. I have a 3 node Cassandra 0.8.5 cluster (each node has 4 real cores and 4 GB RAM, with the commit log directory and data file directory split between two RAID arrays of HDDs). I can handle the load in this configuration but the average CPU usage of the Cassandra nodes is slightly above 50%. As I will need to add 12 more CFs (corresponding to another ~1 million variables) plus potentially other data later, it is clear that I need better hardware (also for the retrieval part).
>>
>> I am looking at Dell servers (Power Edge etc)
>>
>> Questions:
>>
>> 1. Is anyone using Dell HW for their Cassandra clusters? How do they behave? Anybody care to share their configurations or tips for buying, what to avoid etc?
>>
>> 2. Obviously I am going to keep to the advice on http://wiki.apache.org/cassandra/CassandraHardware and split the commitlog and data on separate disks. I was going to use SSD for commitlog but then did some more research and found out that it doesn't make sense to use SSDs for sequential appends because it won't have a performance advantage with respect to rotational media. So I am going to use rotational disk for the commit log and an SSD for data. Does this make sense?
>>
>> 3. What's the best way to find out how big my commitlog disk and my data disk has to be? The Cassandra hardware page says the Commitlog disk shouldn't be big but still I need to choose a size!
>>
>> 4. I also noticed RAID 0 configuration is recommended for the data file directory. Can anyone explain why?
>>
>> Sorry for the huge email.....
>>
>> Cheers,
>> Alex