RAID may be less valuable to you here. It would be more useful to split the storage as described at http://wiki.apache.org/cassandra/CassandraHardware
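As a concrete illustration, here is a hedged sketch of what that split might look like in the 0.6-era storage-conf.xml. The directory paths are hypothetical examples, and the element names should be checked against the configuration file shipped with your version:

```xml
<!-- Hypothetical excerpt from storage-conf.xml (Cassandra 0.6.x); paths are examples only. -->

<!-- Commit log: sequential, append-only streaming writes; give it its own physical device. -->
<CommitLogDirectory>/mnt/commitlog/cassandra</CommitLogDirectory>

<!-- SSTable data files: dominated by random reads; keep on separate physical storage. -->
<DataFileDirectories>
    <DataFileDirectory>/mnt/data/cassandra</DataFileDirectory>
</DataFileDirectories>
```

The point is simply that the commit log and the data files land on different spindles, so streaming writes never compete with random reads for the same disk heads.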
When Cassandra is accessing effectively random parts of a large data store, expect it to constantly hit certain "always hot" parts of files while doing random reads elsewhere. The "hot" data is generally cached by your OS automatically. When Cassandra is handling many insertions (changes) or deletions, expect it to do bulk file streaming. These two types of activity are easy to split apart, which can have the tremendous benefit of separating streaming access patterns from random ones. From the literature so far, this will usually be more effective than trying to increase aggregate disk performance by putting both types of data on the same physical storage.

On Tue, May 11, 2010 at 9:57 AM, Wayne <wav...@gmail.com> wrote:
> I am evaluating Cassandra, and read latency is the biggest concern in terms
> of performance. As I test various scenarios and configurations I am getting
> surprising results. I have a 2-node cluster with both nodes connected to
> direct-attached storage. The read latency pulling data off the RAID 10
> storage is worse than off of the internal drive. The drives are of the same
> SATA 7200 RPM speed, and this does not make sense. This is for single,
> isolated requests; obviously at scale the RAID should perform better. I
> have not started testing concurrent reads at scale, as the single reads are
> too slow to begin with. I am getting 20-30 ms response time off of internal
> drives and 50-70 ms response time through the RAID volumes (as reported in
> cfstats). The system is totally idle and all data has been cleanly
> compacted. These both seem like very high numbers. All caching has been
> turned off for testing, as we expect our cache hit ratio to not be that
> good. More spindles usually speed things up, but I am seeing the opposite.
> I am using default settings for configuration. My write latency is very
> good and in line with what I see in terms of posted benchmarks.
>
> What are the recommended solutions to reduce read latency in terms of CF
> definition, Cassandra configuration, hardware, etc.?
>
> Do more keyspaces & column families increase latency (I originally saw 3-5
> ms read latency with a small amount of data and 1 keyspace/CF)?
>
> Shouldn't RAID 10 help overall latency and throughput (more, faster disks
> are better)?
>
> What is a "normal" expected read latency with no cache?
>
> I am using super columns; would read latency and overall performance be
> better if I used a compound column instead?
>
> I have many different CFs to isolate different data (some with the same
> key); would I be better served to combine CFs and thereby reduce the number
> of CFs and possibly increase key cache hits (at the cost of bigger rows)?
> I am testing with 10 keyspaces and 6 CFs each.
>
> Any recommendations would be appreciated.
>
> Thanks.