First, the unhelpful advice: I strongly suggest changing the data model
so you do not have 100MB+ rows. They will make life harder.
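The usual way to get rid of oversized rows is to bucket them: split each
logical row across several physical rows under a compound key, and have
readers fetch the buckets in order. A minimal sketch in Java - the
"<key>:<bucket>" layout and the 10MB bucket size are illustrative
assumptions, not something from this thread:

    // Illustrative only: derive the physical row key for a chunk of an
    // oversized logical row. Key layout and bucket size are assumptions.
    public final class RowBuckets {
        private static final long BUCKET_BYTES = 10L * 1024 * 1024; // ~10MB cap per row

        // Map an offset within the logical row to the physical row key
        // that should hold that chunk, e.g. "2012-12-21:3".
        static String bucketKey(String logicalKey, long byteOffset) {
            return logicalKey + ":" + (byteOffset / BUCKET_BYTES);
        }
    }

Readers then fetch buckets 0..N for a logical key and concatenate, which
keeps any single row comfortably small.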
> Write request latency is about 900 microsecs, read request latency
> is about 4000 microsecs.

4 milliseconds to drag 100 to 300 MB of data off a SAN, through your
network, into C* and out to the client does not sound terrible at first
glance. Can you benchmark an individual request to get an idea of the
throughput?

I would recommend removing the SAN from the equation; Cassandra will run
better with local disks. A SAN also introduces a single point of failure
into a distributed system.

> but it's likely in the Linux disk cache, given the sizing of the
> node/data/jvm.

Are you sure that the local Linux machine is going to cache files stored
on the SAN?

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/12/2012, at 6:56 AM, Yiming Sun <yiming....@gmail.com> wrote:

> James, you could experiment with the row cache, with the off-heap JNA
> cache, and see if it helps. My own experience with the row cache was
> not good, and the OS cache seemed to be the most useful, but in my
> case our data space was big, over 10TB. Your sequential access pattern
> certainly doesn't play well with LRU, but given the small data space
> you have, you may be able to fit the data from one column family
> entirely into the row cache.
>
> On Fri, Dec 21, 2012 at 12:03 PM, James Masson <james.mas...@opigram.com>
> wrote:
>
> On 21/12/12 16:27, Yiming Sun wrote:
>> James, using RandomPartitioner, the order of the rows is random, so
>> when you request these rows in "sequential" order (sorted by date?),
>> Cassandra is not reading them sequentially.
>
> Yes, I understand the "next" row to be retrieved in sequence is likely
> to be on a different node, and that the ordering is random. I'm using
> the word sequential to explain that the data is requested in an order,
> and not repeated, until the next cycle. The data is not guaranteed to
> be of a size that is cache-able as a whole.
>
>> The sizes of the data - 200MB, 300MB, and 40MB - are these the sizes
>> of each column? Or the total sizes of the entire column families? It
>> wasn't too clear to me. But if these are the total sizes of the
>> column families, you will be able to fit them mostly in memory, so
>> you should enable the row cache.
>
> Size of the column family, on a single node. Row caching is off at the
> moment.
>
> Are you saying that I should increase the JVM heap to fit some data in
> the row cache, at the expense of Linux disk caching?
>
> Bear in mind that the data is only going to be re-requested in
> sequence again - I'm not sure what the value of Cassandra's native
> caching is if rows are not re-requested before being evicted.
>
> My current key-cache hit-rates are near zero on this workload, hence
> I'm interested in Cassandra's zero-cache performance. Unless I can
> guarantee to fit the entire data-set in memory, it's difficult to
> justify spending memory on a Cassandra cache if LRU and the workload
> mean it's not actually a benefit.
>
>> I happen to have done some performance tests of my own on Cassandra,
>> mostly on reads, and was also only able to get less than 6MB/sec read
>> rate out of a cluster of 6 nodes at RF=2 using a single-threaded
>> client. But it made a huge difference when I changed the client to an
>> asynchronous multi-threaded structure.
>
> Yes, I've been talking to the developers about having a separate
> thread or two that keeps Cassandra busy, keeping Disruptor
> (http://lmax-exchange.github.com/disruptor/) fed to do the processing
> work.
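To make that multi-threaded read pattern concrete, here is a minimal
sketch of one way to do it with Hector; the cluster name, host, thread
count, and key source are placeholders, not details from this thread:

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import me.prettyprint.cassandra.serializers.BytesArraySerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.SliceQuery;

    public class ParallelReader {
        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = HFactory.getOrCreateCluster("cluster", "cass-host:9160");
            final Keyspace ksp = HFactory.createKeyspace("mykeyspace", cluster);

            // Hector's Keyspace is safe to share between threads (it pools
            // connections), so N workers keep N requests in flight instead of 1.
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (final String key : keysInDateOrder()) {
                pool.submit(new Runnable() {
                    public void run() {
                        SliceQuery<String, byte[], byte[]> q =
                            HFactory.createSliceQuery(ksp, StringSerializer.get(),
                                BytesArraySerializer.get(), BytesArraySerializer.get());
                        q.setColumnFamily("entities");
                        q.setKey(key);
                        q.setRange(new byte[0], new byte[0], false, 1000);
                        q.execute(); // hand the slice off to the processing stage here
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        // Placeholder for however the date-ordered keys are produced.
        static List<String> keysInDateOrder() {
            return Arrays.asList("key-2012-12-01", "key-2012-12-02");
        }
    }

With per-request latencies in the 1-4 ms range, each extra in-flight
request buys roughly another request's worth of throughput until the
cluster itself saturates.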
> But this all doesn't change the fact that under this zero-cache
> workload, Cassandra seems to be very CPU-expensive for the throughput
> it delivers.
>
> thanks
>
> James M
>
> On Fri, Dec 21, 2012 at 10:36 AM, James Masson <james.mas...@opigram.com>
> wrote:
>
> Hi,
>
> thanks for the reply
>
> On 21/12/12 14:36, Yiming Sun wrote:
>> I have a few questions for you, James,
>>
>> 1. how many nodes are in your Cassandra ring?
>
> 2 or 3, depending on the environment - it doesn't seem to make much
> difference to throughput. What is a 30-minute task in a 2-node
> environment is still a 30-minute task in a 3-node environment.
>
>> 2. what is the replication factor?
>
> 1
>
>> 3. when you say sequentially, what do you mean? what partitioner do
>> you use?
>
> The data is organised by date - the keys are read sequentially in
> order, only once.
>
> RandomPartitioner - the data is equally spread across the nodes to
> avoid hotspots.
>
>> 4. how many columns per row? how much data per row? per column?
>
> It varies - described in the schema:
>
> create keyspace mykeyspace
>   with placement_strategy = 'SimpleStrategy'
>   and strategy_options = {replication_factor : 1}
>   and durable_writes = true;
>
> create column family entities
>   with column_type = 'Standard'
>   and comparator = 'BytesType'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'AsciiType'
>   and read_repair_chance = 0.0
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 0
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = false
>   and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'NONE'
>   and column_metadata = [
>     {column_name : '64656c65746564',
>      validation_class : BytesType,
>      index_name : 'deleted_idx',
>      index_type : 0},
>     {column_name : '6576656e744964',
>      validation_class : TimeUUIDType,
>      index_name : 'eventId_idx',
>      index_type : 0},
>     {column_name : '7061796c6f6164',
>      validation_class : UTF8Type}];
>
> 2 columns per row here - about 200MB of data in total.
>
> create column family events
>   with column_type = 'Standard'
>   and comparator = 'BytesType'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'TimeUUIDType'
>   and read_repair_chance = 0.0
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 0
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = false
>   and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'NONE';
>
> 1 column per row - about 300MB of data.
>
> create column family intervals
>   with column_type = 'Standard'
>   and comparator = 'BytesType'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'AsciiType'
>   and read_repair_chance = 0.0
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 0
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = false
>   and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'NONE';
>
> Variable columns per row - about 40MB of data.
>
>> 5. what client library do you use to access Cassandra? (Hector?)
>> Is your client code single-threaded?
>
> Hector - yes, the processing side of the client is single-threaded,
> but it is largely waiting for Cassandra responses and has plenty of
> CPU headroom.
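As an aside - this is a suggestion, not something raised in the thread:
since RandomPartitioner scatters the date-ordered keys anyway, if the
per-cycle ordering requirement could be relaxed, the whole column family
can be read in token order with a paged range query instead of per-key
random reads. A sketch with Hector; the page sizes are guesses:

    import me.prettyprint.cassandra.serializers.BytesArraySerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    public class TokenOrderScan {
        // Page through every row of a column family in token order.
        static void scan(Keyspace ksp, String cf) {
            String start = "";
            while (true) {
                RangeSlicesQuery<String, byte[], byte[]> q =
                    HFactory.createRangeSlicesQuery(ksp, StringSerializer.get(),
                        BytesArraySerializer.get(), BytesArraySerializer.get());
                q.setColumnFamily(cf);
                q.setKeys(start, "");              // from last seen key to ring end
                q.setRowCount(100);                // rows per page (a guess)
                q.setRange(new byte[0], new byte[0], false, 1000);

                OrderedRows<String, byte[], byte[]> rows = q.execute().get();
                for (Row<String, byte[], byte[]> row : rows) {
                    if (row.getKey().equals(start)) continue; // skip page-overlap row
                    // process(row) goes here
                }
                if (rows.getCount() < 100) break;  // short page means we're done
                start = rows.peekLast().getKey();
            }
        }
    }

The trade-off is that rows arrive in token order rather than date order,
so this only helps if the processing stage can tolerate, or re-sort,
that ordering.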
> I guess what I'm most interested in is the discrepancy between read
> and write latency - although I understand the data volume is much
> larger for reads, even though the request rate is lower.
>
> Network usage on a Cassandra box barely gets above 20Mbit/s, including
> inter-cluster comms. It averages about 5Mbit/s between client and
> Cassandra.
>
> There is near-zero disk I/O, and what little there is is served in
> under 1ms. Storage is backed by a very fast SAN, but like I said
> earlier, the dataset just about fits in the Linux disk cache: 2GB VM,
> 512MB Cassandra heap - GCs are nice and quick, there are no JVM memory
> problems, and used heap oscillates between 280-350MB.
>
> Basically, I'm just puzzled because Cassandra doesn't behave as I
> would expect: huge CPU use in Cassandra for very little throughput.
> I'm struggling to find anything wrong with the environment - there's
> no bottleneck that I can see.
>
> thanks
>
> James M
>
> On Fri, Dec 21, 2012 at 7:27 AM, James Masson <james.mas...@opigram.com>
> wrote:
>
>> Hi list-users,
>>
>> We have an application with a relatively unusual access pattern in
>> Cassandra 1.1.6.
>>
>> Essentially, we read an entire multi-hundred-megabyte column family
>> sequentially (little chance of a Cassandra cache hit), perform some
>> operations on the data, and write the data back to another column
>> family in the same keyspace.
>>
>> We do about 250 writes/sec and 100 reads/sec during this process.
>> Write request latency is about 900 microsecs; read request latency
>> is about 4000 microsecs.
>>
>> * First Question: Do these numbers make sense?
>>
>> Read-request latency seems a little high to me - Cassandra hasn't had
>> a chance to cache this data, but it's likely in the Linux disk cache,
>> given the sizing of the node/data/jvm.
>>
>> thanks
>>
>> James M
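For what it's worth, a back-of-envelope check on the numbers quoted in
this thread suggests the single-threaded client, not Cassandra, sets the
throughput ceiling:

    time the client thread spends blocked, per second of wall clock:
      reads:  100 req/s x 4.0 ms = 400 ms
      writes: 250 req/s x 0.9 ms = 225 ms
                             total 625 ms

So the one thread is waiting on request round-trips for roughly 60% of
its time, with the remaining ~375 ms/s left for processing. Under that
model, server-side cache tuning cannot raise throughput much; keeping
more requests in flight can.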