Hi,
Thanks for the reply.
On 21/12/12 14:36, Yiming Sun wrote:
I have a few questions for you, James,
1. how many nodes are in your Cassandra ring?
2 or 3, depending on environment - it doesn't seem to make much
difference to throughput. A task that takes 30 minutes on a 2-node
environment also takes 30 minutes on a 3-node environment.
2. what is the replication factor?
1
3. when you say sequentially, what do you mean? what Partitioner do you
use?
The data is organised by date - the keys are read sequentially, in order,
only once.
RandomPartitioner - the data is spread evenly across the nodes to
avoid hotspots.
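As an illustration of why "sequential by key" doesn't mean "sequential on the ring": a rough sketch (not Cassandra's exact token math) of how RandomPartitioner-style MD5 hashing scatters date-ordered keys - the date keys here are hypothetical examples.

```python
import hashlib

def md5_token(key: bytes) -> int:
    # RandomPartitioner derives a ring token from an MD5 hash of the row key,
    # so keys that sort adjacently by date map to unrelated ring positions.
    # (Approximation: real tokens are confined to the 0..2**127-1 range.)
    return int.from_bytes(hashlib.md5(key).digest(), "big") % (1 << 127)

# Hypothetical date-ordered row keys, as in a sequential-by-date scan.
for key in (b"2012-12-19", b"2012-12-20", b"2012-12-21"):
    print(key.decode(), hex(md5_token(key)))
```

Because the tokens are effectively random, a key-ordered scan touches nodes in no particular order - that spreads load, but defeats any locality.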
4. how many columns per row? how much data per row? per column?
It varies - described in the schemas below.
create keyspace mykeyspace
with placement_strategy = 'SimpleStrategy'
and strategy_options = {replication_factor : 1}
and durable_writes = true;
create column family entities
with column_type = 'Standard'
and comparator = 'BytesType'
and default_validation_class = 'BytesType'
and key_validation_class = 'AsciiType'
and read_repair_chance = 0.0
and dclocal_read_repair_chance = 0.0
and gc_grace = 0
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = false
and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'NONE'
and column_metadata = [
{column_name : '64656c65746564',
validation_class : BytesType,
index_name : 'deleted_idx',
index_type : 0},
{column_name : '6576656e744964',
validation_class : TimeUUIDType,
index_name : 'eventId_idx',
index_type : 0},
{column_name : '7061796c6f6164',
validation_class : UTF8Type}];
2 columns per row here - about 200 MB of data in total
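As an aside for anyone reading the schema above: because the comparator is BytesType, the column_metadata names are hex-encoded ASCII. A small sketch decoding them:

```python
# The hex column names in the entities column family decode to ASCII:
for h in ("64656c65746564", "6576656e744964", "7061796c6f6164"):
    print(h, "->", bytes.fromhex(h).decode("ascii"))
# 64656c65746564 -> deleted
# 6576656e744964 -> eventId
# 7061796c6f6164 -> payload
```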
create column family events
with column_type = 'Standard'
and comparator = 'BytesType'
and default_validation_class = 'BytesType'
and key_validation_class = 'TimeUUIDType'
and read_repair_chance = 0.0
and dclocal_read_repair_chance = 0.0
and gc_grace = 0
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = false
and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'NONE';
1 column per row - about 300 MB of data
create column family intervals
with column_type = 'Standard'
and comparator = 'BytesType'
and default_validation_class = 'BytesType'
and key_validation_class = 'AsciiType'
and read_repair_chance = 0.0
and dclocal_read_repair_chance = 0.0
and gc_grace = 0
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = false
and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'NONE';
Variable columns per row - about 40 MB of data.
5. what client library do you use to access Cassandra? (Hector?). Is
your client code single threaded?
Hector - yes, the processing side of the client is single-threaded, but
it is largely waiting for Cassandra responses and has plenty of CPU headroom.
I guess what I'm most interested in is the discrepancy between read and
write latency - although I understand the data volume is much larger for
reads, even though the request rate is lower.
Network usage on a Cassandra box barely gets above 20 Mbit/s, including
inter-node comms; it averages 5 Mbit/s between client and Cassandra.
There is near-zero disk I/O, and what little there is is served in under 1 ms.
Storage is backed by a very fast SAN, but as I said earlier, the dataset
just about fits in the Linux disk cache. It's a 2 GB VM with a 512 MB
Cassandra heap - GCs are nice and quick, there are no JVM memory
problems, and used heap oscillates between 280 and 350 MB.
Basically, I'm just puzzled because Cassandra doesn't behave as I would
expect: huge CPU use for very little throughput. I'm struggling to find
anything wrong with the environment - there's no bottleneck that I can see.
thanks
James M
On Fri, Dec 21, 2012 at 7:27 AM, James Masson <james.mas...@opigram.com> wrote:
Hi list-users,
We have an application with a relatively unusual access pattern in
Cassandra 1.1.6.
Essentially, we read an entire multi-hundred-megabyte column family
sequentially (little chance of a Cassandra cache hit), perform some
operations on the data, and write the data back to another column
family in the same keyspace.
We do about 250 writes/sec and 100 reads/sec during this process.
Write request latency is about 900 microseconds; read request latency
is about 4000 microseconds.
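A quick back-of-the-envelope check on those numbers (a sketch; it assumes the single-threaded client issues requests one at a time and the latencies are mean values):

```python
# Time per wall-clock second the client spends waiting on Cassandra,
# at the observed request rates and mean request latencies.
writes_per_sec, write_latency_s = 250, 900e-6
reads_per_sec, read_latency_s = 100, 4000e-6

busy_fraction = (writes_per_sec * write_latency_s
                 + reads_per_sec * read_latency_s)
print(busy_fraction)  # 0.625 -> ~62% of the client's time spent waiting
```

If those latencies are accurate, a single-threaded caller is already more than 60% occupied just waiting on responses, which would cap throughput well before disk or network become a bottleneck.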
* First Question: Do these numbers make sense?
Read-request latency seems a little high to me; Cassandra hasn't had a
chance to cache this data, but it's likely in the Linux disk cache,
given the sizing of the node, data, and JVM.
thanks
James M