Hi,

Thanks for the reply.

On 21/12/12 14:36, Yiming Sun wrote:
I have a few questions for you, James,

1. how many nodes are in your Cassandra ring?

2 or 3, depending on the environment - it doesn't seem to make much difference to throughput. What is a 30-minute task on a 2-node environment is a 30-minute task on a 3-node environment.

2. what is the replication factor?

1

3. when you say sequentially, what do you mean?  what Partitioner do you
use?

The data is organised by date - the keys are read sequentially, in date order, and only once.

RandomPartitioner - the data is spread evenly across the nodes to avoid hotspots.
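To make the access pattern concrete, here's a rough Hector sketch of the read side against the 'entities' column family below - not our production code; the cluster/host names are placeholders, and dateKeys()/process() are hypothetical stand-ins for the key generation and processing steps:

import me.prettyprint.cassandra.serializers.BytesArraySerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class SequentialReader {

    public static void main(String[] args) {
        // single client thread, same as our app
        Cluster cluster = HFactory.getOrCreateCluster("app-cluster", "cassandra-host:9160");
        Keyspace ks = HFactory.createKeyspace("mykeyspace", cluster);

        // Keys are derived from dates and known in advance, so each row is
        // fetched directly, one at a time, in date order - no range scans,
        // which is why RandomPartitioner's token ordering doesn't matter here.
        for (String rowKey : dateKeys()) {
            SliceQuery<String, byte[], byte[]> q = HFactory.createSliceQuery(
                    ks, StringSerializer.get(),
                    BytesArraySerializer.get(), BytesArraySerializer.get());
            q.setColumnFamily("entities");
            q.setKey(rowKey);
            q.setRange(null, null, false, 1000); // rows only have a handful of columns
            QueryResult<ColumnSlice<byte[], byte[]>> result = q.execute();
            process(result.get());
        }
    }

    // hypothetical: yields the date-ordered row keys
    private static Iterable<String> dateKeys() { return java.util.Collections.emptyList(); }

    // hypothetical: the per-row processing and write-back step
    private static void process(ColumnSlice<byte[], byte[]> slice) { }
}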

4. how many columns per row?  how much data per row?  per column?

It varies - see the schemas below.

create keyspace mykeyspace
  with placement_strategy = 'SimpleStrategy'
  and strategy_options = {replication_factor : 1}
  and durable_writes = true;


create column family entities
  with column_type = 'Standard'
  and comparator = 'BytesType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'AsciiType'
  and read_repair_chance = 0.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 0
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = false
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'NONE'
  and column_metadata = [
    {column_name : '64656c65746564',
    validation_class : BytesType,
    index_name : 'deleted_idx',
    index_type : 0},
    {column_name : '6576656e744964',
    validation_class : TimeUUIDType,
    index_name : 'eventId_idx',
    index_type : 0},
    {column_name : '7061796c6f6164',
    validation_class : UTF8Type}];

2 columns per row here (the hex column names above decode to 'deleted', 'eventId' and 'payload') - about 200MB of data in total


create column family events
  with column_type = 'Standard'
  and comparator = 'BytesType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'TimeUUIDType'
  and read_repair_chance = 0.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 0
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = false
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'NONE';

1 column per row - about 300MB of data

create column family intervals
  with column_type = 'Standard'
  and comparator = 'BytesType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'AsciiType'
  and read_repair_chance = 0.0
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 0
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = false
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'NONE';

Variable columns per row - about 40MB of data.


5. what client library do you use to access Cassandra?  (Hector?).  Is
your client code single threaded?

Hector - yes, the processing side of the client is single-threaded, but it is largely waiting for Cassandra responses and has plenty of CPU headroom.
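If it's relevant, a back-of-the-envelope check using the rates and latencies from my original mail below: a single thread issuing requests serially would spend roughly

  250 writes/sec x 0.9 ms = ~225 ms
  100 reads/sec  x 4.0 ms = ~400 ms
                            -------
                            ~625 ms

of every wall-clock second just waiting on Cassandra - i.e. the client would be bound by per-request latency rather than cluster capacity, which would also fit with a third node making no difference to throughput.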


I guess what I'm most interested in is why there's such a discrepancy between read and write latency - although I understand the data volume is much larger for reads, even though the request rate is lower.

Network usage on a Cassandra box barely gets above 20Mbit/s, including inter-node traffic, and averages 5Mbit/s between client and Cassandra.

There is near-zero disk I/O, and what little there is is served in under 1 ms. Storage is backed by a very fast SAN, but as I said earlier, the dataset just about fits in the Linux disk cache. It's a 2GB VM with a 512MB Cassandra heap - GCs are nice and quick, there are no JVM memory problems, and used heap oscillates between 280 and 350MB.

Basically, I'm just puzzled because Cassandra doesn't behave as I would expect: huge CPU use in Cassandra for very little throughput. I'm struggling to find anything wrong with the environment - there's no bottleneck that I can see.
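Happy to post output from the usual 1.1 diagnostics if that would help, e.g. (host name is a placeholder):

  nodetool -h cassandra-host tpstats                        # thread-pool pending/blocked counts
  nodetool -h cassandra-host cfstats                        # per-CF read/write latencies
  nodetool -h cassandra-host cfhistograms mykeyspace events # latency/SSTable histograms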

thanks

James M

On Fri, Dec 21, 2012 at 7:27 AM, James Masson <james.mas...@opigram.com> wrote:


    Hi list-users,

    We have an application that has a relatively unusual access pattern
    in Cassandra 1.1.6.

    Essentially we read an entire multi-hundred-megabyte column family
    sequentially (little chance of a Cassandra cache hit), perform some
    operations on the data, and write the data back to another column
    family in the same keyspace.

    We do about 250 writes/sec and 100 reads/sec during this process.
    Write-request latency is about 900 microseconds; read-request latency
    is about 4000 microseconds.

    * First Question: Do these numbers make sense?

    Read-request latency seems a little high to me; Cassandra hasn't had
    a chance to cache this data, but it's likely in the Linux disk
    cache, given the sizing of the node/data/JVM.

    thanks

    James M

