Hi *,

We're experimenting with a 4-node Cassandra setup, aggregating input streams and persisting time-bucket-based aggregation results.

One test was to create "full dumps" of a table with around 32 million existing entries. The keyspace was created with a replication factor of 2 and the CF has 40 sstables across the four nodes.

The output of "describe table mykeyspace.mycf;":

--- cut here ---
CREATE TABLE mykeyspace.mycf (
    a timestamp,
    b bigint,
    c bigint,
    d bigint,
    e boolean,
    f text,
    g bigint,
    h text,
    i bigint,
    j text,
    k text,
    l bigint,
    m text,
    n text,
    o text,
    p text,
    q text,
    r bigint,
    PRIMARY KEY ((a, b, c), d, e, f)
) WITH CLUSTERING ORDER BY (d ASC, e ASC, f ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 0
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
--- cut here ---

Cassandra is at version 2.1.2.

The test program is written in Java and uses the DataStax Java driver 2.1.3. It issues the query "select * from mykeyspace.mycf;" as a SimpleStatement, reads each row (without further local processing), counts the number of rows returned, and measures the elapsed time.
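
For reference, a minimal sketch of what that scan loop looks like (class name and contact point are placeholders, not our actual code):

--- cut here ---
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class FullScanTest {
    public static void main(String[] args) {
        // "127.0.0.1" is a placeholder contact point
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        SimpleStatement stmt = new SimpleStatement("select * from mykeyspace.mycf;");
        stmt.setFetchSize(500); // paging size, see below

        long start = System.nanoTime();
        long count = 0;
        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {   // iterating transparently fetches further pages
            count++;           // no local processing, just counting
        }
        long elapsedMs = (System.nanoTime() - start) / 1000000L;

        System.out.printf("%d rows in %d ms (%.2f ms/row)%n",
                count, elapsedMs, (double) elapsedMs / count);
        cluster.close();
    }
}
--- cut here ---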

Across multiple invocations, the time per row averages about 10.8 milliseconds, which adds up to a very long total run time for the whole table.

When selecting a single column instead of all columns, the average time per row stays at a similar level.

What puzzles me most: during the query (with no other activity running on the Cassandra nodes), I see plenty of idle CPU and close to no iowait on all four nodes.

Depending on the value passed to SimpleStatement.setFetchSize(), we run into more or less (up to severe) GC trouble on the client, so we settled on a fetch size of 500 (1000 works, too), which keeps the GC stress level low. Looking at the Cassandra logs, I do see some GC activity (we're currently using G1GC), but the pauses are mostly around a few hundred milliseconds and not too frequent - so GC blocking should not be the issue. The JVMs seem to have plenty of memory left, according to jconsole.
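
As far as I understand the 2.1 driver, the fetch size is simply the automatic-paging page size, and the next page is requested only once the current one is exhausted - unless it is prefetched explicitly. A rough illustration of that API (not something we currently do, just how I read the driver docs):

--- cut here ---
// session and stmt as in the scan sketch above; illustration only.
long count = 0;
ResultSet rs = session.execute(stmt);
for (Row row : rs) {
    // When the current page is nearly drained, request the next one in the
    // background so the client doesn't sit idle between pages.
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched()) {
        rs.fetchMoreResults();
    }
    count++;
}
--- cut here ---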

I have tried running the query with tracing enabled, but the number of trace entries was overwhelming. Scanning the part that fit into the screen buffer, nothing obvious turned up.
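
For completeness, a sketch of how the trace for a single statement could be pulled through the driver instead of cqlsh, so it can at least be filtered locally - class and method names as I read them from the 2.1 driver docs, not tested code:

--- cut here ---
// Needs com.datastax.driver.core.QueryTrace; stmt and session as above.
stmt.enableTracing();
ResultSet rs = session.execute(stmt);
QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
System.out.printf("trace %s took %d us%n",
        trace.getTraceId(), trace.getDurationMicros());
for (QueryTrace.Event e : trace.getEvents()) {
    // filter locally, e.g. only show events above some elapsed-time threshold
    if (e.getSourceElapsedMicros() > 10000) {
        System.out.printf("%8d us  %s (%s)%n",
                e.getSourceElapsedMicros(), e.getDescription(), e.getSource());
    }
}
--- cut here ---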

Given that the servers do not seem to be resource-constrained, adding more servers will likely not improve the response time.

How could I shed some light on why this query does not drive the servers to their CPU or I/O limits?

Regards,
Jens

PS: Doing full scans seems to be an anti-pattern - selecting individual records from our tables works sufficiently fast ATM. We're looking into changes to the solution architecture so we can avoid dumping such big tables, but nevertheless I'm curious what limit we're hitting... what's restricting the flow of data?
