According to your cfstats, read latency is over 100 ms, which is really slow. I 
am seeing less than 3 ms reads on my cluster, which is on SSDs. Can you also 
check nodetool cfhistograms? It tells you more about the number of SSTables 
involved in each read and the read/write latency distribution. Sometimes the 
average doesn't tell you the whole story.
Also check your nodetool tpstats: are there a lot of dropped reads?
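
For example (run on each node, substituting your own keyspace name):

    nodetool -h localhost cfhistograms <your_keyspace> conversation_text_message
    nodetool -h localhost tpstats

The SSTables column in cfhistograms shows how many SSTables recent reads had to 
touch, and tpstats lists the dropped message counts (including dropped READs).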

-Wei
----- Original Message -----
From: "Jon Scarborough" <j...@fifth-aeon.net>
To: user@cassandra.apache.org
Sent: Friday, March 22, 2013 9:42:34 AM
Subject: Re: High disk I/O during reads

Key distribution across SSTables probably varies a lot from row to row in our 
case. Most reads would probably only need to look at a few SSTables; a few might 
need to look at more.

I don't yet have a deep understanding of C* internals, but I would imagine even 
the more expensive use cases would involve something like this (see the sketch 
after this list): 

1) Check the index for each SSTable to determine if part of the row is there. 
2) Look at the endpoints of the slice to determine if the data in a particular 
SSTable is relevant to the query. 
3) Read the chunks of those SSTables, working backwards from the end of the 
slice until enough columns have been read to satisfy the limit clause in the 
query. 
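
To be concrete about what I'm imagining, here's a rough Java sketch of that 
loop. The SstableHandle type and its methods are made up purely for 
illustration (they are not Cassandra's actual internal classes), and this 
ignores details like memtables, tombstones, and merging columns across 
SSTables:

import java.util.ArrayList;
import java.util.List;

class SliceReadSketch {

    // Hypothetical stand-in for one SSTable on disk (not a real Cassandra class).
    interface SstableHandle {
        // Step 1: bloom filter / index lookup for the row key.
        boolean mightContainRow(long rowKey);
        // Step 2: does this SSTable's data for the row overlap the slice at all?
        boolean overlapsSlice(long rowKey, long startDate, long endDate);
        // Step 3: read columns in reverse order from the end of the slice.
        List<String> readColumnsReversed(long rowKey, long startDate, long endDate, int max);
    }

    static List<String> readSlice(List<SstableHandle> sstables, long rowKey,
                                  long startDate, long endDate, int limit) {
        List<String> columns = new ArrayList<String>();
        for (SstableHandle sstable : sstables) {
            if (!sstable.mightContainRow(rowKey)) {
                continue;                              // step 1: skip non-matching SSTables
            }
            if (!sstable.overlapsSlice(rowKey, startDate, endDate)) {
                continue;                              // step 2: skip SSTables outside the slice
            }
            // Step 3: pull only enough columns to satisfy the remaining limit.
            columns.addAll(sstable.readColumnsReversed(rowKey, startDate, endDate,
                    limit - columns.size()));
            if (columns.size() >= limit) {
                break;
            }
        }
        return columns;
    }
}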

So I would have guessed that even the more expensive queries on wide rows 
typically wouldn't need to read more than a few hundred KB from disk to do all 
that. Seems like I'm missing something major. 

Here's the complete CF definition, including compression settings: 

CREATE COLUMNFAMILY conversation_text_message (
  conversation_key bigint PRIMARY KEY
) WITH
  comment='' AND
  comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND
  read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write=True AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';

Much thanks for any additional ideas. 

-Jon 



On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean <dean.hil...@nrel.gov> wrote: 


Did you mean to ask "are 'all' your keys spread across all SSTables"? I am 
guessing at your intention. 

I mean, I would very much hope my keys are spread across all SSTables; 
otherwise that SSTable shouldn't be there at all, since it has no keys in it ;). 

And I know we had a HUGE disk footprint from the duplication in our SSTables 
under size-tiered compaction. We never ran a major compaction, but after we 
switched to LCS we went from about 300 GB to roughly 120 GB, which was nice. We 
only write around 300 data-point posts per second across 6 nodes, so not an 
extreme write load, though each of those posts triggers reads to check 
authorization and such in our system. 
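
If you want to try it on your conversation CF, the switch is just a schema 
change, something along these lines in CQL (from memory, so double-check the 
exact syntax against the 1.1 docs, and expect a burst of compaction I/O while 
the existing SSTables get re-leveled):

ALTER COLUMNFAMILY conversation_text_message
  WITH compaction_strategy_class='LeveledCompactionStrategy';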

Dean 

From: Kanwar Sangha <kan...@mavenir.com> 
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org> 
Date: Friday, March 22, 2013 8:38 AM 
To: "user@cassandra.apache.org" <user@cassandra.apache.org> 
Subject: RE: High disk I/O during reads 


Are your keys spread across all SSTables? That will cause every SSTable to be 
read, which will increase the I/O. 

What compaction strategy are you using? 

From: zod...@fifth-aeon.net On Behalf Of Jon Scarborough 

Sent: 21 March 2013 23:00 
To: user@cassandra.apache.org 


Subject: High disk I/O during reads 

Hello, 

We've had a 5-node C* cluster (version 1.1.0) running for several months. Up 
until now we've mostly been writing data, but now we're starting to service 
more read traffic. We're seeing far more disk I/O to service these reads than I 
would have anticipated. 

The CF being queried consists of chat messages. Each row represents a 
conversation between two people. Each column represents a message. The column 
key is composite, consisting of the message date and a few other bits of 
information. The CF is using compression. 

The query is looking for a maximum of 50 messages between two dates, in reverse 
order. Usually the two dates used as endpoints are 30 days ago and the current 
time. The query in Astyanax looks like this: 

ColumnList<ConversationTextMessageKey> result =
    keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
        .getKey(conversationKey)
        .withColumnRange(
            textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
            textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
            true,
            maxMessages)
        .execute()
        .getResult();
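
Nothing fancy happens with the result after that; it gets consumed roughly like 
this (an illustrative sketch of typical Astyanax usage, not our exact code):

for (Column<ConversationTextMessageKey> column : result) {
    ConversationTextMessageKey messageKey = column.getName(); // composite column name
    String messageBody = column.getStringValue();             // CF's default validator is text
    // ... hand the message off to the application
}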

We're currently servicing around 30 of these queries per second. 

Here's what the cfstats for the CF look like: 

Column Family: conversation_text_message 
    SSTable count: 15 
    Space used (live): 211762982685 
    Space used (total): 211762982685 
    Number of Keys (estimate): 330118528 
    Memtable Columns Count: 68063 
    Memtable Data Size: 53093938 
    Memtable Switch Count: 9743 
    Read Count: 4313344 
    Read Latency: 118.831 ms. 
    Write Count: 817876950 
    Write Latency: 0.023 ms. 
    Pending Tasks: 0 
    Bloom Filter False Positives: 6055 
    Bloom Filter False Ratio: 0.00260 
    Bloom Filter Space Used: 686266048 
    Compacted row minimum size: 87 
    Compacted row maximum size: 14530764 
    Compacted row mean size: 1186 

On the C* nodes, iostat output like this is typical, and can spike to be much 
worse: 

avg-cpu:  %user   %nice %system %iowait  %steal   %idle 
           1.91    0.00    2.08   30.66    0.50   64.84 

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn 
xvdap1            0.13         0.00         1.07          0         16 
xvdb            474.20     13524.53        25.33     202868        380 
xvdc            469.87     13455.73        30.40     201836        456 
md0             972.13     26980.27        55.73     404704        836 

Any thoughts on what could be causing read I/O to the disk from these queries? 

Much thanks! 

-Jon 

