Checked tpstats; there are very few dropped messages. Checked cfhistograms; mostly nothing surprising there: the vast majority of rows are small, and most reads only access one or two SSTables.
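For reference, those numbers came from roughly the following checks, run against each node in turn (the host and keyspace names below are just placeholders for ours):

    # dropped messages and thread-pool backlog
    nodetool -h <node> tpstats

    # SSTables-per-read and latency distributions for this CF
    nodetool -h <node> cfhistograms <keyspace> conversation_text_message

    # per-CF counters (read count/latency, bloom filter false positives)
    nodetool -h <node> cfstats

    # per-device throughput, to compare nodes against each other
    iostat -x 5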
What I did discover is that of our 5 nodes, one is performing well, with disk I/O in the ballpark of what seems reasonable. The other 4 nodes are doing roughly 4x the disk I/O per second. Interestingly, the node that is performing well also seems to be servicing about twice the number of reads that the other nodes are. I compared the configuration of the node that is performing well against the others, and so far haven't found any discrepancies.

On Fri, Mar 22, 2013 at 10:43 AM, Wei Zhu <wz1...@yahoo.com> wrote:

> According to your cfstats, read latency is over 100 ms, which is really, really slow. I am seeing less than 3 ms reads for my cluster, which is on SSD. Can you also check nodetool cfhistograms? It tells you more about the number of SSTables involved and the read/write latency. Sometimes the average doesn't tell you the whole story.
>
> Also check your nodetool tpstats: are there a lot of dropped reads?
>
> -Wei
>
> ----- Original Message -----
> From: "Jon Scarborough" <j...@fifth-aeon.net>
> To: user@cassandra.apache.org
> Sent: Friday, March 22, 2013 9:42:34 AM
> Subject: Re: High disk I/O during reads
>
> Key distribution across SSTables probably varies a lot from row to row in our case. Most reads would probably only need to look at a few SSTables; a few might need to look at more.
>
> I don't yet have a deep understanding of C* internals, but I would imagine even the more expensive use cases would involve something like this:
>
> 1) Check the index for each SSTable to determine if part of the row is there.
> 2) Look at the endpoints of the slice to determine if the data in a particular SSTable is relevant to the query.
> 3) Read the chunks of those SSTables, working backwards from the end of the slice until enough columns have been read to satisfy the limit clause in the query.
>
> So I would have guessed that even the more expensive queries on wide rows typically wouldn't need to read more than a few hundred KB from disk to do all that. Seems like I'm missing something major.
>
> Here's the complete CF definition, including compression settings:
>
> CREATE COLUMNFAMILY conversation_text_message (
>   conversation_key bigint PRIMARY KEY
> ) WITH
>   comment='' AND
>   comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND
>   read_repair_chance=0.100000 AND
>   gc_grace_seconds=864000 AND
>   default_validation=text AND
>   min_compaction_threshold=4 AND
>   max_compaction_threshold=32 AND
>   replicate_on_write=True AND
>   compaction_strategy_class='SizeTieredCompactionStrategy' AND
>   compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
>
> Much thanks for any additional ideas.
>
> -Jon
>
> On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean <dean.hil...@nrel.gov> wrote:
>
> Did you mean to ask "are 'all' your keys spread across all SSTables"? I am guessing at your intention.
>
> I mean, I would very well hope my keys are spread across all SSTables, or otherwise that SSTable should not be there, as it has no keys in it ;).
>
> And I know we had HUGE disk size from the duplication in our SSTables on size-tiered compaction… we never ran a major compaction, but after we switched to LCS we went from 300G to some 120G or something like that, which was nice.
> We only have 300 data-point posts per second, so not an extreme write load on 6 nodes, though these posts do cause reads to check authorization and such in our system.
>
> Dean
>
> From: Kanwar Sangha <kan...@mavenir.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, March 22, 2013 8:38 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: RE: High disk I/O during reads
>
> Are your keys spread across all SSTables? That will cause every SSTable to be read, which will increase the I/O.
>
> What compaction are you using?
>
> From: zod...@fifth-aeon.net [mailto:zod...@fifth-aeon.net] On Behalf Of Jon Scarborough
> Sent: 21 March 2013 23:00
> To: user@cassandra.apache.org
> Subject: High disk I/O during reads
>
> Hello,
>
> We've had a 5-node C* cluster (version 1.1.0) running for several months. Up until now we've mostly been writing data, but now we're starting to service more read traffic. We're seeing far more disk I/O to service these reads than I would have anticipated.
>
> The CF being queried consists of chat messages. Each row represents a conversation between two people. Each column represents a message. The column key is composite, consisting of the message date and a few other bits of information. The CF is using compression.
>
> The query is looking for a maximum of 50 messages between two dates, in reverse order. Usually the two dates used as endpoints are 30 days ago and the current time. The query in Astyanax looks like this:
>
> ColumnList<ConversationTextMessageKey> result = keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
>     .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
>     .getKey(conversationKey)
>     .withColumnRange(
>         textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
>         textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
>         true,
>         maxMessages)
>     .execute()
>     .getResult();
>
> We're currently servicing around 30 of these queries per second.
>
> Here's what the cfstats for the CF look like:
>
> Column Family: conversation_text_message
> SSTable count: 15
> Space used (live): 211762982685
> Space used (total): 211762982685
> Number of Keys (estimate): 330118528
> Memtable Columns Count: 68063
> Memtable Data Size: 53093938
> Memtable Switch Count: 9743
> Read Count: 4313344
> Read Latency: 118.831 ms.
> Write Count: 817876950
> Write Latency: 0.023 ms.
> Pending Tasks: 0
> Bloom Filter False Positives: 6055
> Bloom Filter False Ratio: 0.00260
> Bloom Filter Space Used: 686266048
> Compacted row minimum size: 87
> Compacted row maximum size: 14530764
> Compacted row mean size: 1186
>
> On the C* nodes, iostat output like this is typical, and can spike to be much worse:
>
> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>            1.91   0.00     2.08    30.66    0.50  64.84
>
> Device:     tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdap1     0.13         0.00         1.07          0         16
> xvdb     474.20     13524.53        25.33     202868        380
> xvdc     469.87     13455.73        30.40     201836        456
> md0      972.13     26980.27        55.73     404704        836
>
> Any thoughts on what could be causing read I/O to the disk from these queries?
>
> Much thanks!
>
> -Jon
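P.S. A back-of-the-envelope check on the numbers quoted above: each node is reading roughly 27,000 kB/s from md0 while the cluster serves about 30 of these queries per second, so even ignoring caching and replication that works out to something like 900 kB read from disk per query on each node, versus the few hundred KB per query I would have expected in total:

    echo "scale=0; 26980.27 / 30" | bc    # roughly 899 kB read per query, per node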