Re: Read operations resulting in a write?

Mike Mon, 17 Dec 2012 07:01:51 -0800

Thank you Aaron, this was very helpful.

Could it be an issue that this optimization does not really take effectuntil the memtable with the hoisted data is flushed? In my simpleexample below, the same row is updated and multiple selects of the samerow will result in multiple writes to the memtable. It seems it maybepossible (although unlikely) where, if you go from a write-mostly to aread-mostly scenario, you could get into a state where you are stuckrewriting to the same memtable, and the memtable is not flushed becauseit absorbs the over-writes. I can foresee this especially if you arereading the same rows repeatedly.

I also noticed from the codepaths that if Row caching is enabled, thisoptimization will not occur. We made some changes this weekend to makethis column family more suitable to row-caching and enabled row-cachingwith a small cache. Our initial results is that it seems to havecorrected the write counts, and has increased performance quite a bit.However, are there any hidden gotcha's there because this optimizationis not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503mentions a "compaction is behind" problem. Any history on that? Icouldn't find too much information on it.


Thanks,
-Mike

On 12/16/2012 8:41 PM, aaron morton wrote:

1) Am I reading things correctly?
Yes.
If you do a read/slice by name and more than min compaction levelnodes where read the data is re-written so that the next read usesfewer SSTables.
2) What is really happening here? Essentially minor compactions canoccur between 4 and 32 memtable flushes. Looking through the code,this seems to only effect a couple types of select statements (whenselecting a specific column on a specific key being one of them).During the time between these two values, every "select" statementwill perform a write.
Yup, only for readying a row where the column names are specified.
Remember minor compaction when using SizedTiered Compaction (thedefault) works on buckets of the same size.
Imagine a row that had been around for a while and had fragments inmore than Min Compaction Threshold sstables. Say it is 3 SSTables inthe 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5SSTable reads. If this row is read it will get hoisted back up.
But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1sttier it will not hoisted.
There are a few short circuits in the SliceByName read path. One ofthem is to end the search when we know that no other SSTables containcolumns that should be considered. So if the 4 columns you readfrequently are hoisted into the 1st bucket your reads will get handledby that one bucket.
It's not every select. Just those that touched more the min compactionsstables.
3) Is this desired behavior? Is there something else I should belooking at that could be causing this behavior?
Yes.
https://issues.apache.org/jira/browse/CASSANDRA-2503

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com
On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com<mailto:mthero...@yahoo.com>> wrote:
Hello,
We have an unusual situation that I believe I've reproduced, at leasttemporarily, in a test environment. I also think I see where thisissue is occurring in the code.
We have a specific column family that is under heavy read and writeload on a nightly basis. For the purposes of this description, I'llrefer to this column family as "Bob". During this nightlyprocessing, sometimes Bob is under very write load, other times it isvery heavy read load.
The application is such that when something is written to Bob, awrite is made to one of two other tables. We've witnessed asituation where the write count on Bob far outstrips the write counton either of the other tables, by a factor of 3->10. This is basedon the WriteCount available on the column family JMX MBean. We havenot been able to find where in our code this is happening, and wehave gone as far as tracing our CQL calls to determine that therelationship between Bob and the other tables are what we expect.
I brought up a test node to experiment, and see a situation where,when a "select" statement is executed, a write will occur.
In my test, I perform the following (switching between nodetool andcqlsh):
update bob set 'about'='coworker' where key='<hex key>';
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';
nodetool flush
Then, for a period of time (before a minor compaction occurs), aselect statement that selects specific columns will cause writes tooccur in the write count of the column family:
select about,changed,data from bob where key='<hex key>';

This situation will continue until a minor compaction is completed.

I went into the code and added some traces to CollationController.java:

   private ColumnFamily collectTimeOrderedData()
     {
         logger.debug("collectTimeOrderedData");

       ... <snip> ...

---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: 
" + cfs.getMinimumCompactionThreshold() );
             // "hoist up" the requested data into a more recent sstable
             if (sstablesIterated > cfs.getMinimumCompactionThreshold()
                 && !cfs.isCompactionDisabled()
                 && cfs.getCompactionStrategy() instanceof 
SizeTieredCompactionStrategy)
             {
                 RowMutation rm = new RowMutation(cfs.table.name, new 
Row(filter.key, returnCF.cloneMe()));
                 try
                 {
---> HERE             logger.debug( "Apply hoisted up row mutation" );     
                     // skipping commitlog and index updates is fine since 
we're just de-fragmenting existing data
                     Table.open(rm.getTable()).apply(rm, false, false);
                 }
                 catch (IOException e)
                 {
                     // log and allow the result to be returned
                     logger.error("Error re-writing read results", e);
                 }
             }
... <snip> ...
Performing the steps above, I see the following traces (in the testenvironment I decreased the minimum compaction threshold to make thiseasier to reproduce). After I do a couple of update/flush, I see thisin the log:
DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java(line 117) Scheduling a background task check for bob withSizeTieredCompactionStrategy
Then, until compaction occurs, I see (when performing a select):
DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java(line 86) Disseminating load info ...DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line1227) execute_cql_queryDEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line445) CQL statement type: SELECTDEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line653) Command/ConsistencyLevel isSliceByNamesReadCommand(table='open',key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,columnParent='QueryPath(columnFamilyName='bob',superColumnName='null', columnName='null')',columns=[about,changed,data,])/ONEDEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79)Blockfor is 1; setting up requests to /10.0.4.20DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line669) reading data locallyDEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line813) LocalReadRunnable reading SliceByNamesReadCommand(table='open',key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,columnParent='QueryPath(columnFamilyName='bob',superColumnName='null', columnName='null')',columns=[about,changed,data,])DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java(line 68) In get top level columns: classorg.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid:class org.apache.cassandra.db.marshal.BytesTypeDEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java(line 84) collectTimeOrderedData---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192CollationController.java (line 188) tables iterated: 4 Min compact: 2----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192CollationController.java (line 198) Apply hoisted up row mutationDEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395)applying mutation of row804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
The above traces will occur every time I repeat the above selectstatement.
Minor compaction doesn't start until a few minutes after the requestwas submitted above (note, this is an unloaded test node):
DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278IntervalNode.java (line 45) Creating IntervalNode from[Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
Once minor compaction occurs, the behavior around write count beingincremented stops, until more than the minimum compaction thresholdmemtables are flush to disk.
So, my questions are:

1) Am I reading things correctly?
2) What is really happening here? Essentially minor compactions canoccur between 4 and 32 memtable flushes. Looking through the code,this seems to only effect a couple types of select statements (whenselecting a specific column on a specific key being one of them).During the time between these two values, every "select" statementwill perform a write.
3) Is this desired behavior? Is there something else I should belooking at that could be causing this behavior?
We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.
Any help is appreciated,
Thanks,
-Mike

Re: Read operations resulting in a write?

Reply via email to