Hello,

We have an unusual situation that I believe I've reproduced, at least 
temporarily, in a test environment.  I also think I see where this issue is 
occurring in the code.

We have a specific column family that is under heavy read and write load on a 
nightly basis.   For the purposes of this description, I'll refer to this 
column family as "Bob".  During this nightly processing, sometimes Bob is under 
very write load, other times it is very heavy read load.

The application is such that when something is written to Bob, a write is made 
to one of two other tables.  We've witnessed a situation where the write count 
on Bob far outstrips the write count on either of the other tables, by a factor 
of 3->10.  This is based on the WriteCount available on the column family JMX 
MBean.  We have not been able to find where in our code this is happening, and 
we have gone as far as tracing our CQL calls to determine that the relationship 
between Bob and the other tables are what we expect.

I brought up a test node to experiment, and see a situation where, when a 
"select" statement is executed, a write will occur.

In my test, I perform the following (switching between nodetool and cqlsh):

update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush

Then, for a period of time (before a minor compaction occurs), a select 
statement that selects specific columns will cause writes to occur in the write 
count of the column family:

select about,changed,data from bob where key='<hex key>';

This situation will continue until a minor compaction is completed.

I went into the code and added some traces to CollationController.java:

private ColumnFamily collectTimeOrderedData() { 
logger.debug("collectTimeOrderedData"); ... <snip> ... 

---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min 
compact: " + cfs.getMinimumCompactionThreshold() );
// "hoist up" the requested data into a more recent sstable if 
(sstablesIterated > cfs.getMinimumCompactionThreshold() && 
!cfs.isCompactionDisabled() && cfs.getCompactionStrategy() instanceof 
SizeTieredCompactionStrategy) { RowMutation rm = new 
RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe())); try {
---> HERE  logger.debug( "Apply hoisted up row mutation" );
// skipping commitlog and index updates is fine since we're just de-fragmenting 
existing data Table.open(rm.getTable()).apply(rm, false, false); } catch 
(IOException e) { // log and allow the result to be returned 
logger.error("Error re-writing read results", e); } } 
... <snip> ...

Performing the steps above, I see the following traces (in the test environment 
I decreased the minimum compaction threshold to make this easier to reproduce). 
 After I do a couple of update/flush, I see this in the log:

DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) 
Scheduling a background task check for bob with SizeTieredCompactionStrategy


Then, until compaction occurs, I see (when performing a select):

DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) 
Disseminating load info ...
DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) 
execute_cql_query
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL 
statement type: SELECT
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) 
Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', 
key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', 
columnName='null')', columns=[about,changed,data,])/ONE
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor 
is 1; setting up requests to /10.0.4.20
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading 
data locally
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) 
LocalReadRunnable reading SliceByNamesReadCommand(table='open', 
key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', 
columnName='null')', columns=[about,changed,data,])
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) 
In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter 
type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) 
collectTimeOrderedData
---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java 
(line 188) tables iterated: 4 Min compact: 2

----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java 
(line 198) Apply hoisted up row mutation
DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying 
mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342

The above traces will occur every time I repeat the above select statement.

Minor compaction doesn't start until a few minutes after the request was 
submitted above (note, this is an unloaded test node):

DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 
45) Creating IntervalNode from 
[Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...


Once minor compaction occurs, the behavior around write count being incremented 
stops, until more than the minimum compaction threshold memtables are flush to 
disk.

So, my questions are:

1) Am I reading things correctly?

2) What is really happening here?  Essentially minor compactions can occur 
between 4 and 32 memtable flushes.  Looking through the code, this seems to 
only effect a couple types of select statements (when selecting a specific 
column on a specific key being one of them). During the time between these two 
values, every "select" statement will perform a write.

3) Is this desired behavior?  Is there something else I should be looking at 
that could be causing this behavior?

We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
Any help is appreciated,
Thanks,
-Mike

Reply via email to