Hi folks,

Perhaps this is a question better addressed to the Cassandra developers 
directly, but I thought I'd ask it here first.  We've recently been 
benchmarking certain uses of secondary indexes in Cassandra 2.1.x, and 
we've noticed that once the number of items in an index grows beyond 
some threshold (perhaps several tens of thousands, depending on the 
cardinality), performance begins to degrade substantially.  This is 
particularly the case when the client does things it probably shouldn't 
do (like manually paginating results), but we suspect at least one 
issue in Cassandra itself is contributing here, and we'd like to 
understand it better.

Our investigation led us to the logic Cassandra uses to paginate scans 
of rows in indexes on composites.  The issue seems to lie in the short 
algorithm that selects the page size for the scan, given in part by 
the following two lines (from 
o.a.c.db.index.composites.CompositesSearcher):

private int meanColumns = Math.max(index.getIndexCfs().getMeanColumns(), 1);
private int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);

The value computed for rowsPerQuery appears to be the page size.

Based on our reading of the code, unless the value obtained for 
meanColumns is very small, a large query-level page size is used, or 
the DISTINCT keyword is used, the value for (filter.maxColumns() / 
meanColumns) always ends up being small enough that the page size is 
2.  This seems to be the case both for very low-cardinality indexes 
(two different indexed values) and for indexes with higher 
cardinalities as long as the number of entries per index row is more 
than a few thousand.
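
To illustrate the arithmetic (with made-up numbers, not measurements 
from our cluster): if the query's column limit is on the order of 
10,000 and an index row holds tens of thousands of entries, the integer 
division yields 0 and the outer Math.max clamps the page size to 2.  A 
standalone sketch of the same formula:

public class PageSizeSketch {
    // Mirrors the two lines quoted above; the arguments stand in for
    // filter.maxRows(), filter.maxColumns(), and
    // index.getIndexCfs().getMeanColumns() respectively.
    static int rowsPerQuery(int maxRows, int maxColumns, int rawMeanColumns) {
        int meanColumns = Math.max(rawMeanColumns, 1);
        return Math.max(Math.min(maxRows, maxColumns / meanColumns), 2);
    }

    public static void main(String[] args) {
        // Two indexed values, ~50,000 entries per index row.
        System.out.println(rowsPerQuery(1_000_000, 10_000, 50_000)); // 2
        // Higher cardinality, but still a few thousand entries per row.
        System.out.println(rowsPerQuery(1_000_000, 10_000, 5_000));  // 2
        // Only a small mean lets the page size grow.
        System.out.println(rowsPerQuery(1_000_000, 10_000, 10));     // 1000
    }
}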

The fact that we consistently get such a small page size appears to 
have a substantial impact on performance.  The overhead is simply 
devastating, especially since it looks like the pages are likely to 
overlap with each other (the last element of one page is the first 
element of the next).  To wit: if we fix the index page size in code to 
a very large number, index queries in our environment that previously 
required over two minutes to complete can finish in under ten seconds.
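
For a rough sense of scale (illustrative numbers, and assuming one row 
of overlap per page as described above): with a page size of 2, each 
page advances the scan by only one row, so reading 100,000 matching 
index entries takes on the order of 100,000 internal sub-queries, 
versus roughly a dozen with a 10,000-row page.  A back-of-the-envelope 
sketch:

public class OverlapSketch {
    // Rough estimate only: assumes each page after the first re-reads one
    // row from the previous page, so a page of size P advances by P - 1 rows.
    static long pagesNeeded(long matchingRows, int pageSize) {
        if (matchingRows <= pageSize)
            return 1;
        return 1 + (long) Math.ceil((matchingRows - pageSize) / (double) (pageSize - 1));
    }

    public static void main(String[] args) {
        System.out.println(pagesNeeded(100_000, 2));      // 99,999 sub-queries
        System.out.println(pagesNeeded(100_000, 10_000)); // 11 sub-queries
    }
}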

Some (but probably not this much) overhead might be acceptable if the 
algorithm is intended to achieve other worthy goals (safety?).  But 
what's puzzling to us is that we can't figure out what it's intended to 
do.  We suspect the algorithm is simply buggy, but we'd like insight 
from knowledgeable parties before we draw that conclusion and try to 
find a different solution.

Does anyone here have relevant experience with secondary indexes that 
might shed light on this design choice?  In particular, can anyone 
(perhaps the developers?) explain what this algorithm is intended to do 
and how we might safely get around this limitation?

Also (to the developers watching this list): is this the sort of 
question we should be addressing to the dev list directly?

Thanks,
SK
