Hi folks,

Perhaps this is a question better addressed to the Cassandra developers directly, but I thought I'd ask it here first. We've recently been benchmarking certain uses of secondary indexes in Cassandra 2.1.x, and we've noticed that once the number of items in an index grows past some threshold (perhaps several tens of thousands of entries, depending on the cardinality), performance begins to degrade substantially. This is particularly the case when the client does things it probably shouldn't do (like manually paginating results), but we suspect there's at least one issue in Cassandra itself having an impact here that we'd like to understand better.
Our investigation led us to the logic Cassandra uses to paginate scans of rows in indexes on composites. The issue seems to be the short algorithm Cassandra uses to select the size of the pages for the scan, partially given on the following two lines (from o.a.c.db.index.composites.CompositesSearcher):

    int meanColumns = Math.max(index.getIndexCfs().getMeanColumns(), 1);
    int rowsPerQuery = Math.max(Math.min(filter.maxRows(), filter.maxColumns() / meanColumns), 2);

The value computed for rowsPerQuery appears to be the page size. Based on our reading of the code, unless the value obtained for meanColumns is very small, a large query-level page size is used, or the DISTINCT keyword is used, the value of (filter.maxColumns() / meanColumns) always ends up being small enough that the page size is clamped to 2 (see the rough arithmetic in the P.S. below). This seems to be the case both for very low-cardinality indexes (two distinct indexed values) and for indexes with higher cardinalities, as long as the number of entries per index row is more than a few thousand.

The fact that we consistently get such a small page size appears to have a substantial impact on performance. The overhead is simply devastating, especially since the pages appear to overlap with each other (the last element of one page is the first element of the next), so with a page size of 2 the scan effectively advances by only one index entry per query. To wit: if we fix the index page size in code to a very large number, index queries in our environment that previously took over two minutes to complete can finish in under ten seconds.

Some (but probably not this much) overhead might be acceptable if the algorithm is intended to achieve other worthy goals (safety?). What puzzles us is that we can't figure out what it is intended to do. We suspect the algorithm is simply buggy, but we'd like insight from knowledgeable parties before we draw that conclusion and try to find a different solution. Does anyone here have relevant experience with secondary indexes that might shed light on the design choice here? In particular, can anyone (perhaps the developers?) explain what this algorithm is intended to do and what we might do to safely get around this limitation?

Also (to the developers watching this list): is this the sort of question we should be addressing to the dev list directly?

Thanks,
SK
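
P.S. For concreteness, here is a rough sketch (not Cassandra code) of the arithmetic as we understand it. The numbers are made up but plausible for our workload: a query-level limit of 10,000 cells and an index row averaging 50,000 cells because many rows share the indexed value. The integer division drives the quotient to zero, and the page size is then clamped up to 2:

    // Hypothetical numbers only; variable names mirror the snippet quoted above.
    public class PageSizeSketch {
        public static void main(String[] args) {
            int maxRows = 10000;                   // stand-in for filter.maxRows()
            int maxColumns = 10000;                // stand-in for filter.maxColumns()
            int meanColumns = Math.max(50000, 1);  // stand-in for index.getIndexCfs().getMeanColumns()

            // Same expression as in CompositesSearcher:
            int rowsPerQuery = Math.max(Math.min(maxRows, maxColumns / meanColumns), 2);

            // 10000 / 50000 == 0 (integer division); min(10000, 0) == 0; max(0, 2) == 2
            System.out.println(rowsPerQuery);      // prints 2
        }
    }

If that reading is right, any index whose rows average more cells than the query's cell limit will always scan with a page size of 2.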