Hi all,

My team is trying to decide whether to use size-tiered or leveled compaction for some tables in an app that will be moving to production soon. We're using C* 2.1.
We have a few tables that look like this:

    CREATE TABLE ks.counters (
        id timeuuid PRIMARY KEY,
        count counter
    );

The count for a given id is updated at rates of, say, 1/s to 100/s.

My question is: if we use size-tiered compaction (and ignoring memtables and the counter cache in this particular case), how many SSTables should I expect C* to read when I SELECT from this table? (Furthermore, I don't mean to make the question specific to counters -- suppose the column is a bigint instead.)

Naively, the updates leave a trail of outdated cells in multiple SSTables until they are compacted, and those will have to be inspected in some fashion to determine which cell has the most recent value.

>>Question<<: Does the read path:

1) sort the SSTables by timestamp in some fashion when reading, so that it
2) can ignore all other SSTables once it's found a more recent cell, given STCS?

I see some evidence suggesting this is true. For example, if (1) holds, an old wiki page suggests (2) is done by comparing the timestamp of the cell read from the first SSTable to the max timestamp of the second SSTable. Presumably flushes and compactions work such that the timestamp of the recent update would always be greater than the max timestamp of other SSTables that have that column.

Eventually I'll orient myself well enough in the C* source to answer this question for myself, but I haven't found clarity in books or web searches.

Now, I'd like to extend the question. Suppose I have:

    CREATE TABLE ks.more_counters (
        id timeuuid PRIMARY KEY,
        count_1 counter,
        count_2 counter
    );

And suppose there are a largish number of partitions with typically 0-10 updates for each counter, spaced relatively far apart in time. Suppose now C* finds the most recent update of count_1 in the first SSTable it looks at, but the most recent update of count_2 is in the 4th SSTable back.

>>Question<<: Will C* potentially have to seek to 4 SSTables in this case, or just 2?
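
To make both questions concrete, here is a toy model of the read path as I currently imagine it. This is only my guess at the behaviour described in (1) and (2), not code taken from the C* source: the SSTable class, the read_columns function and the sample timestamps are all made up for illustration, and it treats the columns as plain last-write-wins values (the bigint case), not as real counters.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SSTable:
    """Hypothetical stand-in for one SSTable, seen from a single partition."""
    max_timestamp: int  # newest write timestamp anywhere in this SSTable
    # column name -> (write timestamp, value) for the partition being read
    cells: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def read_columns(sstables: List[SSTable], wanted: List[str]):
    """Return ({column: (timestamp, value)}, number of SSTables inspected)."""
    # (1) Visit SSTables newest-first, ordered by their max timestamp.
    ordered = sorted(sstables, key=lambda s: s.max_timestamp, reverse=True)
    found: Dict[str, Tuple[int, int]] = {}
    inspected = 0
    for sst in ordered:
        # (2) Stop once every wanted column already has a cell that is newer
        # than anything this SSTable (or any later, older one) could contain.
        if len(found) == len(wanted) and all(
            ts > sst.max_timestamp for ts, _ in found.values()
        ):
            break
        inspected += 1
        for col in wanted:
            if col in sst.cells:
                ts, value = sst.cells[col]
                if col not in found or ts > found[col][0]:
                    found[col] = (ts, value)
    return found, inspected


# Assumed scenario: count_1 was last written in the newest SSTable,
# count_2 only exists in the 4th SSTable back.
tables = [
    SSTable(max_timestamp=40, cells={"count_1": (40, 7)}),
    SSTable(max_timestamp=30, cells={"count_1": (30, 6)}),
    SSTable(max_timestamp=20, cells={"count_1": (20, 5)}),
    SSTable(max_timestamp=10, cells={"count_1": (5, 1), "count_2": (10, 3)}),
]

one_col, seeks_one = read_columns(tables, ["count_1"])
two_cols, seeks_two = read_columns(tables, ["count_1", "count_2"])
```

In this model, reading only count_1 stops after the first SSTable, while reading both columns inspects all four, because nothing tells the loop that the two SSTables in between hold no count_2 cell. Note that the in-between SSTables do contain the partition key here, so a per-SSTable bloom filter on the key would not rule them out either.
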
Will C* have to actually seek to the SSTables in between the first (where the most recent update of count_1 lives) and the 4th (where the most recent update of count_2 lives)? Or is there an additional optimization (in-memory indices containing per-key column name information) that tells it "nothing to see here" when it's looking for count_2?

I'm looking in general to understand the read path better, but feel free to mention it if you think I'm putting too much emphasis on this very theoretical count of SSTable seeks for making the leveled/size-tiered decision.

Thanks, and I look forward to any answers!

--Richard