> > What if we have 10B rows in the column family? What sort of index do you > use > > that would only require one iop to find the row index block? > > basically what is described in sections 5.3 and 5.4 here: > http://labs.google.com/papers/bigtable.html >
Incorrect. Section 4 of the paper describes the SStable indexing strategy. I have to admit, I'm surprised "3 iops [per mumbleVariable]" is a satisfactory answer to this question. What should interest any real-world user of the system would be mumbleVariable. The 3 part is nearly meaningless. It's the constant you throw away in a complexity analysis. So let's come up with a definition for worst-case reads (aka "number of SStables that might be relevent for a particular key, and what sort of slow operations might need to be executed to get to the final and proper SStable") shall we? (This is not a complexity analysis, but a similar methodology) mumbleVariable = (((numRows / rowsPerSStable) + bloomFalsePositives + tombstones ) * 3) disk seeks plus one network operation. We might shorten those variables for readability: Mv = (((N/R)+B+T)*3)+Nw With this formula, we can already begin to formulate more useful answers to the question. If I have 10B rows in my CF, and I can fit 10k rows per SStable, and the SStables are spread across 5 nodes, and I have 1 bloom filter false positive and 1 tombstone and ask the wrong node for the key, then: Mv = (((2B/10k)+1+1)*3)+1 == ((200,000)+2)*3+1 == 300,007 iops to read a key. More updates and deletes will cause more bloomFalsePositives and tombstones, right? Now that I've done this analysis, I think network operations can be thrown away, since there will only ever be 1 (or maybe 2?) per request, not depending on numRows. What might also be interesting is an average-case analysis, where we give weightings to bloomFalsePositives (which should be very rare events) and then try to include them, since they should be proportional to numRows. Is there some indexing method I'm missing in there? It seems doing 300k iops to find a key in a large CF like that would be excessive. -- timeless(ness)