Hello, I am trying to understand how Cassandra reads data. I've been reading a lot, and here is what I understand so far. Can I get some feedback on the following claims? Which are right and which are wrong?
A) Upon opening an SSTable for reads, Cassandra samples one key in 100 to speed up disk access. Is that sampling rate configurable? What is the relationship between this sampling and the key cache?

B) So, assuming we have 200 keys in the table, the in-memory index will contain the on-disk positions of keys 0 and 100.

C) If I want to access the key at the 50th position in that table, Cassandra will seek to position 0 and then read the file sequentially from there until it finds the key, right? While it does that, does C* deserialize the rows it is reading, or does it just compare the keys' bytes and ignore the accompanying data?

D) Does the data for a key immediately follow the key in the file, e.g. [key0][data0][key1][data1]...?

Assuming a perfectly uniform random read pattern and no caches whatsoever (neither C*, nor the OS, nor the disks... nothing):

E) Since the sampling is 1%, we'll have to scan 50 keys in the file *on average* to get to the key we want.

G) Because we're scanning the file on disk, scanning those 50 keys actually means reading both the keys and their data rows. So, on average, retrieving one row requires reading 50 rows from disk, increasing I/O fifty-fold. The key cache stores the file position for the keys it contains, so it's a great way to cut down on these inefficiencies.

H) Going back to my previous example: if my key cache has a capacity of 100 keys, then I'll only have to scan the file for half of the requests.

Real world now: I am running a proof of concept on SSD drives, I have too much data to hold it all in memory, and I have some hotspots.

I) I wonder how best to allocate memory between the OS cache, key cache and row cache. Any suggestions? My read pattern is very "chunky": I never read a single row, but rather ranges of rows with column slices of varying sizes.

J) I've considered writing a partitioner that chunks rows together, so that queries for "close" rows go to the same replica on the ring.
Since the rows would have close keys, they would sit close together in the data files, which should improve OS cache efficiency. What do you think? Thanks for your insights!
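To make claims B, C and E concrete, here is a toy Python model of the lookup as I picture it. All names are my own and the "file" is just a sorted list of records (its index standing in for the on-disk offset); this is not Cassandra code.

```python
# Toy model of the sampled index and sequential scan (claims A-G).
SAMPLE_INTERVAL = 100  # claim A: one key in 100

def build_sampled_index(records):
    """records: sorted list of (key, data) pairs; the list position stands
    in for the on-disk offset. Keep every SAMPLE_INTERVAL-th key in memory."""
    return [(key, pos) for pos, (key, _) in enumerate(records)
            if pos % SAMPLE_INTERVAL == 0]

def lookup(sampled_index, records, wanted_key):
    """Seek to the nearest preceding sampled key, then scan forward, reading
    key+data pairs as it goes (claims C, D, G). Returns (data, rows_read)."""
    start = 0
    for key, pos in sampled_index:
        if key > wanted_key:
            break
        start = pos
    rows_read = 0
    for key, data in records[start:]:
        rows_read += 1
        if key == wanted_key:
            return data, rows_read
    return None, rows_read
```

With 200 integer keys, the sampled index holds keys 0 and 100 (claim B); looking up the key at position 50 reads 51 rows, and the average over all 200 keys is 50.5 rows, which matches the "~50 rows per read" estimate in claims E and G.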
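Here is a toy sketch of the chunking-partitioner idea from claim J, to show what I mean. The names and the chunk size are my own assumptions, and Cassandra's real partitioner API is different; the point is only that the ring token is derived from the chunk id rather than the individual key.

```python
# Toy sketch of claim J: keys in the same chunk get the same token,
# so they land on the same replica and sit close together on disk.
import hashlib

CHUNK_SIZE = 1000  # assumed number of "close" rows per chunk

def chunk_token(row_key: int) -> int:
    """Derive the ring token from the chunk id, not the row key itself."""
    chunk_id = row_key // CHUNK_SIZE
    digest = hashlib.md5(str(chunk_id).encode()).digest()
    return int.from_bytes(digest, "big")
```

For example, keys 0 and 999 fall in the same chunk and get the same token, while key 1000 starts a new chunk and (almost certainly) gets a different one.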