Hello,

I am trying to understand the way Cassandra reads data. I've been reading a
lot, and here is what I understand so far.
Can I get some feedback on the following claims? Which are right and which
are wrong?

A) Upon opening an SSTable for read, Cassandra samples one key in 100 to
speed up disk access.
Is the sampling rate configurable?
What is the relationship between this sampling and the key cache?
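
From my reading, this looks like the index_interval setting in
cassandra.yaml, though please correct me if I have the wrong knob or the
wrong default:

    # cassandra.yaml -- how sparsely the SSTable index is sampled in memory
    index_interval: 128   # one key in 128 is sampled (the default, I believe)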

B) So, assuming an SSTable with 200 keys, the in-memory index will contain
the on-disk positions of keys 0 and 100.

C) If I want to access the key at the 50th position in that SSTable,
Cassandra will seek to the sampled position of key 0 and then read the file
sequentially from there until it finds the key, right?
While it does that, does C* deserialize the rows it is reading, or does it
just compare the keys' bytes and skip over the accompanying data?
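
To check that I'm picturing B and C correctly, here is the toy model I have
in my head (a sketch with my own names, not Cassandra's actual code):

    import java.util.Map;
    import java.util.TreeMap;

    // Toy model of a sampled SSTable index: every Nth key is kept in
    // memory with its on-disk offset; a read seeks to the closest sampled
    // key at or before the target and scans forward from there.
    class SampledIndexModel {
        private final TreeMap<Long, Long> sampled = new TreeMap<>();

        void addSample(long key, long fileOffset) {
            sampled.put(key, fileOffset);
        }

        // Offset where the sequential scan for 'key' should begin.
        long scanStartOffset(long key) {
            Map.Entry<Long, Long> floor = sampled.floorEntry(key);
            return floor == null ? 0L : floor.getValue();
        }
    }

With 200 keys and one sample every 100, the map would hold {0 -> offset of
key 0, 100 -> offset of key 100}, and a read of key 50 would start its
sequential scan at the offset of key 0.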

D) Does the data for a key immediately follow the key in the file?
Ex: [key0][data0][key1][data1]... ?

Assuming a perfectly uniform random read pattern and no caches whatsoever
(not Cassandra's, not the OS's, not the disks'... nothing):

E) Since the sampling rate is 1% (one key in 100), we'll have to scan 50
keys in the file *on average* to get to the key we want.

F) Because we're scanning the file on disk, scanning those 50 keys means
reading both the keys and their data rows from disk. So, on average,
retrieving one row requires reading 50 rows from the disk, a fifty-fold
increase in I/O.
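
Spelling out the arithmetic behind E and F (back-of-envelope, assuming the
1-in-100 sampling and the [key][data] layout from D):

    expected keys scanned per read = interval / 2 = 100 / 2 = 50
    data read per request ≈ 50 * (avg key size + avg row size)
    => read amplification ≈ 50x versus fetching the row directly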

The key cache stores the file position of each key it contains, so it's a
great way to cut down on these inefficiencies.
G) Going back to my previous example: if my key cache has a capacity of 100
keys, then I'll only have to scan the file for half of the requests.
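
The arithmetic I have in mind for G: with 200 keys, a 100-key cache and
uniform reads, the hit rate is 100/200 = 50%, so the average scan cost drops
from 50 rows to 0.5 * 50 = 25 rows per request.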

Real world now... I am running a proof of concept on SSD drives, and I have
too much data to hold in memory. I have some hotspots.
H) I wonder how best to allocate memory between the OS page cache, the key
cache and the row cache.
Any suggestions?
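
For reference, these are the knobs I believe are involved (cassandra.yaml
names from 1.1+, so please correct me if they have changed):

    # cassandra.yaml -- cache sizing
    key_cache_size_in_mb:     # empty = auto (min(5% of heap, 100MB), I think)
    row_cache_size_in_mb: 0   # 0 disables the row cache

plus whatever RAM is left over after the JVM heap goes to the OS page cache.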

My read pattern is very "chunky": I never read a single row, but ranges of
rows with column slices. The sizes vary.
I) I've considered writing a partitioner that chunks rows together so that
queries for "close" rows go to the same replica on the ring. Since the rows
have close keys, they will sit close together in the file, which should
improve OS cache efficiency. Roughly what I have in mind is sketched below.
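
A hand-wavy sketch of the idea (not a real partitioner; CHUNK_SIZE and the
hash are placeholders I made up; a real version would implement
org.apache.cassandra.dht.IPartitioner):

    // Derive the token from the row key's chunk, so all keys in one chunk
    // map to the same replica and end up adjacent in the SSTable.
    class ChunkingPartitionerSketch {
        static final long CHUNK_SIZE = 1024;  // rows per chunk -- made up

        static long tokenFor(long rowKey) {
            long chunkId = rowKey / CHUNK_SIZE;  // keys 0..1023 -> chunk 0, etc.
            return mix(chunkId);                 // one token per chunk
        }

        // Cheap stand-in for the MD5 hashing the RandomPartitioner uses.
        private static long mix(long x) {
            x ^= (x >>> 33);
            x *= 0xff51afd7ed558ccdL;
            x ^= (x >>> 33);
            return x;
        }
    }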
What do you think?

Thanks for your insights.
