At the data modeling class at the Cassandra Summit, the instructor said that 
lots of small partitions are just fine. I’ve heard on this list that that is 
not true, and that it's better to cluster small partitions into fewer, larger 
partitions. Due to conflicting information on this issue, I’d be interested in 
hearing people’s opinions.

For the sake of discussion, let's compare two tables:

CREATE TABLE a (
id INT,
value INT,
PRIMARY KEY (id)
);

CREATE TABLE b (
bucket INT,
id INT,
value INT,
PRIMARY KEY ((bucket), id)
);

And let's say that bucket is computed as id / N. For analysis purposes, let's 
assume I have 100 million ids to store.
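To make the bucketing concrete, here's a minimal sketch of what I mean (the value of N is just a placeholder for illustration, not a recommendation):

```python
# Sketch of the bucketing scheme for table b.
# bucket = id // N groups N consecutive ids into one partition.
N = 1000  # hypothetical bucket width; the whole question is what to pick here

def bucket_for(id_: int, n: int = N) -> int:
    """Partition key for table b: integer division groups ids into buckets."""
    return id_ // n

# With 100 million ids and N = 1000, table b ends up with ~100,000
# partitions of ~1,000 rows each, versus table a's 100 million
# single-row partitions.
num_buckets = (100_000_000 + N - 1) // N  # -> 100_000
```

So every read or write against table b would compute the bucket from the id the same way, and the full primary key (bucket, id) still locates a single row.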

Table a is obviously going to have a larger bloom filter. That’s a clear 
negative.
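For a rough sense of how much larger: a bloom filter needs about -ln(p)/(ln 2)^2 bits per partition key, so its memory scales with partition count. Back-of-the-envelope (p = 0.01 is, if I remember right, Cassandra's default bloom_filter_fp_chance for size-tiered compaction; N = 1000 is again just a placeholder):

```python
import math

def bloom_bits_per_key(p: float) -> float:
    """Optimal bloom filter sizing: m/n = -ln(p) / (ln 2)^2 bits per key."""
    return -math.log(p) / (math.log(2) ** 2)

p = 0.01                      # assumed false-positive chance
bits = bloom_bits_per_key(p)  # ~9.6 bits per partition key

N = 1000                      # hypothetical bucket width
a_mb = 100_000_000 * bits / 8 / 1e6        # one partition per id
b_mb = (100_000_000 // N) * bits / 8 / 1e6 # one partition per bucket
# a_mb ≈ 120 MB of bloom filter vs b_mb ≈ 0.12 MB, a factor of N
```

That's only the bloom filter, of course; the partition index and key cache grow with partition count too.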

When I request a record, table a will have less data to load from disk, so that 
seems like a positive.

Table a will never have its columns scattered across multiple SSTables, but 
table b might. If I only want one row from a partition in table b, does 
fragmentation matter (I think probably not, but I’m not sure)?

It’s not clear to me which will fit more efficiently on disk, but I would guess 
that table a wins.

Smaller partitions mean sending less data during repair, but I suspect that 
computing the Merkle tree over more partitions adds more overhead; that's only 
a guess, though. Which one repairs more efficiently?

In your opinion, which one is best and why? If you think table b is best, what 
would you choose N to be?

Robert
