We have set up a 24-node Cassandra cluster on Amazon EC2 (m1.xlarge
nodes, 1.7 TB of storage per node):

version=1.2.9
replication factor = 2
snitch=EC2Snitch
placement_strategy=NetworkTopologyStrategy (12 nodes in each of 2
availability zones)
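
For concreteness, our keyspace definition looks roughly like the sketch
below (keyspace name is illustrative, and we assume a us-east-1
deployment; with EC2Snitch the EC2 region becomes the datacenter name
and the availability zone becomes the rack, so NetworkTopologyStrategy
spreads the two replicas across the AZs):

    -- hypothetical keyspace; 'us-east' is the DC name EC2Snitch
    -- derives from the region, and 2 is our replication factor
    CREATE KEYSPACE analytics
      WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 2};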

Background on our use case:

We plan to use Hadoop with sstableloader to load 10+ GB of analytics
data per day (100 million+ row keys, with 5 or so columns each per day
on average). We have chosen LeveledCompactionStrategy in the hope that
it constrains the number of SSTables that must be read to serve a slice
predicate for a row. We don't want the Cassandra JVM holding too many
open file handles (> 1000) to SSTables, as this has caused us network /
unreachability issues before. We ran into this on Cassandra 0.8.9 with
SizeTieredCompactionStrategy; to mitigate it, we ran minor compaction
daily and major compaction semi-regularly to keep the number of SSTable
files on disk as low as possible.
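
The column family in question is created with something like the
following (table and column names are hypothetical; the
sstable_size_in_mb value is the knob this question is about):

    -- hypothetical wide-row table for the daily analytics data
    CREATE TABLE analytics.daily_metrics (
        row_key     text,
        column_name text,
        value       text,
        PRIMARY KEY (row_key, column_name)  -- wide rows, sliced by column_name
    ) WITH compaction = {
        'class': 'LeveledCompactionStrategy',
        'sstable_size_in_mb': 5
    };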

If we use LeveledCompactionStrategy with a small value for
sstable_size_in_mb (default = 5 MB), wouldn't that result in a very
large number of SSTable files on disk? How does that affect the number
of open file handles? (Reading the docs, I get the impression that the
number of SSTable seeks per query is reduced by a large margin.) But if
we use a larger value for sstable_size_in_mb, say around 200 MB, there
will be up to 800 MB of small uncompacted SSTables on disk per column
family (4 SSTables x 200 MB before level-0 compaction kicks in), and
those files will inevitably have file handles open to them.
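
For a rough sense of scale (back-of-the-envelope, assuming the 1.7 TB
on each node eventually fills with data):

    1.7 TB per node /   5 MB per SSTable ~= 340,000 files per node
    1.7 TB per node / 200 MB per SSTable ~=   8,500 files per node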

All in all, can someone help us figure out what we should set
sstable_size_in_mb to? I figure it's not a very good idea to set it to
a larger value, but I don't know how things perform if we set it to a
small value. Do we have to run major compaction regularly in this case
too?

Thanks
Jayadev
