m1.xlarge (total ephemeral volume size 1.7 TB) is the most widely used
node configuration for Cassandra on EC2, according to this DataStax
presentation: http://www.slideshare.net/mattdennis/cassandra-on-ec2 .

That said, I'm going with 40 for sstable_size_in_mb. My logic is as follows:

we load 10+ GB of updates daily using sstableloader, and with a 24-node
cluster that is about 500 MB per node per day. We have around 5 large
column families, so the amount of newly flushed, unleveled SSTable data
per node per column family per day is around 100 MB on average.
Compaction triggers at least once every 2 days (when unleveled data per
CF exceeds 40 * 4 = 160 MB), so there should be about 32 unleveled
SSTables per column family per node on average, a manageable number. If
I reduce sstable_size_in_mb to 5 MB, I will have 8 times as many SSTable
files on disk (even though all of them will be leveled immediately after
flushing), and I don't know how that would impact I/O performance or the
number of file descriptors kept open for serving reads.
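
For what it's worth, here is roughly how I plan to apply it; a minimal
sketch, where my_keyspace / my_cf are placeholders for our actual names:

    # switch the column family to LCS with 40 MB sstables; run on any one
    # node, the schema change propagates to the rest of the cluster
    echo "ALTER TABLE my_keyspace.my_cf WITH compaction =
      {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 40};" | cqlsh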


On Wed, Sep 18, 2013 at 3:38 PM, Hiller, Dean <dean.hil...@nrel.gov> wrote:

> Sorry, bad bad typo... 300G is what I meant.
>
> Cassandra heavily advises staying under 1T per node or you run into big
> trouble, and most people stay under 500G per node.
>
> Later,
> Dean
>
> From: Jayadev Jayaraman <jdisal...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, September 18, 2013 1:30 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: What is the ideal value for sstable_size_in_mb when using
> LeveledCompactionStrategy?
>
> Thanks for the quick reply. We've already upped the ulimit as high as
> our Linux distro allows (around 1.8 million).
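>
> For the record, roughly how we did it; a sketch assuming Cassandra runs
> as the "cassandra" user on a stock Linux box (the user name and the
> values are ours, adjust per distro):
>
>     # raise the open-file (nofile) limits for the Cassandra user
>     echo 'cassandra soft nofile 1800000' | sudo tee -a /etc/security/limits.conf
>     echo 'cassandra hard nofile 1800000' | sudo tee -a /etc/security/limits.conf
>
>     # after restarting Cassandra, confirm the JVM picked the limit up
>     # (assumes a single Cassandra JVM on the box)
>     grep 'open files' /proc/$(pgrep -f CassandraDaemon)/limits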
>
> I have a follow-up question. I see that the size of individual nodes in
> your use case is quite massive. Does the safe number vary widely based
> on differences in underlying hardware, or would you say from experience
> that something around 50M for medium to large datasets (with upped
> file-descriptor limits) is safe for most medium-sized (1-5 TB per node)
> to high-end (hundreds of TB per node) hardware?
>
>
> On Wed, Sep 18, 2013 at 3:15 PM, Hiller, Dean <dean.hil...@nrel.gov
> <mailto:dean.hil...@nrel.gov>> wrote:
>  1.  Always up your file descriptor limits on Linux for Cassandra; even
> in 0.7 that was the recommendation, so Cassandra can open tons of files.
>  2.  We use 50M for our LCS with no performance issues. We had it at
> 10M on our previous cluster with no issues, but of course a huge number
> of files with our 300T per node.
>
> Dean
>
> From: Jayadev Jayaraman <jdisal...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, September 18, 2013 1:02 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: What is the ideal value for sstable_size_in_mb when using
> LeveledCompactionStrategy?
>
> We have set up a 24-node Cassandra cluster (m1.xlarge, 1.7 TB per node)
> on Amazon EC2:
>
> version=1.2.9
> replication factor = 2
> snitch=EC2Snitch
> placement_strategy=NetworkTopologyStrategy (with 12 nodes each in 2
> availability zones)
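>
> In CQL3 terms the keyspace was created along these lines (a sketch:
> "us-east" is the data-center name EC2Snitch derives from our region,
> and "analytics" is a placeholder for our keyspace name):
>
>     echo "CREATE KEYSPACE analytics WITH replication =
>       {'class': 'NetworkTopologyStrategy', 'us-east': 2};" | cqlsh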
>
> Background on our use case:
>
> We plan on using Hadoop with sstableloader to load 10+ GB of analytics
> data per day (100 million+ row keys, 5 or so columns per day on
> average). We have chosen LeveledCompactionStrategy in the hope that it
> constrains the number of SSTables that must be read to retrieve a slice
> predicate for a row. We don't want too many file descriptors (> 1000)
> held open on SSTables by the Cassandra JVM, as this has caused us
> network / unreachability issues before. We hit this on Cassandra 0.8.9
> with SizeTieredCompactionStrategy; to mitigate it, we ran minor
> compaction daily and major compaction semi-regularly to keep the number
> of SSTable files on disk as small as possible.
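>
> As an aside, this is how we keep an eye on that count (a sketch;
> assumes a single Cassandra JVM per node):
>
>     # total file descriptors the Cassandra process currently holds open
>     lsof -p $(pgrep -f CassandraDaemon) | wc -l
>
>     # just the SSTable data files among them
>     lsof -p $(pgrep -f CassandraDaemon) | grep -c 'Data.db'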
>
> If we use LeveledCompactionStrategy with a small value for
> sstable_size_in_mb (default = 5 MB), wouldn't that result in a very
> large number of SSTable files on disk? How does that affect the number
> of file descriptors held open (reading the docs, I get the impression
> that the number of SSTable seeks per query is reduced by a large
> margin)? But if we use a larger value for sstable_size_in_mb, say
> around 200 MB, there can be up to 4 * 200 = 800 MB of small uncompacted
> SSTables on disk per column family, to which file descriptors will
> inevitably be held open.
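>
> Whichever value we end up with, I assume we can watch the effect per
> node with something like this (cfstats prints a per-level SSTable
> breakdown for LCS column families, if I read the docs right):
>
>     # per-column-family SSTable counts, including the LCS level breakdown
>     nodetool cfstats | grep -i -e 'Column Family:' -e 'SSTable'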
>
> All in all, can someone help us figure out what we should set
> sstable_size_in_mb to? I figure it's not a very good idea to set it to
> a large value, but I don't know how things perform if we set it to a
> small value. Do we have to run major compaction regularly in this case
> too?
>
> Thanks
> Jayadev
