Hi AJ,
Unfortunately, storage capacity planning is a bit of a guessing game.
Until you run your actual load against the cluster and profile the
usage, you just aren't going to know for sure. I have seen cases where
planning for 50% excess capacity per node was plenty, and I have seen
other extreme cases where 3x the planned capacity was not enough when
replica counts and entropy levels were high.
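To give you a feel for the arithmetic, here is a back-of-envelope
sketch. Every number in it is a made-up example, not a recommendation;
plug in your own figures:

    # Back-of-envelope sizing sketch -- all inputs are example values.
    total_data_gb      = 1000   # raw data set size, before replication
    replication_factor = 3
    node_count         = 10
    headroom           = 2.0    # compaction/obsolete-file headroom; I've seen
                                # anywhere from 1.5x to 3x be necessary

    data_per_node_gb = total_data_gb * replication_factor / float(node_count)
    disk_per_node_gb = data_per_node_gb * headroom
    print("Plan for roughly %.0f GB of disk per node" % disk_per_node_gb)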
Cassandra will _try_ to work within the resource restrictions that you
give it, but keep in mind that if it has excess disk space, it may be
lazier than you would expect about getting rid of the extra files that
are sitting around waiting to be deleted. You can tell which ones are
scheduled for deletion because they have a .compacted marker. If you
want to actually SEE this happen, use the stress.java or stress.py
tools and do several test runs with different workloads. I think
actually watching it happen would be enlightening for you.
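For example, a little script along these lines will show you which
sstables are waiting to be reclaimed, and how much space they are still
holding, while a stress run is going. It assumes the default data
directory and that the marker files end in "-Compacted"; check your own
data_file_directories setting and your version's exact marker name
before trusting it:

    import os, glob

    # Assumed default; adjust to data_file_directories in cassandra.yaml.
    DATA_DIR = "/var/lib/cassandra/data"

    reclaimable = 0
    # Marker name assumed to end in "Compacted"; verify on your version.
    for marker in glob.glob(os.path.join(DATA_DIR, "*", "*-Compacted")):
        prefix = marker[:-len("Compacted")]
        # The sstable components share the marker's prefix, e.g.
        # Standard1-f-123-Data.db, -Index.db, -Filter.db, ...
        for component in glob.glob(prefix + "*"):
            reclaimable += os.path.getsize(component)
        print("obsolete: %s" % os.path.basename(prefix))

    print("space waiting to be reclaimed: %.1f MB" % (reclaimable / 1048576.0))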
Lastly, while I have seen a few instances where people have chosen node
sizes in the tens of TB, that is an unusual case. Most node sizing I
have seen falls in the range of 20-250GB. That's not to say there
aren't workloads where many TB per node works, but if you're planning
to read the data you're writing, you do want to ensure that your
working set fits in memory.
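As a crude sanity check on that last point (again, every number below
is just an example):

    # Rough working-set vs. memory check -- example figures, not measurements.
    data_per_node_gb = 200      # data owned by one node
    hot_fraction     = 0.10     # portion of the data you read frequently
    ram_gb           = 32       # physical memory on the node
    heap_gb          = 8        # JVM heap; the remainder feeds the OS page cache

    working_set_gb = data_per_node_gb * hot_fraction
    page_cache_gb  = ram_gb - heap_gb

    if working_set_gb > page_cache_gb:
        print("Working set (~%.0f GB) won't fit in the page cache (~%.0f GB); "
              "expect reads to go to disk." % (working_set_gb, page_cache_gb))
    else:
        print("Working set should stay mostly memory-resident.")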
HTH,
Ben
On 6/7/11 9:14 AM, AJ wrote:
On 6/7/2011 2:29 AM, Maki Watanabe wrote:
You can find useful information in:
http://www.datastax.com/docs/0.8/operations/scheduled_tasks
sstables are immutable. Once an sstable is written to disk, it won't be
updated. When you take a snapshot, the tool makes hard links to the
sstable files. After some time, you will have had a number of memtable
flushes, so your sstable files will be merged and obsolete sstable
files will be removed. But the snapshot set will remain on your disk,
for backup.
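For what it's worth, you can confirm the hard-link behaviour yourself.
The sketch below assumes a snapshot already taken with nodetool
snapshot, the default data directory, and the layout where snapshots
live under <data dir>/<keyspace>/snapshots/<tag>/; adjust the paths for
your setup:

    import os, glob

    # Example paths -- adjust for your data_file_directories and keyspace name.
    LIVE     = "/var/lib/cassandra/data/Keyspace1"
    SNAPSHOT = os.path.join(LIVE, "snapshots")

    for snap in glob.glob(os.path.join(SNAPSHOT, "*", "*-Data.db")):
        live = os.path.join(LIVE, os.path.basename(snap))
        if os.path.exists(live) and os.path.samefile(snap, live):
            # Same inode: the snapshot costs no extra space for this file yet.
            print("still shared with live sstable: %s" % os.path.basename(snap))
        else:
            # The live copy was compacted away; only the snapshot holds the data.
            print("snapshot-only copy:             %s" % os.path.basename(snap))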
Thanks for the doc source. I will be experimenting with 0.8.0 since
it has many features I've been waiting for.
But still, if the snapshots don't link to all of the previous sets of
.db files, then those unlinked previous file sets MUST be safe to
delete manually. But they aren't deleted until later, after a GC.
It's a bit confusing why they are kept after compaction up until GC
when they don't seem to be needed. We have Big Data plans... one node
can have tens of TBs, so I'm trying to get an idea of how much disk
space will be required and whether or not I can free up some disk space.
Hopefully someone can still elaborate on this.
--
Ben Coverston
Director of Operations
DataStax -- The Apache Cassandra Company
http://www.datastax.com/