Yup Jeremiah, I learned a hard lesson about how Cassandra behaves when it runs out of disk space :-S. I didn't try compression, but when it ran out of disk space, or came close to running out, compaction would fail because it needs free space to create temporary data files.
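In case it helps anyone else, this is roughly what I keep an eye on now (just a minimal sketch; the path assumes the default /var/lib/cassandra/data data directory, so adjust it to whatever your data_file_directories points at):

    # How full is the data volume? Compaction can temporarily need roughly
    # as much free space as the SSTables it is merging.
    df -h /var/lib/cassandra/data

    # Per-column-family "Space used (live)" vs "Space used (total)"
    nodetool -h localhost cfstats

    # Any compactions currently pending or running
    nodetool -h localhost compactionstats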
I shall get a tattoo that says "keep it around 50%" -- this is a valuable tip. -- Y.

On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan <jeremiah.jor...@morningstar.com> wrote:

> Is that 80% with compression? If not, the first thing to do is turn on compression. Cassandra doesn't behave well when it runs out of disk space. You really want to try and stay around 50%; 60-70% works, but only if it is spread across multiple column families, and even then you can run into issues when doing repairs.
>
> -Jeremiah
>
> On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
>
> Thanks Aaron. Well, I guess it is possible the data files from supercolumns could've been reduced in size after compaction.
>
> This brings up yet another question. Say I am on a shoestring budget and can only put together a cluster with very limited storage space. The first iteration of pushing data into Cassandra would drive the disk usage up into the 80% range. As time goes by, there will be updates to the data, and many columns will be overwritten. If I just push the updates in, the disks will run out of space on all of the cluster nodes. What would be the best way to handle such a situation if I cannot buy larger disks? Do I need to delete the rows/columns that are going to be updated, do a compaction, and then insert the updates? Or is there a better way? Thanks
>
> -- Y.
>
> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> does cassandra 1.0 perform some default compression?
>>
>> No.
>>
>> The on-disk size depends to some degree on the workload.
>>
>> If there are a lot of overwrites or deletes you may have rows/columns that need to be compacted. You may have some big old SSTables that have not been compacted for a while.
>>
>> There is some overhead involved in the super columns: the super col name, the length of the name, and the number of columns.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>>
>> Actually, after I read an article on Cassandra 1.0 compression just now (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I am more puzzled. In our schema, we didn't specify any compression options -- does Cassandra 1.0 perform some default compression? Or is the data reduction purely because of the schema change? Thanks.
>>
>> -- Y.
>>
>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We are trying to estimate the amount of storage we need for a production Cassandra cluster. While I was doing the calculation, I noticed a very dramatic difference in the storage space used by Cassandra data files.
>>>
>>> Our previous setup consists of a single-node Cassandra 0.8.x with no replication; the data is stored using supercolumns, and the data files total about 534GB on disk.
>>>
>>> A few weeks ago, I put together a cluster consisting of 3 nodes running Cassandra 1.0 with a replication factor of 2, and the data is flattened out and stored using regular columns. The aggregated data file size is only 488GB (would be 244GB with no replication).
>>>
>>> This is a very dramatic reduction in terms of storage needs, and is certainly good news in terms of how much storage we need to provision.
>>> However, because of the dramatic reduction, I also would like to make sure it is absolutely correct before submitting it -- and also get a sense of why there was such a difference. I know Cassandra 1.0 does data compression, but does the schema change from supercolumn to regular column also help reduce storage usage? Thanks.
>>>
>>> -- Y.
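P.S. For anyone else following along, this is roughly how I understand turning compression on for an existing column family in 1.0, based on the blog post linked above -- just a sketch, not something I've run yet, and "MyKeyspace", "MyCF" and the chunk length are placeholders:

    # from cassandra-cli
    use MyKeyspace;
    update column family MyCF
        with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};

As far as I can tell, existing SSTables only become compressed as they are rewritten, so something like "nodetool scrub MyKeyspace MyCF" (or simply waiting for normal compaction) is needed before the on-disk savings actually show up.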