Thanks Aaron. Well, I guess it is possible that the data files from supercolumns could have been reduced in size after compaction.
This brings yet another question. Say I am on a shoestring budget and can
only put together a cluster with very limited storage space. The first
iteration of pushing data into Cassandra would drive the disk usage up into
the 80% range. As time goes by, there will be updates to the data, and many
columns will be overwritten. If I just push the updates in, the disks will
run out of space on all of the cluster nodes. What would be the best way to
handle such a situation if I cannot buy larger disks? Do I need to delete
the rows/columns that are going to be updated, do a compaction, and then
insert the updates? Or is there a better way?

Thanks

-- Y.

On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aa...@thelastpickle.com> wrote:

> does cassandra 1.0 perform some default compression?
>
> No.
>
> The on disk size depends to some degree on the work load.
>
> If there are a lot of overwrites or deletes you may have rows/columns that
> need to be compacted. You may have some big old SSTables that have not
> been compacted for a while.
>
> There is some overhead involved in the super columns: the super col name,
> the length of the name, and the number of columns.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>
> Actually, after I read an article on cassandra 1.0 compression just now
> (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression),
> I am more puzzled. In our schema, we didn't specify any compression
> options -- does cassandra 1.0 perform some default compression? Or is the
> data reduction purely because of the schema change? Thanks.
>
> -- Y.
>
> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>
>> Hi,
>>
>> We are trying to estimate the amount of storage we need for a production
>> cassandra cluster. While I was doing the calculation, I noticed a very
>> dramatic difference in the amount of storage space used by cassandra
>> data files.
>>
>> Our previous setup consisted of a single-node cassandra 0.8.x with no
>> replication; the data was stored using supercolumns, and the data files
>> totaled about 534GB on disk.
>>
>> A few weeks ago, I put together a cluster consisting of 3 nodes running
>> cassandra 1.0 with a replication factor of 2, and the data is flattened
>> out and stored using regular columns. The aggregated data file size is
>> only 488GB (it would be 244GB with no replication).
>>
>> This is a very dramatic reduction in storage needs, and is certainly
>> good news in terms of how much storage we need to provision. However,
>> because of the dramatic reduction, I also would like to make sure it is
>> absolutely correct before submitting it -- and also get a sense of why
>> there was such a difference. I know cassandra 1.0 does data compression,
>> but does the schema change from supercolumns to regular columns also
>> help reduce storage usage? Thanks.
>>
>> -- Y.
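P.S. To make my question above more concrete, here is roughly the kind of
maintenance I was picturing after each round of updates. This is only a
sketch -- the keyspace/column family names and host are placeholders, and I
would appreciate corrections if these steps are wrong for 1.0:

    # flush memtables so the recent updates are on disk as SSTables
    nodetool -h <node_host> flush MyKeyspace MyCF

    # force a major compaction so overwritten/deleted columns are merged away
    nodetool -h <node_host> compact MyKeyspace MyCF

    # check the on-disk space used by the column families before and after
    nodetool -h <node_host> cfstats

My understanding is that a major compaction needs enough free disk space to
write the merged SSTable before the old ones are removed, so I am worried
that running it when the disks are already around 80% full could itself be a
problem.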
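P.P.S. Regarding the compression question in the quoted thread: from the
DataStax blog post linked there, my reading is that compression in 1.0 has
to be turned on explicitly per column family, along the lines of the
following cassandra-cli statement (the column family name is just a
placeholder, and I believe existing SSTables only become compressed once
they are rewritten, e.g. by a scrub or by later compactions):

    update column family MyCF
      with compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};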