Re: data size difference between supercolumn and regular column

Jeremiah Jordan Sun, 01 Apr 2012 20:26:01 -0700

Is that 80% with compression?  If not, the first thing to do is turn on
compression.  Cassandra doesn't behave well when it runs out of disk space.
You really want to try to stay around 50%.  60-70% works, but only if the
data is spread across multiple column families, and even then you can run
into issues when doing repairs.


-Jeremiah


On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:

Thanks Aaron.  Well, I guess it is possible the data files from supercolumns
could have been reduced in size after compaction.

This brings up yet another question.  Say I am on a shoestring budget and can
only put together a cluster with very limited storage space.  The first
iteration of pushing data into cassandra would drive the disk usage up into
the 80% range.  As time goes by, there will be updates to the data, and many
columns will be overwritten.  If I just push the updates in, the disks will
run out of space on all of the cluster nodes.  What would be the best way to
handle such a situation if I cannot buy larger disks?  Do I need to delete the
rows/columns that are going to be updated, do a compaction, and then insert
the updates?  Or is there a better way?  Thanks

-- Y.

On Sat, Mar 31, 2012 at 3:28 AM, aaron morton 
<aa...@thelastpickle.com> wrote:
does cassandra 1.0 perform some default compression?
No.

The on disk size depends to some degree on the work load.

If there are a lot of overwrites or deletes you may have rows/columns that
need to be compacted. You may have some big old SSTables that have not been
compacted for a while.

There is some overhead involved in the super columns: the super col name, 
length of the name and the number of columns.
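
As a back-of-the-envelope sketch in Python, the per-super-column overhead
from those three items could be estimated as below.  The field widths (2
bytes for the name length, 4 bytes for the column count) are assumptions
for illustration, not values confirmed in this thread, and the workload
numbers are hypothetical.

```python
def supercolumn_header_bytes(name_len, name_length_field=2, count_field=4):
    """Estimate the header bytes one super column adds to its subcolumns.

    name_length_field and count_field are assumed widths, not values
    taken from the Cassandra source.
    """
    return name_length_field + name_len + count_field


# Hypothetical workload: 10 million super columns with 16-byte names.
total = 10_000_000 * supercolumn_header_bytes(16)
print(f"{total / 1e9:.2f} GB of super column headers")
```

Plug in the real super column count and average name length to see how much
of the difference this accounts for.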

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/03/2012, at 9:47 AM, Yiming Sun wrote:

Actually, after I read an article on cassandra 1.0 compression just now
(http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I am
more puzzled.  In our schema, we didn't specify any compression options.  Does
cassandra 1.0 perform some default compression?  Or is the data reduction
purely because of the schema change?  Thanks.

-- Y.

On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun 
<yiming....@gmail.com> wrote:
Hi,

We are trying to estimate the amount of storage we need for a production 
cassandra cluster.  While I was doing the calculation, I noticed a very 
dramatic difference in terms of storage space used by cassandra data files.

Our previous setup consisted of a single-node cassandra 0.8.x with no
replication.  The data was stored using supercolumns, and the data files
totaled about 534GB on disk.

A few weeks ago, I put together a cluster consisting of 3 nodes running 
cassandra 1.0 with replication factor of 2, and the data is flattened out and 
stored using regular columns.  And the aggregated data file size is only 488GB 
(would be 244GB if no replication).
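
For reference, the arithmetic behind these figures can be sketched in Python.
The sizes and replication factor are the ones quoted in this message; the
variable names are illustrative only.

```python
# Figures quoted in this message, in GB.
old_single_node_gb = 534     # cassandra 0.8.x, supercolumns, no replication
new_cluster_total_gb = 488   # cassandra 1.0, regular columns, all 3 nodes
replication_factor = 2

# Size of a single copy of the data on the new cluster.
new_unreplicated_gb = new_cluster_total_gb / replication_factor

# Fractional reduction of one copy relative to the old layout.
reduction = 1 - new_unreplicated_gb / old_single_node_gb

print(f"unreplicated size: {new_unreplicated_gb:.0f} GB")
print(f"reduction vs. old setup: {reduction:.1%}")
```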

This is a very dramatic reduction in terms of storage needs, and is certainly 
good news in terms of how much storage we need to provision.  However, because 
of the dramatic reduction, I also would like to make sure it is absolutely 
correct before submitting it, and also get a sense of why there was such a
difference.  I know cassandra 1.0 does data compression, but does the schema
change from supercolumn to regular column also help reduce storage usage?  
Thanks.

-- Y.
