RE: cassandra disk usage

Stu Hood Mon, 30 Aug 2010 09:22:17 -0700

Also, see: https://issues.apache.org/jira/browse/CASSANDRA-1207

-----Original Message-----
From: "Terje Marthinussen" <tmarthinus...@gmail.com>
Sent: Monday, August 30, 2010 6:58am
To: dev@cassandra.apache.org
Subject: cassandra disk usage

Hi,

Was just looking at an SSTable file after loading a dataset. The data load
has no updates of data, but:
- Columns can in some rare cases be added to existing super columns
- SuperColumns will be added to the same key (but not overwriting existing
data). I batch these, but it is quite likely that there will be 2-3 updates
to a key.

This is a randomly selected SSTable file from a much bigger dataset.

The data is stored as date(super)/type(column)/value
Date is a simple "20100811" type string.
Value is a small integer, 2 digits on average.

If I run a simple strings on the SSTable and look for the data:
value: 692 KB of data
type: 4.01 MB of data
date: 4.6 MB of data

In total: 9.4 MB

The size of the .db file however, is 36.4MB...

The expansion from the column headers is bad enough, but I can somehow
accept that.
The almost 4x expansion on top of that is a bit harder to justify...
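
As a quick sanity check on the figures above, here is a minimal Python sketch of the arithmetic. The variable and function names are illustrative only, and it assumes 692 KB can be treated as 0.692 MB. Summing the three components gives a total slightly below the quoted 9.4 MB, and the ratio to the .db file size comes out just under 4x either way.

```python
# Sizes reported in the message, in megabytes.
# Assumption: 692 KB is treated as 0.692 MB (decimal units).
payload_mb = {"value": 0.692, "type": 4.01, "date": 4.6}
db_file_mb = 36.4  # size of the .db file on disk


def expansion_ratio(payload, file_mb):
    """Return (total payload in MB, file size divided by that total)."""
    total = sum(payload.values())
    return total, file_mb / total


total_mb, ratio = expansion_ratio(payload_mb, db_file_mb)
print(f"payload: {total_mb:.2f} MB, file is {ratio:.2f}x larger")
```

The ratio only says how much larger the file is than the extracted strings. It does not say where the extra bytes come from.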

Does anyone already know where this expansion comes from? Or do I need to
take a careful look at the source (probably useful anyway :))

Terje
