That spreadsheet doesn't take compression into account, which is very
important in my case. Uncompressed, my data is going to require a
petabyte of storage according to the spreadsheet. I am pretty sure I
won't get that much storage to play with.
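To make the numbers concrete, here is the back-of-envelope arithmetic (a quick Python sketch; the ~1000 bytes per point figure is just the petabyte estimate divided by the point count, not a number read off the spreadsheet itself):

    points = 10**12                    # one trillion data points
    compressed = points * 1            # ~1 byte per point after compression
    uncompressed = 10**15              # ~1 PB, per the spreadsheet

    print(compressed // 10**12, "TB compressed")      # -> 1 TB
    print(uncompressed // points, "bytes per point")  # -> 1000, implied overhead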
The spreadsheet also shows that Cassandra wastes an unbelievable amount of
space on compaction. My experiments with LevelDB, however, show that it is
possible for a write-optimized database to use negligible compaction
space. I am not sure how LevelDB does it. I guess it splits the larger
sstables into smaller chunks and merges them incrementally.
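To illustrate what I mean, here is a toy model of leveled compaction in Python (just my rough mental model, not LevelDB's actual code; the 1000-entry run size and the function names are made up):

    # Sstables are modelled as small, sorted runs of (key, value) pairs.
    # A compaction merges ONE run from level L with only the overlapping
    # runs in level L+1, so scratch space stays at a few run sizes no
    # matter how large the whole database grows.
    import heapq

    MAX_RUN = 1000  # entries per run; LevelDB uses ~2 MB files instead

    def overlaps(a, b):
        return a[0][0] <= b[-1][0] and b[0][0] <= a[-1][0]

    def compact_one(run, next_level):
        inputs = [run] + [r for r in next_level if overlaps(run, r)]
        merged = list(heapq.merge(*inputs))  # incremental sorted merge
        # Re-split the output into small runs; only the few inputs above
        # are ever held at once, which is why extra space is negligible.
        return [merged[i:i + MAX_RUN] for i in range(0, len(merged), MAX_RUN)]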
Anyway, does anybody know how densely I can store the data in
Cassandra when compression is enabled? Would I have to implement some
smart adaptive grouping to fit lots of records in one row, or is there a
simpler solution?
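By "smart adaptive grouping" I mean something along these lines (only a sketch of the idea; the 10000-point target and the (series_id, bucket_start) row key are made-up choices):

    # Pack points of one series into a wide row until it holds TARGET
    # points, then start a new row. Dense series get many short buckets,
    # sparse series get a few long ones, so row sizes stay comparable.
    TARGET = 10000  # points per wide row (arbitrary target)

    def group_points(series_id, points):
        """points: iterable of (timestamp, value), sorted by timestamp."""
        bucket_start, row = None, []
        for ts, value in points:
            if not row:
                bucket_start = ts
            row.append((ts, value))
            if len(row) == TARGET:
                yield (series_id, bucket_start), row
                row = []
        if row:
            yield (series_id, bucket_start), row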
On 4. 10. 2013 at 1:56, Andrey Ilinykh wrote:
It may help.
https://docs.google.com/spreadsheet/ccc?key=0Atatq_AL3AJwdElwYVhTRk9KZF9WVmtDTDVhY0xPSmc#gid=0
On Thu, Oct 3, 2013 at 1:31 PM, Robert Važan <robert.va...@gmail.com> wrote:
I need to store one trillion data points. The data is highly
compressible down to 1 byte per data point using simple custom
compression combined with standard dictionary compression. What's
the most space-efficient way to store the data in Cassandra? How
much per-row overhead is there if I store one data point per row?
The data is particularly hard to group. It's a large number of
time series with highly variable density. That makes it hard to
pack subsets of the data into meaningful column families / wide
rows. Is there a table layout scheme that would allow me to
approach 1 byte per data point without forcing me to implement a
complex abstraction layer at the application level?