When you say merge cells, do you mean re-aggregating the data into coarser time buckets?

On Thu, Aug 4, 2016 at 5:59 AM Michael Burman <mibur...@redhat.com> wrote:

> Hi,
>
> Considering the following example structure:
>
> CREATE TABLE data (
>     metric text,
>     value double,
>     time timestamp,
>     PRIMARY KEY((metric), time)
> ) WITH CLUSTERING ORDER BY (time DESC)
>
> The natural inserting order is metric, value, timestamp pairs, one
> metric/value pair per second for example. That means creating more and more
> cells to the same partition, which creates a large amount of overhead and
> reduces the compression ratio of LZ4 & Deflate (LZ4 reaches ~0.26 and
> Deflate ~0.10 ratios in some of the examples I've run). Now, to improve
> compression ratio, how could I merge the cells on the actual Cassandra
> node? I looked at ICompress and it provides only byte-level compression.
>
> Could I do this on the compaction phase, by extending the
> DateTieredCompaction for example? It has SSTableReader/Writer facilities
> and it seems to be able to see the rows? I'm fine with the fact that repair
> run might have to do some conflict resolution as the final merged rows
> would be quite "small" (50kB) in size. The naive approach is of course to
> fetch all the rows from Cassandra - merge them on the client and send back
> to the Cassandra, but this seems very wasteful and has its own problems.
> Compared to table-LZ4 I was able to reduce the required size to 1/20th
> (context-aware compression is sometimes just so much better) so there are
> real benefits to this approach, even if I would probably violate multiple
> design decisions.
>
> One approach is of course to write to another storage first and once the
> blocks are ready, write them to Cassandra. But that again seems idiotic (I
> know some people are using Kafka in front of Cassandra for example, but
> that means maintaining yet another distributed solution and defeats the
> benefit of Cassandra's easy management & scalability).
>
> Has anyone done something similar? Even planned? If I need to extend
> something in Cassandra I can accept that approach also - but as I'm not
> that familiar with Cassandra source code I could use some hints.
>
> - Micke
>
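
To make the question concrete, here is a rough Python sketch of what "merging cells" could mean on the client side, i.e. the approach the quoted mail calls naive, not anything that runs inside Cassandra's compaction. It groups per-second (timestamp, value) points into one blob per time block and compresses each blob as a unit. The names BLOCK_SECONDS, merge_points and split_block, the one-hour block width, the binary layout and the use of zlib are all assumptions made for illustration; none of them come from the original message or from Cassandra.

```python
import struct
import zlib
from collections import defaultdict

BLOCK_SECONDS = 3600  # hypothetical block width: one merged cell per hour


def bucket_start(ts, width=BLOCK_SECONDS):
    """Round an epoch-second timestamp down to the start of its block."""
    return ts - (ts % width)


def merge_points(points, width=BLOCK_SECONDS):
    """Group (epoch_seconds, value) pairs into one compressed blob per block."""
    groups = defaultdict(list)
    for ts, value in points:
        groups[bucket_start(ts, width)].append((ts, value))

    blocks = {}
    for start, rows in groups.items():
        rows.sort()
        count = len(rows)
        # Layout (an assumption): row count, then offsets from the block
        # start as unsigned 32-bit ints, then the values as 64-bit doubles.
        offsets = struct.pack(f">{count}I", *(ts - start for ts, _ in rows))
        values = struct.pack(f">{count}d", *(v for _, v in rows))
        blocks[start] = zlib.compress(struct.pack(">I", count) + offsets + values)
    return blocks


def split_block(start, blob):
    """Inverse of merge_points for a single block."""
    raw = zlib.decompress(blob)
    (count,) = struct.unpack_from(">I", raw, 0)
    offsets = struct.unpack_from(f">{count}I", raw, 4)
    values = struct.unpack_from(f">{count}d", raw, 4 + 4 * count)
    return [(start + off, v) for off, v in zip(offsets, values)]
```

Each blob would then be written as a single cell keyed by (metric, block start) instead of one cell per second. Whether the same grouping can be done during compaction is exactly the open question in the quoted mail; this sketch does not answer it.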