Hadoop and Cassandra have very different use cases. If the ability to write a custom compression system is the primary factor in how you choose your database, I suspect you may run into some trouble.
Jon

On Fri, Aug 5, 2016 at 6:14 AM Michael Burman <mibur...@redhat.com> wrote:

> Hi,
>
> Spark is an example of something I really don't want. It's resource
> heavy, it involves copying data, and it involves managing yet another
> distributed system. Actually, I would also need a distributed system to
> schedule the Spark jobs.
>
> Sounds like a nightmare just to implement a compression method. Might as
> well run Hadoop.
>
> - Micke
>
> ----- Original Message -----
> From: "DuyHai Doan" <doanduy...@gmail.com>
> To: user@cassandra.apache.org
> Sent: Thursday, August 4, 2016 11:26:09 PM
> Subject: Re: Merging cells in compaction / compression?
>
> Looks like you're asking for some sort of ETL on your C* data. Why not use
> Spark to compress those data into blobs and a User-Defined Function to
> explode them when reading?
>
> On Thu, Aug 4, 2016 at 10:08 PM, Michael Burman <mibur...@redhat.com>
> wrote:
>
> > Hi,
> >
> > No, I don't want to lose precision (if that's what you meant), but
> > storing them in a larger bucket (which I could decompress either on the
> > client side or the server side) would work. To clarify, it could look
> > like this:
> >
> > 04082016T230215.1234, value
> > 04082016T230225.4321, value
> > 04082016T230235.2563, value
> > 04082016T230245.1145, value
> > 04082016T230255.0204, value
> >
> > ->
> >
> > 04082016T230200 -> blob (that has all the points for this minute stored -
> > no data is lost to aggregated avgs or sums or anything).
> >
> > That's acceptable. Of course the prettiest solution would be to keep this
> > hidden from the client so it would still see the original rows while
> > decompressing (like with byte[] compressors), but this is acceptable for
> > my use case. If this is what you meant, then yes.
> >
> > - Micke
> >
> > ----- Original Message -----
> > From: "Eric Stevens" <migh...@gmail.com>
> > To: user@cassandra.apache.org
> > Sent: Thursday, August 4, 2016 10:26:30 PM
> > Subject: Re: Merging cells in compaction / compression?
> >
> > When you say merge cells, do you mean re-aggregating the data into
> > coarser time buckets?
> >
> > On Thu, Aug 4, 2016 at 5:59 AM Michael Burman <mibur...@redhat.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Consider the following example structure:
> > >
> > > CREATE TABLE data (
> > >     metric text,
> > >     value double,
> > >     time timestamp,
> > >     PRIMARY KEY ((metric), time)
> > > ) WITH CLUSTERING ORDER BY (time DESC)
> > >
> > > The natural insert order is metric, value, timestamp pairs, for example
> > > one metric/value pair per second. That means creating more and more
> > > cells in the same partition, which creates a large amount of overhead
> > > and reduces the compression ratio of LZ4 & Deflate (LZ4 reaches ~0.26
> > > and Deflate ~0.10 ratios in some of the examples I've run). Now, to
> > > improve the compression ratio, how could I merge the cells on the
> > > actual Cassandra node? I looked at ICompressor and it provides only
> > > byte-level compression.
> > >
> > > Could I do this in the compaction phase, by extending
> > > DateTieredCompactionStrategy for example? It has SSTableReader/Writer
> > > facilities and it seems to be able to see the rows. I'm fine with the
> > > fact that a repair run might have to do some conflict resolution, as
> > > the final merged rows would be quite "small" (50 kB) in size. The naive
> > > approach is of course to fetch all the rows from Cassandra, merge them
> > > on the client and send them back to Cassandra, but this seems very
> > > wasteful and has its own problems.
> > > Compared to table-level LZ4 I was able to reduce the required size to
> > > 1/20th (context-aware compression is sometimes just so much better), so
> > > there are real benefits to this approach, even if I would probably
> > > violate multiple design decisions.
> > >
> > > One approach is of course to write to another storage system first and,
> > > once the blocks are ready, write them to Cassandra. But that again
> > > seems idiotic (I know some people are using Kafka in front of Cassandra
> > > for example, but that means maintaining yet another distributed
> > > solution and defeats the benefit of Cassandra's easy management &
> > > scalability).
> > >
> > > Has anyone done something similar? Even planned it? If I need to extend
> > > something in Cassandra, I can accept that approach also - but as I'm
> > > not that familiar with the Cassandra source code, I could use some
> > > hints.
> > >
> > > - Micke
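A minimal client-side sketch of the minute-bucketing idea discussed in the thread above, essentially the "naive" fetch-merge-write-back variant done at write time rather than after the fact. Everything here is an assumption for illustration only: the bucketed table name (data_bucketed), the column names, the class name MinuteBucketCodec, and the (offset, value) wire format do not come from the thread. The blob stays opaque to CQL, so exploding it back into individual points has to happen in the application.

```java
// Hypothetical bucketed table this sketch assumes (not from the thread):
//
//   CREATE TABLE data_bucketed (
//       metric text,
//       minute timestamp,
//       points blob,
//       PRIMARY KEY ((metric), minute)
//   ) WITH CLUSTERING ORDER BY (minute DESC);
//
// Each point is encoded as a 4-byte millisecond offset within the minute plus
// an 8-byte double, so the original per-second timestamps and values survive
// round-tripping without any aggregation.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public final class MinuteBucketCodec {

    /** One raw sample: absolute timestamp in epoch millis plus a double value. */
    public record Point(long timestampMillis, double value) {}

    /** Packs all points belonging to one minute bucket into a single blob. */
    public static ByteBuffer pack(long minuteStartMillis, List<Point> points) {
        ByteBuffer buf = ByteBuffer.allocate(points.size() * (Integer.BYTES + Double.BYTES));
        for (Point p : points) {
            // Offset within the minute keeps the full timestamp recoverable.
            buf.putInt((int) (p.timestampMillis() - minuteStartMillis));
            buf.putDouble(p.value());
        }
        buf.flip();
        return buf;
    }

    /** Explodes a blob back into the original (timestamp, value) pairs. */
    public static List<Point> unpack(long minuteStartMillis, ByteBuffer blob) {
        List<Point> points = new ArrayList<>();
        while (blob.remaining() >= Integer.BYTES + Double.BYTES) {
            long ts = minuteStartMillis + blob.getInt();
            points.add(new Point(ts, blob.getDouble()));
        }
        return points;
    }
}
```

If buckets are only written once per minute (after the minute has closed), this avoids the read-modify-write cycle of merging already-inserted rows, at the cost of buffering the current minute on the client and losing per-cell visibility inside CQL.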