Btw, I'm not trying to say that what you're asking for is a bad idea, or that it shouldn't or can't be done. If you're asking for a new feature, you should file a JIRA with all the details you provided above. Just keep in mind that it'll be a while before it ends up in a stable version. The advice on this ML will usually gravitate towards solving your problem with the tools that are available today, since "wait a year or so" is usually unacceptable.
https://issues.apache.org/jira/browse/cassandra/

On Fri, Aug 5, 2016 at 8:10 AM Jonathan Haddad <j...@jonhaddad.com> wrote:

> I think Duy Hai was suggesting Spark Streaming, which gives you the tools to build exactly what you asked for: a custom compression system for packing batches of values for a partition into an optimized byte array.
>
> On Fri, Aug 5, 2016 at 7:46 AM Michael Burman <mibur...@redhat.com> wrote:
>
>> Hi,
>>
>> For storing time series data, disk usage is quite a significant factor - time series applications generate a lot of data (and of course the newest data is the most important). Given that even DateTiered compaction was designed with these peculiarities of time series data in mind, wouldn't it make sense to also improve the storage efficiency? One of Cassandra 3.x's key improvements was the new storage engine - but it's still far from efficient with time series data.
>>
>> Efficient compression methods for both floating points & integers have a lot of research behind them and can be applied to time series data. I wish to apply these methods to improve storage efficiency - and performance*
>>
>> * In my experience, storing blocks of data and decompressing them on the client side, instead of letting Cassandra read more rows, improves performance by several times. Query patterns for time series data usually request a range of data (instead of a single datapoint).
>>
>> And I wasn't comparing Cassandra & Hadoop, but the combination of Spark + Cassandra + a distributed scheduler + other stuff vs. a Hadoop installation. At that point they are quite comparable in many cases, with the latter being easier to manage in the end. I don't want either for a simple time series storage solution, as I have no need for components other than data storage.
>>
>> - Micke
>>
>> ----- Original Message -----
>> From: "Jonathan Haddad" <j...@jonhaddad.com>
>> To: user@cassandra.apache.org
>> Sent: Friday, August 5, 2016 5:22:58 PM
>> Subject: Re: Merging cells in compaction / compression?
>>
>> Hadoop and Cassandra have very different use cases. If the ability to write a custom compression system is the primary factor in how you choose your database, I suspect you may run into some trouble.
>>
>> Jon
>>
>> On Fri, Aug 5, 2016 at 6:14 AM Michael Burman <mibur...@redhat.com> wrote:
>>
>> > Hi,
>> >
>> > Spark is an example of something I really don't want. It's resource heavy, it involves copying data, and it involves managing yet another distributed system. I would also need a distributed system to schedule the Spark jobs.
>> >
>> > Sounds like a nightmare just to implement a compression method. Might as well run Hadoop.
>> >
>> > - Micke
>> >
>> > ----- Original Message -----
>> > From: "DuyHai Doan" <doanduy...@gmail.com>
>> > To: user@cassandra.apache.org
>> > Sent: Thursday, August 4, 2016 11:26:09 PM
>> > Subject: Re: Merging cells in compaction / compression?
>> >
>> > Looks like you're asking for some sort of ETL on your C* data. Why not use Spark to compress that data into blobs, and a user-defined function to explode them when reading?
>> >
>> > On Thu, Aug 4, 2016 at 10:08 PM, Michael Burman <mibur...@redhat.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > No, I don't want to lose precision (if that's what you meant), but if you meant just storing the points in a larger bucket (which I could decompress either on the client side or the server side), that works for me.
>> > > To clarify, it could look like this:
>> > >
>> > > 04082016T230215.1234, value
>> > > 04082016T230225.4321, value
>> > > 04082016T230235.2563, value
>> > > 04082016T230245.1145, value
>> > > 04082016T230255.0204, value
>> > >
>> > > ->
>> > >
>> > > 04082016T230200 -> blob (that has all the points for this minute stored - no data is lost to aggregated avgs or sums or anything).
>> > >
>> > > That's acceptable. Of course the prettiest solution would be to keep this hidden from the client, so that while decompressing it would see the original rows (like with the byte[] compressors), but this is acceptable for my use case. If this is what you meant, then yes.
>> > >
>> > > - Micke
>> > >
>> > > ----- Original Message -----
>> > > From: "Eric Stevens" <migh...@gmail.com>
>> > > To: user@cassandra.apache.org
>> > > Sent: Thursday, August 4, 2016 10:26:30 PM
>> > > Subject: Re: Merging cells in compaction / compression?
>> > >
>> > > When you say merge cells, do you mean re-aggregating the data into coarser time buckets?
>> > >
>> > > On Thu, Aug 4, 2016 at 5:59 AM Michael Burman <mibur...@redhat.com> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > Consider the following example structure:
>> > > >
>> > > > CREATE TABLE data (
>> > > >     metric text,
>> > > >     value double,
>> > > >     time timestamp,
>> > > >     PRIMARY KEY((metric), time)
>> > > > ) WITH CLUSTERING ORDER BY (time DESC)
>> > > >
>> > > > The natural insertion order is metric/value/timestamp tuples, for example one metric/value pair per second. That means creating more and more cells in the same partition, which creates a large amount of overhead and reduces the compression ratio of LZ4 & Deflate (LZ4 reaches ~0.26 and Deflate ~0.10 in some of the examples I've run). Now, to improve the compression ratio, how could I merge the cells on the actual Cassandra node? I looked at ICompressor, and it provides only byte-level compression.
>> > > >
>> > > > Could I do this in the compaction phase, by extending DateTieredCompaction for example? It has SSTableReader/Writer facilities and it seems to be able to see the rows. I'm fine with the fact that a repair run might have to do some conflict resolution, as the final merged rows would be quite "small" (50 kB) in size. The naive approach is of course to fetch all the rows from Cassandra, merge them on the client, and send them back to Cassandra, but that seems very wasteful and has its own problems. Compared to table-level LZ4 I was able to reduce the required size to 1/20th (context-aware compression is sometimes just so much better), so there are real benefits to this approach, even if I would probably violate multiple design decisions.
>> > > >
>> > > > One approach is of course to write to another storage first and, once the blocks are ready, write them to Cassandra. But that again seems idiotic (I know some people are using Kafka in front of Cassandra for example, but that means maintaining yet another distributed solution and defeats the benefit of Cassandra's easy management & scalability).
>> > > >
>> > > > Has anyone done something similar? Even planned?
>> > > > If I need to extend something in Cassandra, I can accept that approach too - but as I'm not that familiar with the Cassandra source code, I could use some hints.
>> > > >
>> > > > - Micke
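
To make the packing idea above concrete, here is a rough sketch of the client-side variant discussed in this thread (pack one time bucket's samples per metric into a single blob cell, unpack on read). The table layout, class, and method names are purely illustrative and not an existing Cassandra or driver API, and plain Deflate stands in for the more specialized floating-point/integer codecs mentioned earlier:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical target table (names are illustrative only):
//
//   CREATE TABLE data_packed (
//       metric text,
//       bucket timestamp,   -- start of the minute/hour bucket
//       points blob,        -- packed samples for that bucket
//       PRIMARY KEY ((metric), bucket)
//   ) WITH CLUSTERING ORDER BY (bucket DESC);
public final class BucketPacker {

    /** A single sample: epoch millis + value (Java 16+ record). */
    public record Point(long timestampMillis, double value) {}

    /**
     * Pack all points of one time bucket into a compressed byte[].
     * Timestamps are stored as deltas from the bucket start, which
     * compresses far better than absolute values.
     */
    public static byte[] pack(long bucketStartMillis, List<Point> points) {
        ByteBuffer raw = ByteBuffer.allocate(4 + points.size() * 16);
        raw.putInt(points.size());
        for (Point p : points) {
            raw.putLong(p.timestampMillis() - bucketStartMillis); // small delta
            raw.putLong(Double.doubleToRawLongBits(p.value()));
        }
        raw.flip();

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw.array(), 0, raw.limit());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverse of pack(): inflate the blob and rebuild the original points. */
    public static List<Point> unpack(long bucketStartMillis, byte[] blob)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(blob);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {        // assumes a well-formed blob
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();

        ByteBuffer raw = ByteBuffer.wrap(out.toByteArray());
        int count = raw.getInt();
        List<Point> points = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            long ts = bucketStartMillis + raw.getLong();
            double value = Double.longBitsToDouble(raw.getLong());
            points.add(new Point(ts, value));
        }
        return points;
    }
}

On the write path the client buffers a bucket's worth of samples per metric, calls pack(), and inserts the result into the blob column keyed by (metric, bucket start); reads fetch a range of buckets and unpack() them locally, which matches the range-query access pattern described above.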