Btw, I'm not trying to say what you're asking for is a bad idea, or
shouldn't / can't be done.  If you're asking for a new feature, you should
file a JIRA with all the details you provided above.  Just keep in mind
it'll be a while before it ends up in a stable version.  The advice on this
ML will usually gravitate towards solving your problem with the tools that
are available today, as "wait a year or so" is usually unacceptable.

https://issues.apache.org/jira/browse/cassandra/

On Fri, Aug 5, 2016 at 8:10 AM Jonathan Haddad <j...@jonhaddad.com> wrote:

> I think Duy Hai was suggesting Spark Streaming, which gives you the tools
> to build exactly what you asked for: a custom compression system that
> packs batches of values for a partition into an optimized byte array.
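>
> A minimal sketch of what that packing could look like (purely
> illustrative - the class and method names here are made up, not an
> existing Cassandra or Spark API; a real codec would add delta/XOR
> encoding instead of writing raw fixed-width values):
>
> import java.nio.ByteBuffer;
> import java.util.List;
>
> final class Sample {
>     final long timestampMillis;
>     final double value;
>     Sample(long ts, double v) { timestampMillis = ts; value = v; }
>
>     // Pack one batch of samples for a single partition into a byte array
>     // that can then be written to Cassandra as a single blob cell.
>     static byte[] pack(List<Sample> batch) {
>         ByteBuffer buf = ByteBuffer.allocate(4 + batch.size() * 16);
>         buf.putInt(batch.size());               // sample count header
>         for (Sample s : batch) {
>             buf.putLong(s.timestampMillis);     // 8 bytes per timestamp
>             buf.putDouble(s.value);             // 8 bytes per value
>         }
>         return buf.array();
>     }
> }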
>
> On Fri, Aug 5, 2016 at 7:46 AM Michael Burman <mibur...@redhat.com> wrote:
>
>> Hi,
>>
>> For storing time series data, disk usage is quite a significant factor -
>> time series applications generate a lot of data (and of course the newest
>> data is the most important). Given that even DateTiered compaction was
>> designed with the peculiarities of time series data in mind, wouldn't it
>> make sense to improve storage efficiency as well? One of the key
>> improvements in Cassandra 3.x was the new storage engine - but it is
>> still far from efficient for time series data.
>>
>> Efficient compression methods for both floating-point and integer values
>> have a lot of research behind them and can be applied to time series
>> data. I wish to apply these methods to improve storage efficiency - and
>> performance*
>>
>> * In my experience, storing blocks of data and decompressing them on the
>> client side, instead of letting Cassandra read more rows, improves
>> performance several times over. Query patterns for time series data
>> usually request a range of data (instead of a single datapoint).
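>>
>> To make concrete what kind of integer compression I mean, here is a
>> minimal sketch (illustrative only, not Cassandra code): timestamps can be
>> delta + zig-zag + varint encoded, so regularly spaced samples shrink to a
>> byte or two each; values could be XOR-compressed in a similar spirit.
>>
>> import java.io.ByteArrayOutputStream;
>> import java.util.List;
>>
>> final class TimestampCodec {
>>     // Delta + zig-zag + varint encoding of millisecond timestamps. The
>>     // first delta is taken against zero for simplicity; a real codec
>>     // would store a per-block base timestamp instead.
>>     static byte[] encode(List<Long> timestamps) {
>>         ByteArrayOutputStream out = new ByteArrayOutputStream();
>>         long prev = 0;
>>         for (long ts : timestamps) {
>>             long delta = ts - prev;
>>             prev = ts;
>>             long zz = (delta << 1) ^ (delta >> 63);  // zig-zag handles negatives
>>             while ((zz & ~0x7FL) != 0) {             // varint: 7 bits per byte
>>                 out.write((int) ((zz & 0x7F) | 0x80));
>>                 zz >>>= 7;
>>             }
>>             out.write((int) zz);
>>         }
>>         return out.toByteArray();
>>     }
>> }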
>>
>> And I wasn't comparing Cassandra & Hadoop, but the combination of
>> Spark + Cassandra + a distributed scheduler + other pieces vs. a Hadoop
>> installation. At that point they are quite comparable in many cases, with
>> the latter being easier to manage in the end. I don't want either for a
>> simple time series storage solution, as I have no need for any component
>> other than data storage.
>>
>>   - Micke
>>
>> ----- Original Message -----
>> From: "Jonathan Haddad" <j...@jonhaddad.com>
>> To: user@cassandra.apache.org
>> Sent: Friday, August 5, 2016 5:22:58 PM
>> Subject: Re: Merging cells in compaction / compression?
>>
>> Hadoop and Cassandra have very different use cases.  If the ability to
>> write a custom compression system is the primary factor in how you choose
>> your database I suspect you may run into some trouble.
>>
>> Jon
>>
>> On Fri, Aug 5, 2016 at 6:14 AM Michael Burman <mibur...@redhat.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Spark is an example of something I really don't want. It's resource
>> > heavy, it involves copying data, and it means managing yet another
>> > distributed system. I would also need a separate distributed system
>> > just to schedule the Spark jobs.
>> >
>> > That sounds like a nightmare just to implement a compression method.
>> > I might as well run Hadoop.
>> >
>> >   - Micke
>> >
>> > ----- Original Message -----
>> > From: "DuyHai Doan" <doanduy...@gmail.com>
>> > To: user@cassandra.apache.org
>> > Sent: Thursday, August 4, 2016 11:26:09 PM
>> > Subject: Re: Merging cells in compaction / compression?
>> >
>> > Looks like you're asking for some sort of ETL on your C* data. Why not
>> > use Spark to compress that data into blobs, and a user-defined function
>> > to explode them when reading?
>> >
>> > On Thu, Aug 4, 2016 at 10:08 PM, Michael Burman <mibur...@redhat.com>
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > No, I don't want to lose precision (if that's what you meant). But if
>> > > you meant just storing the points in a larger bucket (which I could
>> > > decompress either on the client side or the server side), that would
>> > > be fine. To clarify, it could look like this:
>> > >
>> > > 04082016T230215.1234, value
>> > > 04082016T230225.4321, value
>> > > 04082016T230235.2563, value
>> > > 04082016T230245.1145, value
>> > > 04082016T230255.0204, value
>> > >
>> > > ->
>> > >
>> > > 04082016T230200 -> blob (that has all the points for this minute
>> > > stored - no data is lost to aggregated avgs or sums or anything).
>> > >
>> > > That's acceptable. Of course the prettiest solution would be to keep
>> > > this hidden from the client, so that it would still see the original
>> > > rows when decompressing (like with byte[] compressors), but the above
>> > > is acceptable for my use case. If this is what you meant, then yes.
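>> > >
>> > > Client-side unpacking of such a minute blob would then be trivial -
>> > > a sketch, assuming a simple layout of a 4-byte count followed by raw
>> > > timestamp/value pairs (the layout is made up for illustration):
>> > >
>> > > import java.nio.ByteBuffer;
>> > > import java.util.LinkedHashMap;
>> > > import java.util.Map;
>> > >
>> > > final class BlockReader {
>> > >     // Decode a blob of (timestamp, value) pairs back into points.
>> > >     static Map<Long, Double> unpack(ByteBuffer blob) {
>> > >         int count = blob.getInt();
>> > >         Map<Long, Double> points = new LinkedHashMap<>(count);
>> > >         for (int i = 0; i < count; i++) {
>> > >             points.put(blob.getLong(), blob.getDouble());
>> > >         }
>> > >         return points;
>> > >     }
>> > > }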
>> > >
>> > >   -  Micke
>> > >
>> > > ----- Original Message -----
>> > > From: "Eric Stevens" <migh...@gmail.com>
>> > > To: user@cassandra.apache.org
>> > > Sent: Thursday, August 4, 2016 10:26:30 PM
>> > > Subject: Re: Merging cells in compaction / compression?
>> > >
>> > > When you say merge cells, do you mean re-aggregating the data into
>> > > coarser time buckets?
>> > >
>> > > On Thu, Aug 4, 2016 at 5:59 AM Michael Burman <mibur...@redhat.com>
>> > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > Considering the following example structure:
>> > > >
>> > > > CREATE TABLE data (
>> > > > metric text,
>> > > > value double,
>> > > > time timestamp,
>> > > > PRIMARY KEY((metric), time)
>> > > > ) WITH CLUSTERING ORDER BY (time DESC)
>> > > >
>> > > > The natural insertion order is metric, value, timestamp triples, for
>> > > > example one metric/value pair per second. That means creating more
>> > > > and more cells in the same partition, which adds a large amount of
>> > > > overhead and limits the compression ratio of LZ4 & Deflate (LZ4
>> > > > reaches ~0.26 and Deflate ~0.10 in some of the examples I've run).
>> > > > Now, to improve the compression ratio, how could I merge the cells
>> > > > on the actual Cassandra node? I looked at ICompressor and it
>> > > > provides only byte-level compression.
>> > > >
>> > > > Could I do this in the compaction phase, for example by extending
>> > > > DateTieredCompactionStrategy? It has SSTableReader/Writer facilities
>> > > > and it seems to be able to see the rows. I'm fine with the fact that
>> > > > a repair run might have to do some conflict resolution, as the final
>> > > > merged rows would be quite "small" (50 kB) in size. The naive
>> > > > approach is of course to fetch all the rows from Cassandra, merge
>> > > > them on the client, and send them back to Cassandra, but this seems
>> > > > very wasteful and has its own problems. Compared to table-level LZ4
>> > > > I was able to reduce the required size to 1/20th (context-aware
>> > > > compression is sometimes just that much better), so there are real
>> > > > benefits to this approach, even if I would probably violate multiple
>> > > > design decisions.
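>> > > >
>> > > > For reference, the naive client-side variant might look roughly like
>> > > > the sketch below (assuming the DataStax Java driver, 3.x-style API;
>> > > > the keyspace, metric name and the data_blocks table are made up for
>> > > > illustration):
>> > > >
>> > > > import com.datastax.driver.core.Cluster;
>> > > > import com.datastax.driver.core.Row;
>> > > > import com.datastax.driver.core.Session;
>> > > > import java.nio.ByteBuffer;
>> > > > import java.util.Date;
>> > > > import java.util.List;
>> > > >
>> > > > public class NaiveRollup {
>> > > >     public static void main(String[] args) {
>> > > >         String metric = "cpu.load";                     // example metric
>> > > >         Date start = new Date(1470355200000L);          // block start
>> > > >         Date end = new Date(start.getTime() + 60_000L); // one-minute block
>> > > >
>> > > >         try (Cluster cluster = Cluster.builder()
>> > > >                                       .addContactPoint("127.0.0.1").build();
>> > > >              Session session = cluster.connect("tsdb")) {
>> > > >
>> > > >             // 1. Fetch one block's worth of raw rows.
>> > > >             List<Row> rows = session.execute(
>> > > >                 "SELECT time, value FROM data " +
>> > > >                 "WHERE metric = ? AND time >= ? AND time < ?",
>> > > >                 metric, start, end).all();
>> > > >
>> > > >             // 2. Merge them into one blob of (timestamp, value) pairs.
>> > > >             ByteBuffer blob = ByteBuffer.allocate(4 + rows.size() * 16);
>> > > >             blob.putInt(rows.size());
>> > > >             for (Row r : rows) {
>> > > >                 blob.putLong(r.getTimestamp("time").getTime());
>> > > >                 blob.putDouble(r.getDouble("value"));
>> > > >             }
>> > > >             blob.flip();
>> > > >
>> > > >             // 3. Write the merged block back into a bucketed table:
>> > > >             //    data_blocks(metric text, block_start timestamp,
>> > > >             //    points blob, PRIMARY KEY ((metric), block_start)).
>> > > >             session.execute(
>> > > >                 "INSERT INTO data_blocks (metric, block_start, points) " +
>> > > >                 "VALUES (?, ?, ?)",
>> > > >                 metric, start, blob);
>> > > >         }
>> > > >     }
>> > > > }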
>> > > >
>> > > > One approach is of course to write to another storage system first
>> > > > and, once the blocks are ready, write them to Cassandra. But that
>> > > > again seems idiotic (I know some people are using Kafka in front of
>> > > > Cassandra, for example, but that means maintaining yet another
>> > > > distributed solution and defeats the benefit of Cassandra's easy
>> > > > management & scalability).
>> > > >
>> > > > Has anyone done something similar, or even planned it? If I need to
>> > > > extend something in Cassandra I can accept that approach too - but
>> > > > as I'm not that familiar with the Cassandra source code, I could use
>> > > > some hints.
>> > > >
>> > > >   - Micke
>> > > >
>> > >
>> >
>>
>
