Hadoop and Cassandra have very different use cases. If the ability to
write a custom compression system is the primary factor in how you choose
your database, I suspect you may run into some trouble.

Jon

On Fri, Aug 5, 2016 at 6:14 AM Michael Burman <mibur...@redhat.com> wrote:

> Hi,
>
> Spark is an example of something I really don't want. It's resource
> heavy, it involves copying data, and it means managing yet another
> distributed system. And I would also need a distributed system just to
> schedule the Spark jobs.
>
> That sounds like a nightmare of a way to implement a compression method.
> Might as well run Hadoop.
>
>   - Micke
>
> ----- Original Message -----
> From: "DuyHai Doan" <doanduy...@gmail.com>
> To: user@cassandra.apache.org
> Sent: Thursday, August 4, 2016 11:26:09 PM
> Subject: Re: Merging cells in compaction / compression?
>
> Looks like you're asking for some sort of ETL on your C* data. Why not use
> Spark to compress that data into blobs and a User-Defined Function to
> explode them when reading?
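>
> For example, something along these lines (a hypothetical function, assuming
> the blob is a sequence of fixed-width 8-byte timestamp / 8-byte double
> pairs; note a UDF can return a collection but not extra rows, and UDFs must
> be enabled in cassandra.yaml):
>
> CREATE FUNCTION explode_values(points blob)
>     RETURNS NULL ON NULL INPUT
>     RETURNS list<double>
>     LANGUAGE java
>     AS $$
>         // walk the blob, skipping timestamps and collecting values
>         java.util.List<Double> out = new java.util.ArrayList<Double>();
>         java.nio.ByteBuffer in = points.duplicate();
>         while (in.remaining() >= 16) {
>             in.getLong();            // 8-byte timestamp, skipped here
>             out.add(in.getDouble()); // 8-byte value
>         }
>         return out;
>     $$;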
>
> On Thu, Aug 4, 2016 at 10:08 PM, Michael Burman <mibur...@redhat.com>
> wrote:
>
> > Hi,
> >
> > No, I don't want to lose precision (if that's what you meant). But you
> > may have meant just storing them in a larger bucket (which I could
> > decompress either on the client side or the server side). To clarify, it
> > could be like:
> >
> > 04082016T230215.1234, value
> > 04082016T230225.4321, value
> > 04082016T230235.2563, value
> > 04082016T230245.1145, value
> > 04082016T230255.0204, value
> >
> > ->
> >
> > 04082016T230200 -> blob (that has all the points for this minute stored -
> > no data is lost to aggregated avgs or sums or anything).
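> >
> > (As a sketch of the bucket format I have in mind - assuming a naive
> > fixed-width layout of one 8-byte epoch-millis timestamp plus one 8-byte
> > double per point - the client-side packing could be as simple as:)
> >
> > import java.nio.ByteBuffer;
> >
> > // Pack one minute's worth of points into a single blob cell.
> > // Layout: repeated (long timestampMillis, double value) records.
> > public final class MinuteBlob {
> >     public static ByteBuffer pack(long[] timestamps, double[] values) {
> >         ByteBuffer buf = ByteBuffer.allocate(timestamps.length * 16);
> >         for (int i = 0; i < timestamps.length; i++) {
> >             buf.putLong(timestamps[i]);  // full-precision timestamp
> >             buf.putDouble(values[i]);    // full-precision value
> >         }
> >         buf.flip();
> >         return buf; // bind to a blob column keyed by the minute bucket
> >     }
> > }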
> >
> > That's acceptable. Of course the prettiest solution would be to keep this
> > hidden from the client, so that it would see the original rows when
> > decompressing (like with the byte[] compressors), but this is acceptable
> > for my use-case. If this is what you meant, then yes.
> >
> >   -  Micke
> >
> > ----- Original Message -----
> > From: "Eric Stevens" <migh...@gmail.com>
> > To: user@cassandra.apache.org
> > Sent: Thursday, August 4, 2016 10:26:30 PM
> > Subject: Re: Merging cells in compaction / compression?
> >
> > When you say merge cells, do you mean re-aggregating the data into
> > coarser time buckets?
> >
> > On Thu, Aug 4, 2016 at 5:59 AM Michael Burman <mibur...@redhat.com>
> wrote:
> >
> > > Hi,
> > >
> > > Considering the following example structure:
> > >
> > > CREATE TABLE data (
> > >     metric text,
> > >     value double,
> > >     time timestamp,
> > >     PRIMARY KEY ((metric), time)
> > > ) WITH CLUSTERING ORDER BY (time DESC)
> > >
> > > The natural inserting order is metric, value, timestamp pairs, one
> > > metric/value pair per second for example. That means creating more and
> > > more cells in the same partition, which creates a large amount of
> > > overhead and reduces the compression ratio of LZ4 & Deflate (LZ4
> > > reaches ~0.26 and Deflate ~0.10 ratios in some of the examples I've
> > > run). Now, to improve the compression ratio, how could I merge the
> > > cells on the actual Cassandra node? I looked at ICompressor and it
> > > provides only byte-level compression.
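> > >
> > > (For reference, the interface looks roughly like this - abridged from
> > > the sources, so take the exact signatures as approximate; the point is
> > > that it only ever sees opaque byte buffers, never rows or cells:)
> > >
> > > // org.apache.cassandra.io.compress.ICompressor (abridged):
> > > public interface ICompressor {
> > >     int initialCompressedBufferLength(int chunkLength);
> > >     void compress(ByteBuffer input, ByteBuffer output) throws IOException;
> > >     void uncompress(ByteBuffer input, ByteBuffer output) throws IOException;
> > > }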
> > >
> > > Could I do this in the compaction phase, by extending
> > > DateTieredCompactionStrategy for example? It has SSTableReader/Writer
> > > facilities and it seems to be able to see the rows. I'm fine with the
> > > fact that a repair run might have to do some conflict resolution, as
> > > the final merged rows would be quite "small" (50kB) in size. The naive
> > > approach is of course to fetch all the rows from Cassandra, merge them
> > > on the client and send them back to Cassandra, but this seems very
> > > wasteful and has its own problems. Compared to table-level LZ4 I was
> > > able to reduce the required size to 1/20th (context-aware compression
> > > is sometimes just so much better), so there are real benefits to this
> > > approach, even if I would probably violate multiple design decisions.
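> > >
> > > (To illustrate the kind of context-aware trick I mean - illustrative
> > > only, not the exact codec I used: delta-encoding the per-second
> > > timestamps with zig-zag varints turns regular intervals into runs of
> > > one-byte values, which LZ4/Deflate then squeeze far better than raw
> > > 8-byte longs:)
> > >
> > > // needs: import java.nio.ByteBuffer;
> > > static ByteBuffer deltaEncodeTimestamps(long[] ts) {
> > >     ByteBuffer buf = ByteBuffer.allocate(ts.length * 9); // varint worst case
> > >     long prev = 0;
> > >     for (long t : ts) {
> > >         long d = t - prev;             // consecutive points -> small deltas
> > >         prev = t;
> > >         long z = (d << 1) ^ (d >> 63); // zig-zag maps small +/- to small
> > >         while ((z & ~0x7FL) != 0) {    // varint, 7 bits per byte
> > >             buf.put((byte) ((z & 0x7F) | 0x80));
> > >             z >>>= 7;
> > >         }
> > >         buf.put((byte) z);
> > >     }
> > >     buf.flip();
> > >     return buf;
> > > }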
> > >
> > > One approach is of course to write to another storage first and, once
> > > the blocks are ready, write them to Cassandra. But that again seems
> > > idiotic (I know some people are using Kafka in front of Cassandra for
> > > example, but that means maintaining yet another distributed solution
> > > and defeats the benefit of Cassandra's easy management & scalability).
> > >
> > > Has anyone done something similar? Even planned? If I need to extend
> > > something in Cassandra I can accept that approach also - but as I'm not
> > > that familiar with Cassandra source code I could use some hints.
> > >
> > >   - Micke
> > >
> >
>
