Re: column bloat

aaron morton Tue, 10 May 2011 16:07:07 -0700

> For a reasonably large number of use cases (for me, 2 out of 3 at the moment) 
> supercolumns will be units of data where the columns (attributes) will never 
> change by themselves or where the data does not change anyway (archived data).


Can you use a standard CF and pack the multiple columns into one value in your 
app? It sounds like the super columns are just acting as opaque containers, 
and Cassandra does not need to know these are different values. Agreed, this 
only works if there is no concurrent access to the sub columns. I'm suggesting 
this with one eye on https://issues.apache.org/jira/browse/CASSANDRA-2231 
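
As a rough illustration of the packing idea, here is a minimal Python sketch. 
The record layout (a user id, a score and a flag) and the function names are 
made up for the example and are not part of any Cassandra API. The app 
serializes all attributes into one byte string and stores that as a single 
column value, so there is one name and one timestamp per record rather than 
one per attribute.

```python
import struct

# Hypothetical fixed layout for one archived record (an assumption for this
# sketch): 8-byte signed int, 8-byte float, 1-byte flag, big-endian, no padding.
RECORD = struct.Struct(">qdB")


def pack_record(user_id: int, score: float, active: bool) -> bytes:
    """Serialize all attributes into a single opaque column value."""
    return RECORD.pack(user_id, score, int(active))


def unpack_record(value: bytes):
    """Recover the attributes from the opaque value read back from the CF."""
    user_id, score, active = RECORD.unpack(value)
    return user_id, score, bool(active)
```

The trade-off is the one mentioned above: the whole value is written and 
read as a unit, so two writers updating different attributes at the same 
time would overwrite each other.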

> It would seem like a good optimization to allow a timestamp on the 
> supercolumn instead and remove the one on columns?
> 
> I believe this may also work as an optimization on compactions? Just skip 
> merging of columns under the supercolumn if the supercolumn has a timestamp 
> and just replace the entire supercolumn in that case.
> 
> Could be just a variation of the supercolumn object on insert. No timestamp, 
> use the one in the columns, include timestamp, ignore timestamps in columns.

SCs are more containers than columns. When it comes to reconciling their 
contents they act like column families: they ask the columns to reconcile, 
respecting the container's tombstone. Giving the SC a timestamp and making it 
act like a column would be a major change. 
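
To make the container behaviour concrete, here is a simplified Python model 
of it. This is a sketch for discussion only, not the actual Cassandra code: 
the class names, the `deleted_at` field and the `reconcile` function are 
assumptions made for the example, and tie-breaking on equal timestamps is 
ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Column:
    value: Optional[bytes]  # None models a column-level tombstone
    timestamp: int


@dataclass
class Container:
    """Stands in for a super column or column family (hypothetical model)."""
    columns: Dict[str, Column] = field(default_factory=dict)
    deleted_at: int = -1  # container tombstone; -1 means never deleted


def reconcile(a: Container, b: Container) -> Container:
    # The container has no timestamp of its own, only a tombstone marker.
    deleted_at = max(a.deleted_at, b.deleted_at)
    merged = {}
    for name in a.columns.keys() | b.columns.keys():
        candidates = [c.columns[name] for c in (a, b) if name in c.columns]
        # Each column is reconciled on its own timestamp.
        winner = max(candidates, key=lambda col: col.timestamp)
        # Columns written at or before the container tombstone are dropped.
        if winner.timestamp > deleted_at:
            merged[name] = winner
    return Container(merged, deleted_at)
```

In this model the merge always has to look at every column, which is why 
replacing the whole SC based on a single SC-level timestamp would be a 
different kind of object rather than a small tweak.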

 A
   
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 11 May 2011, at 03:30, Terje Marthinussen wrote:

> 
> Anyway, to sum that up, expiring columns are 1 byte more and
> non-expiring ones are 7 bytes
> less. Not arguing, it's still fairly verbose, especially with tons of
> very small columns.
> 
> Yes, you are right, sorry.
> Trying to do one thing too many at the same time. 
> My brain filtered out part of the "else if".
>  
> 
> > - inherit timestamps from the supercolumn
> 
> Columns inside a supercolumn have no reason to share the same timestamp (or
> even close ones for that matter). But maybe you're talking about something
> more subtle, in which case yes, there are ways to compress the data.
> 
> For a reasonably large number of use cases (for me, 2 out of 3 at the moment) 
> supercolumns will be units of data where the columns (attributes) will never 
> change by themselves or where the data does not change anyway (archived data).
> 
> It would seem like a good optimization to allow a timestamp on the 
> supercolumn instead and remove the one on columns?
> 
> I believe this may also work as an optimization on compactions? Just skip 
> merging of columns under the supercolumn if the supercolumn has a timestamp 
> and just replace the entire supercolumn in that case.
> 
> Could be just a variation of the supercolumn object on insert. No timestamp, 
> use the one in the columns, include timestamp, ignore timestamps in columns.
> 
> If that sounds like a sensible idea, I may be tempted to try to get time to 
> implement it. 
> 
> I am also tempted to do some other things like make some of the "ints" and 
> "shorts" variable length as well.
> 
> Terje
