Cassandra does "minor" compactions with a minimum of 4 sstables in the
same size "bucket," with buckets doubling in size as you compact.  So the
only time you rewrite the entire data set is the weekly-ish major
compaction, which is there for tombstone cleanup and anti-entropy.
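
In other words, something like this rough, untested sketch of the
bucketing idea (not the actual CompactionManager code -- the names and
thresholds here are made up for illustration):

import java.util.*;

public class CompactionBucketSketch {
    // Illustrative only: an assumption-laden sketch, not the real
    // org.apache.cassandra code.
    static final int MIN_COMPACTION_THRESHOLD = 4; // sstables needed for a minor compaction

    // Group sstable sizes (bytes) into buckets of roughly similar size.
    // A size far above a bucket's running average starts a new bucket, so
    // buckets naturally step up as compactions merge small files into
    // bigger ones.
    static List<List<Long>> bucket(List<Long> sstableSizes) {
        List<Long> sorted = new ArrayList<Long>(sstableSizes);
        Collections.sort(sorted);

        List<List<Long>> buckets = new ArrayList<List<Long>>();
        List<Long> current = new ArrayList<Long>();
        long sum = 0;

        for (long size : sorted) {
            if (!current.isEmpty() && size > 1.5 * (sum / (double) current.size())) {
                buckets.add(current);
                current = new ArrayList<Long>();
                sum = 0;
            }
            current.add(size);
            sum += size;
        }
        if (!current.isEmpty())
            buckets.add(current);
        return buckets;
    }

    public static void main(String[] args) {
        // e.g. four ~1MB memtable flushes get merged into one ~4MB sstable,
        // four of those into ~16MB, and so on -- only a major compaction
        // touches everything at once.
        List<Long> sizes = Arrays.asList(
                1000000L, 1100000L, 900000L, 1050000L, 4000000L, 16000000L);
        for (List<Long> b : bucket(sizes)) {
            if (b.size() >= MIN_COMPACTION_THRESHOLD)
                System.out.println("minor compaction candidate: " + b);
            else
                System.out.println("left alone for now: " + b);
        }
    }
}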

-Jonathan

On Tue, Mar 30, 2010 at 12:54 AM, Julian Simon <jsi...@jules.com.au> wrote:
> Forgive me as I'm probably a little out of my depth in trying to
> assess this particular design choice within Cassandra, but...
>
> My understanding is that Cassandra never updates data "in place" on
> disk - instead it completely re-creates the data files during a
> "flush".  Stop me if I'm wrong already ;-)
>
> So imagine we have a large data set in our ColumnFamily and we're
> constantly adding data to it.
>
> Every [x] minutes or [y] bytes, the compaction process is triggered,
> and the entire data set is written to disk.
>
> So as our data set grows over time, the compaction process will result
> in an increasingly large IO operation to write all that data to disk
> each time.
>
> We could easily be talking about single data files in the
> many-gigabyte size range, no?  Or is there a file size limit that I'm
> not aware of?
>
> If there's no such limit, is this an efficient approach to take for
> large data sets?  Seems like we would become awfully I/O bound,
> writing the entire thing from scratch each time.
>
> Do let me know if I've gotten it all wrong ;-)
>
> Cheers,
> Jules
>
