On 2010-03-30 05:54, Julian Simon wrote:
> My understanding is that Cassandra never updates data "in place" on
> disk - instead it completely re-creates the data files during a
> "flush".  Stop me if I'm wrong already ;-)

You're correct that existing SSTables are immutable: flushes and
compactions only ever write new files, and old SSTables are retired
after compaction rather than modified in place.

> So imagine we have a large data set in our ColumnFamily and we're
> constantly adding data to it.

Sounds like a good reason to consider Cassandra.

> Every [x] minutes or [y] bytes, the compaction process is triggered,
> and the entire data set is written to disk.

A note on terminology first: what you've described, a periodic write
triggered every [x] minutes or [y] bytes, is the memtable flush, and
it only writes out the memtable's contents. Compaction is the separate
process that merges already-flushed SSTables, and it happens in
several stages:
http://wiki.apache.org/cassandra/MemtableSSTable
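
To make that concrete, here is a toy sketch of that write path in
Python (nothing to do with Cassandra's actual Java internals; the
class name and flush threshold are made up): writes land in an
in-memory memtable, a flush writes the memtable out as a brand-new
immutable file, and compaction merges the existing files into a new
one and deletes the inputs. No file is ever edited in place.

import json, os

FLUSH_THRESHOLD = 3   # flush after this many writes (arbitrary)

class ToyStore:
    def __init__(self, data_dir="toy_data"):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.memtable = {}     # in-memory, mutable
        self.sstables = []     # on-disk paths, immutable once written
        self.generation = 0

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self):
        # Write the memtable as a brand-new file; old files untouched.
        self.generation += 1
        path = os.path.join(self.data_dir,
                            f"sstable-{self.generation}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstables.append(path)
        self.memtable = {}

    def compact(self):
        # Merge all SSTables (newest value for a key wins), write one
        # new file, then retire the inputs rather than editing them.
        merged = {}
        for path in self.sstables:          # oldest to newest
            with open(path) as f:
                merged.update(json.load(f))
        self.generation += 1
        new_path = os.path.join(self.data_dir,
                                f"sstable-{self.generation}.json")
        with open(new_path, "w") as f:
            json.dump(merged, f)
        for path in self.sstables:
            os.remove(path)
        self.sstables = [new_path]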

> So as our data set grows over time, the compaction process will result
> in an increasingly large IO operation to write all that data to disk
> each time.

I'll interpret that to mean "an increasingly large IO operation [for
each node] to write all that data to disk each time."

That is not entirely correct from an operational standpoint. In a
cluster where the node count exceeds the ReplicationFactor, a single
server holds only a fraction of the rows in each ColumnFamily. If a
cluster ever reached the point where compaction on some boxes required
more IO than they could sustain, you would simply add nodes (keeping
the same ReplicationFactor), which spreads the rows, and the
compaction work, over more machines.
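
Back-of-the-envelope (the 900 GB total and the node counts below are
made-up numbers): with ReplicationFactor=3 and evenly balanced rows,
each node's share of the data, and therefore its compaction IO,
shrinks in proportion to the node count.

def per_node_gb(total_gb, replication_factor, nodes):
    # Rough per-node share, assuming evenly distributed rows.
    return total_gb * replication_factor / nodes

for nodes in (3, 6, 12, 24):
    print(f"{nodes:>2} nodes: {per_node_gb(900, 3, nodes):7.1f} GB/node")

# Prints 900.0, 450.0, 225.0, 112.5: doubling the node count halves
# the data each individual node has to compact.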

> We could easily be talking about single data files in the
> many-gigabyte size range, no?  Or is there a file size limit that I'm
> not aware of?

It's certainly possible for SSTable files to reach multi-gigabyte
sizes, but that should not be a problem.

> If not, is this an efficient approach to take for large data sets?
> Seems like we would become awfully IO bound, writing the entire thing
> from scratch each time.
> 
> Do let me know if I've gotten it all wrong ;-)

The mistake you're making is assuming that IO capability is equivalent
for sequential and random activity. A system that updates items in
place on disk (when possible) may write fewer bytes, but at the cost
of many seeks, and on a standard hard disk those seeks are expensive.
Cassandra is optimized to avoid seeks on writes, even if that means
writing considerably more data, sequentially, over the long term.
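
If you want to see the gap on your own hardware, here is a rough
micro-benchmark sketch (the file name, sizes, and counts are all
arbitrary, and the numbers will differ wildly between a spinning disk
and an SSD): 100 scattered 4 KiB writes, each forced to the platter,
versus one 64 MiB sequential write.

import os, random, time

PATH = "io_test.bin"
SIZE = 1024 * 1024 * 1024   # 1 GiB region to scatter writes across
BLOCK = 4096                # 4 KiB per random write ("thimbles")
RANDOM_WRITES = 100
SEQ_MB = 64                 # one big sequential write ("buckets")

# Pre-create a large (sparse) file so random writes have room to seek.
with open(PATH, "wb") as f:
    f.truncate(SIZE)

# Random: seek somewhere scattered, write a block, force it to disk.
start = time.time()
with open(PATH, "r+b") as f:
    for _ in range(RANDOM_WRITES):
        f.seek(random.randrange(0, SIZE - BLOCK))
        f.write(os.urandom(BLOCK))
        f.flush()
        os.fsync(f.fileno())
random_secs = time.time() - start

# Sequential: append far more data in one pass, sync once at the end.
start = time.time()
with open(PATH, "ab") as f:
    for _ in range(SEQ_MB):
        f.write(os.urandom(1024 * 1024))
    f.flush()
    os.fsync(f.fileno())
seq_secs = time.time() - start

print(f"{RANDOM_WRITES} random {BLOCK} B writes: {random_secs:.2f}s")
print(f"{SEQ_MB} MiB sequential write:   {seq_secs:.2f}s")
os.remove(PATH)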

Imagine I asked you to fetch water from around the city. For the first
round, I have you fetch 100 thimbles of water scattered randomly
across the city. It's not much water, but collecting it takes you a
very long time.

On the second round, I have you fetch 20x the water in volume (say, 20
buckets), but it's all on one street corner. Despite the considerable
increase in volume, you're done much faster on the second round.

Then, on an ongoing basis, you have the daily choice of fetching 30
thimbles (30% of the original thimble run) scattered all over town, or
20 buckets (a full repeat of the bucket run) from the designated
street corner.

You'd want to pick the bucket option -- despite the higher volume -- and
your hard disk would agree.

-- 
David Strauss
   | da...@fourkitchens.com
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]
