RE: Consequences of having many columns

Kochheiser,Todd W - TOK-DITT-1 Tue, 13 Jul 2010 13:01:07 -0700

So it would appear that 0.7 will have solved the requirement that a single row 
must be able to fit in memory.  That issue aside, how would one expect the 
read/write performance to be in the scenarios listed below?

________________________________
From: Mason Hale [mailto:ma...@onespot.com]
Sent: Tuesday, July 13, 2010 8:41 AM
To: user@cassandra.apache.org
Subject: Re: Consequences of having many columns

Currently there is a limitation that each row must fit in memory (with some not 
insignificant overhead), thus having lots of columns per row can trigger 
out-of-memory errors. This limitation should be removed in a future release.

Please see:
  - http://wiki.apache.org/cassandra/CassandraLimitations
  - https://issues.apache.org/jira/browse/CASSANDRA-16  (notice this is marked 
as resolved now)

Mason

On Tue, Jul 13, 2010 at 9:38 AM, Kochheiser,Todd W - TOK-DITT-1 
<twkochhei...@bpa.gov> wrote:
I recently ran across a blog posting with a comment from a Cassandra committer 
that indicated a performance penalty when having a large number of columns per 
row/key.  Unfortunately I didn't bookmark the blog posting and now I can't find 
it.  Regardless, since our current plan and design is to have several thousand 
columns per row/key, it made me question our design and whether it might cause 
unintended performance consequences.  As a somewhat concrete example for 
discussion purposes, which of the following scenarios would potentially perform 
better or worse?

Assume:

 *   Single ColumnFamily
 *   Three node cluster
 *   10 to 1 read/write ratio (10 reads to every write)

Scenario A:

 *   10k rows
 *   5k columns/row
 *   Each column ~ 64kB
 *   Hot spot for writes and reads would be a single column in each row (the 
hot column would change every hour).  We would be accessing every row 
constantly, but in general accessing just a few columns in each.
 *   A low volume of reads accessing ~100 columns per row (range queries would 
work)
 *   Access is generally direct (row key / column key)
 *   Data growth would be horizontal (adding columns) as opposed to vertical 
(adding rows)
 *   This is our current design

Scenario B:

 *   50M rows/keys
 *   1 column/key
 *   Each column ~ 64kB
 *   Hot spot for writes and reads would be the single column in 10k rows, but 
the 10k rows accessed would change every hour.
 *   Access would generally be direct (row key / column key)
 *   Data growth would be vertical (adding rows 10k at a time) as opposed to 
horizontal (adding columns)

Scenario C:

 *   5k rows/keys
 *   10k columns/row
 *   Each column ~64kB
 *   Hot spot for writes and reads would be every column in a single row.  The 
row being accessed would change every hour
 *   Access is generally direct (row key / column key)
 *   Low volume of queries accessing a single column in many rows
 *   Data growth would be by adding rows, each with 10k columns.

In all three scenarios the amount of data is the same but the access pattern is 
different.  From an application coding perspective any of the approaches is 
feasible, although the data is easier to think about in Scenario A (i.e. fewer 
mental gymnastics and fewer composite keys).  In all of the scenarios there are 
10k columns that are constantly accessed (read and write).
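
As a rough check of the "same amount of data" statement, the totals for the 
three scenarios can be worked out with a few lines of Python.  This is only a 
sketch: it assumes "64kB" means 64 * 1024 bytes, counts raw column payload 
only (no replication, no per-column overhead), and the names COLUMN_BYTES, 
scenarios and totals are illustrative, not part of any Cassandra API.

```python
# Assumption: "64kB" is read as 64 * 1024 bytes of payload per column.
COLUMN_BYTES = 64 * 1024

# Row and column counts taken from Scenarios A, B and C above.
scenarios = {
    "A": {"rows": 10_000, "columns_per_row": 5_000},
    "B": {"rows": 50_000_000, "columns_per_row": 1},
    "C": {"rows": 5_000, "columns_per_row": 10_000},
}


def totals(rows, columns_per_row):
    """Return (total columns, total payload bytes, payload bytes per row)."""
    total_columns = rows * columns_per_row
    total_bytes = total_columns * COLUMN_BYTES
    row_bytes = columns_per_row * COLUMN_BYTES
    return total_columns, total_bytes, row_bytes


results = {name: totals(**shape) for name, shape in scenarios.items()}

for name, (cols, total, row) in results.items():
    print(
        f"Scenario {name}: {cols:,} columns, "
        f"{total / 2**40:.2f} TiB total, {row / 2**20:.2f} MiB per row"
    )
```

The total column count and payload come out identical for all three, but the 
payload per row differs by several orders of magnitude, which is the figure 
that matters for the row-must-fit-in-memory limitation mentioned earlier in 
the thread.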

Some thoughts: Scenario A has the advantage of evenly distributing reads/writes 
across all cluster nodes (I think).  Scenario B has the potential advantage of 
having one column per row (I think) but *not* necessarily distributing 
reads/writes evenly across all cluster nodes.  I'm not serious about Scenario C, but 
it is an option.  Scenario C would probably cause one node in the cluster to 
take the brunt of all reads/writes so I think this design would be a bad idea.  
And, if having lots of columns is a bad idea then this is even worse than 
scenario A.
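
The distribution argument above can be illustrated with a small Python 
simulation.  This is a sketch under loud assumptions: rows are placed by 
taking the MD5 hash of the row key modulo the node count (a simplified 
stand-in for hash-based placement, not Cassandra's actual token ring), 
replication is ignored, and the row keys and the helpers node_for and 
hot_columns_per_node are hypothetical.

```python
import hashlib
from collections import Counter

NODES = 3  # three-node cluster, per the stated assumptions


def node_for(row_key):
    """Simplified placement: MD5 of the row key modulo the node count."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


def hot_columns_per_node(hot_rows):
    """hot_rows is an iterable of (row_key, number_of_hot_columns) pairs."""
    counts = Counter()
    for row_key, hot_columns in hot_rows:
        counts[node_for(row_key)] += hot_columns
    return [counts.get(node, 0) for node in range(NODES)]


# Scenario A: one hot column in each of 10k rows (hypothetical key names).
load_a = hot_columns_per_node((f"row-{i}", 1) for i in range(10_000))

# Scenario C: all 10k hot columns sit in a single row.
load_c = hot_columns_per_node([("row-0", 10_000)])

print("Scenario A hot columns per node:", load_a)
print("Scenario C hot columns per node:", load_c)
```

Under these assumptions the hot columns of Scenario A spread roughly evenly, 
while Scenario C puts all of them on whichever node owns the hot row.  
Scenario B, with 10k hot single-column rows, would look like Scenario A in 
this model; whether that holds in practice depends on how keys map to nodes, 
for example if keys are placed in sorted order and the hot keys are adjacent.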

Regards,
Todd
