> Our team decided to use Cassandra as a storage solution for a
> dataset. I am very new to the NoSQL world and to Cassandra, so I am
> hoping to get some help from the community. The dataset is pretty
> simple: for each key we have a number of columns with values. Each
> day we compute a new version of this dataset; the new version will
> mostly update existing keys but could also add and delete some keys.
> (And we'll build a service that queries Cassandra.) A key requirement
> for us is that we keep N versions of the dataset around, in case we
> discover problems in the current version and need to roll back to an
> older one. I thought about creating a column family per version,
> which means we would create a new column family every day and
> occasionally delete column families according to some truncation
> policy. I know Cassandra 0.7 now makes changing the schema easier,
> but is this a good way to go? I would really like to hear what you
> think is the better way to handle this. Thank you.

What you propose would presumably work, and in some sense be efficient
if you intend to re-load the entire data set each day. However, I
wouldn't say column families are intended to be used like this, and it
sounds like it would be better to model the data differently. Column
families have implications for e.g. memtable sizes; schema changes are
a bit sensitive in how you submit them to a cluster (only one schema
change may be propagating at a time, etc.). In addition, it implies
that you have to keep track of and manage the schema somehow at a
meta-level.

While I don't see an actual unavoidable problem with it (e.g. memtable
size is not an issue if you completely stop writing to old column
families, schema change propagation can of course be solved, etc.), my
feeling is that you are likely better off modelling your data
differently.

It sounds like a possible way to go would be a CF with super columns
whose names are your time periods and whose subcolumns are your data
columns. That assumes a small amount of data per row in each period.
This way you have access to the latest version of your data by doing a
slice of size 1 (sorted on the period), while older versions remain
accessible and individually deletable. This assumes each super column
is updated completely or not at all, and that you don't have to merge
data across multiple super columns in your application.
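To make that concrete, here is a rough sketch using pycassa (a Python
client); the keyspace, column family and key names and the ISO-date
period format are assumptions of mine. 'Versions' would be declared as
a super column family with a UTF8Type comparator so that period names
sort lexicographically:

    import pycassa

    # Connect to the cluster; keyspace and server are placeholders.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    versions = pycassa.ColumnFamily(pool, 'Versions')  # a super CF

    # Write one day's version of a key: the super column name is the
    # period, its subcolumns are the data columns for that version.
    versions.insert('some-key', {'2011-01-15': {'colA': '1', 'colB': '2'}})

    # Read the latest version: a reversed slice of size 1 returns the
    # super column with the highest (i.e. most recent) period name.
    latest = versions.get('some-key', column_count=1, column_reversed=True)

    # Drop an old version of this key once it falls outside the
    # retention window of N versions.
    versions.remove('some-key', super_column='2011-01-01')

With dates as super column names, the reversed slice of size 1 always
gives you the most recent version without having to know what "today"
is, and rolling back is just a matter of reading a larger slice.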

Something to keep in mind here is that all insertions of new data and
updates will go to pre-existing rows (unless the row key is completely
new), so you would expect reads to be spread out over several sstables
on disk, since each period's update to a given row will typically land
in a different sstable until compaction merges them.

An alternative is to introduce a level of indirection (if you can live
with that): have a column family whose columns are named by your
periods and whose values are row keys into another column family. Each
row in the pointed-to column family is then your data for that
particular period. This would probably be advantageous in particular
if you intend to have a lot of data associated with each key, since
each period's data maps directly to a single row in Cassandra (and
large rows are okay, and give you efficient access to individual
columns or column ranges).

(But it still assumes you're updating either all the data or no data
for a given row key in a particular time period.)

-- 
/ Peter Schuller
