Re: Cassandra gotchas ...

Paul Pak Sat, 08 Jan 2011 17:58:20 -0800

Hi all,

After using Cassandra for some time, I have some comments that I hope
will spark productive conversation on the list.  They are meant only as
constructive feedback from a user of Cassandra.  While there are many
great things about Cassandra, I still feel that the current
implementation has two major issues that are limiting its ability to be
used in production.  There are many little gotchas that most people
don't find out about until they are most of the way through an
implementation.  Most of the gotchas I can live with, but the following
items seem like too heavy a cost to me.


1) If you have a result set with thousands of results, like an inbox,
there is no way to efficiently handle the pages <- 1 2 3 4 5 6 7 8 9 10
-> except by creating additional data structures on a materialized
view.  But that means you can only get paged views on materialized
views.  If you add constraints, the paging functionality no longer
works.  This is basic functionality that many, many applications need.
Essentially it means that we can only perform the most basic queries in
Cassandra, and that secondary indexes and super columns are nearly
useless.  Super columns are useless for complex queries because they
lack secondary indexes and because the entire row has to be
deserialized to work with them.  Regular CFs are no good either for
queries with constraints, because paging no longer works when there is
no materialized view.  There is no way to get the 800th
record in a result set without getting ALL the data up to the 800th
record.  That is crazy!  Cassandra desperately needs an efficient way
to return a result set by specifying start_column as a record number,
not a key.
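
To make the cost concrete, here is a minimal Python sketch.  It does
not talk to Cassandra at all: get_slice and fetch_page are hypothetical
helpers over an in-memory sorted list, loosely imitating a slice read
that takes a start column and a count.  It only illustrates why
reaching page N with a key-based cursor means reading pages 1..N-1
first.

```python
import bisect


def get_slice(columns, start_column, count):
    """Hypothetical stand-in for a key-based slice read.

    `columns` is a list of (name, value) pairs sorted by name.  Returns
    up to `count` pairs whose name is >= start_column.  (Rebuilding
    `names` on every call is wasteful, but keeps the sketch short.)
    """
    names = [name for name, _ in columns]
    i = bisect.bisect_left(names, start_column)
    return columns[i:i + count]


def fetch_page(columns, page_number, page_size):
    """Return (page, fetched) for a 1-based page number.

    The only cursor available is the name of the last column seen, so
    every earlier page has to be read to find where this one starts.
    `fetched` counts the columns pulled along the way.
    """
    start = ""      # the empty string sorts before every column name
    fetched = 0
    page = []
    for n in range(1, page_number + 1):
        # Ask for one extra column; its name is the next start key.
        chunk = get_slice(columns, start, page_size + 1)
        fetched += len(chunk)
        page = chunk[:page_size]
        if len(chunk) <= page_size:
            # No further pages exist.
            if n < page_number:
                page = []   # the requested page is past the end
            break
        start = chunk[page_size][0]
    return page, fetched


# A made-up "inbox" row with 1000 columns named msg00000..msg00999.
inbox = [("msg%05d" % i, "body %d" % i) for i in range(1000)]

# Page 80 at 10 per page ends with the 800th record.
page, fetched = fetch_page(inbox, 80, 10)
print(page[-1][0], fetched)
```

In this toy model, showing page 80 pulls 880 columns to display 10.
An offset-based start would need a single read of 10.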

2) Lack of operational support features.  For instance, there is no
capability to manage Cassandra's usage of disk space on nodes.  The fact
that an admin cannot specify where data goes, how to handle hot data, or
how to gracefully stop sending writes to a node is, in my opinion, a
fundamental problem with the partitioning strategy.  I believe the
entire partitioning
strategy needs to be revisited and probably rewritten to include
capabilities to accept administrator input on how to handle the data
(i.e. directories, machines, etc.), easily support moving data and
specifying where it should go, how many replicas, etc.  As it is, it is
just not flexible enough.  What if you have particularly hot data and
want to replicate it a dozen times to service read requests faster?  If
a node runs out of space for sstables, I still want it to keep serving
read requests, just not writes.  When nodes are moved, we need to
manually run cleanup.  Why is that?  If there is a safety reason, then
how is an administrator going to know better than Cassandra that the
operation was successful?

I know that Cassandra is a work in progress and there are many
limitations I can live with, but it would be nice to know the roadmap
for the next 12-24 months, so we can see what major directions Cassandra
is heading in and plan accordingly.  It would also be nice if the
community could vote on the features being considered, so that the devs
would have an idea of where the major pain points are for users of
Cassandra.  The questions that are especially important are: what
feature additions are being considered?  And what is being done to
improve Cassandra's operations management?  As clusters get larger,
having them run smoothly is critical for success with Cassandra.  I can
live with fewer features, but if I get going and the system falls flat
in production, that's a terrible situation.  Thanks and Happy New Year,
all!

Paul
