Hi all,

After using Cassandra for some time, I have some comments that I hope will spark productive conversation on the list. They are meant only as constructive feedback from a user of Cassandra. While there are many great things about Cassandra, I still feel that the current implementation has two major issues that limit its ability to be used in production. There are many little gotchas that most people don't find out about until they are most of the way through an implementation. Most of the gotchas I can live with, but the following items seem like too heavy a cost to me.

1) If you have a result set with thousands of results, like an inbox, there is no way to efficiently handle the pages <- 1 2 3 4 5 6 7 8 9 10 -> except by creating additional data structures on a materialized view. But that means you can only get paged views on materialized views. If you add constraints, the paging functionality no longer works. This is basic functionality that many, many applications need. Essentially it means that we can only perform the most basic queries in Cassandra, and secondary indexes and super columns are nearly useless. Super columns are useless for complex queries because of the lack of secondary indexes and the fact that Cassandra needs to deserialize the entire row to work with it. Regular CFs are no good either for queries with constraints, because the paging no longer works when there is no materialized view. There is no way to get the 800th record in a result set without getting ALL the data up to the 800th record. That is crazy! Cassandra desperately needs an efficient way to return a result set by specifying a start_column by record number, not key.

2) Lack of operational support features. For instance, there is no capability to manage Cassandra's usage of disk space on nodes. The fact that an admin cannot specify where data goes, how to handle hot data, or gracefully stop handling writes to nodes is a fundamental problem with the partitioning strategy, in my opinion. I believe the entire partitioning strategy needs to be revisited and probably rewritten to accept administrator input on how to handle the data (i.e. directories, machines, etc.), and to easily support moving data and specifying where it should go, how many replicas, etc. As it is, it is just not flexible enough. What if you have particularly hot data and want to replicate it a dozen times to service read requests faster? If a node runs out of space for sstables, I still want it to be operational for read requests, but not for writes.

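As a rough sketch of the kind of "reads yes, writes no" policy described above: this is not Cassandra code or an existing Cassandra feature. The names `NodeMode`, `choose_mode`, and `mode_for_data_dir`, and the 10% free-space threshold, are all made up for illustration. Only `shutil.disk_usage` is a real (Python standard library) call.

```python
import shutil
from dataclasses import dataclass


@dataclass
class NodeMode:
    """Hypothetical per-node state: which request types the node accepts."""
    accept_reads: bool
    accept_writes: bool


def choose_mode(free_bytes: int, total_bytes: int,
                min_free_fraction: float = 0.10) -> NodeMode:
    """Illustrative policy: always serve reads, refuse writes when disk is low.

    min_free_fraction is an assumed, admin-tunable threshold.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    low_on_space = (free_bytes / total_bytes) < min_free_fraction
    return NodeMode(accept_reads=True, accept_writes=not low_on_space)


def mode_for_data_dir(path: str, min_free_fraction: float = 0.10) -> NodeMode:
    """Apply the policy to the filesystem holding a (hypothetical) data directory."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return choose_mode(usage.free, usage.total, min_free_fraction)
```

For example, `choose_mode(free_bytes=5, total_bytes=100)` would leave reads enabled and turn writes off, while `choose_mode(free_bytes=50, total_bytes=100)` would accept both. A real implementation would also have to redirect the refused writes to other replicas, which this sketch does not attempt.
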
When nodes are moved, we need to manually run cleanup. Why is that? If there is a safety reason, then how is an administrator going to know better than Cassandra whether the operation was successful?

I know that Cassandra is a work in progress and there are many limitations I can live with, but it would be nice to know what the roadmap is for the next 12-24 months, so we can get an idea of the major directions Cassandra is going in and plan accordingly. It would be nice if the community could vote on the features being considered so that the devs would have an idea of where the major pain points are for the users of Cassandra. The questions that are especially important are: what feature additions are being considered? And what is being done to improve Cassandra's operations management? As clusters get larger, having them run smoothly is critical for success with Cassandra. I can live with fewer features, but if I get going and the system falls flat in production, that's a terrible situation.

Thanks and Happy New Year all!

Paul