Hey all -
With the uptick in discussion around Cassandra operability and after discussing 
potential solutions with various members of the community, we would like to 
propose the addition of a management process/sub-project into Apache Cassandra. 
The process would be responsible for common operational tasks like bulk 
execution of nodetool commands, backup/restore, and health checks, among 
others. We feel we have a proposal that will garner some discussion and debate 
but is likely to reach consensus.

While the community, in large part, agrees that these features should exist “in 
the database”, there is debate on how they should be implemented: primarily, 
whether to use an external process or build on CassandraDaemon. This is 
an important architectural decision but we feel the most critical aspect is not 
where the code runs but that the operator still interacts with the notion of a 
single database. Multi-process databases are as old as Postgres and continue to 
be common in newer systems like Druid. As such, we propose a separate 
management process for the following reasons:
   
   - Resource Isolation & Safety: Features in the management process will not 
affect C*'s read/write path, which is critical for stability. An isolated 
process has several technical advantages including preventing use of 
unnecessary dependencies in CassandraDaemon, separation of JVM resources like 
thread pools and heap, and preventing bugs from adversely affecting the main 
process. In particular, GC tuning can be done separately for the two processes, 
hopefully helping to improve, or at least not adversely affect, tail latencies 
of the main process.    

   - Health Checks & Recovery: Currently users implement health checks in their 
own sidecar process. Implementing them in the serving process does not make 
sense because if the JVM running the CassandraDaemon goes south, the 
health checks and potentially any recovery code may not be able to run. Having a 
management process running in isolation opens up the possibility to not only 
report on the health of the C* process, such as long GC pauses or a stuck JVM, 
but also to recover from it. Having a list of basic health checks that are tested 
with every C* release and officially supported will help boost confidence in C* 
quality and make it easier to operate.   

   - Reduced Risk: By having a separate daemon we open the possibility to 
contribute features that otherwise would not have been considered, e.g. a 
UI. A library that starts many background threads and is operated completely 
differently would likely be considered too risky for CassandraDaemon but is a 
good candidate for the management process.
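
To make the health check and recovery point more concrete, here is a minimal 
sketch of what a check running in the separate process might look like. It is 
only an illustration under our own assumptions: the class HealthMonitor, its 
State values, the failure threshold, and the tcpProbe helper are hypothetical 
names, not existing Cassandra code. The probe simply tests whether a TCP port 
(for example the native transport port, 9042 by default) accepts connections.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.function.BooleanSupplier;

/** Hypothetical sidecar-side monitor; not an existing Cassandra class. */
final class HealthMonitor {
    enum State { HEALTHY, DEGRADED, NEEDS_RECOVERY }

    private final BooleanSupplier probe;
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    HealthMonitor(BooleanSupplier probe, int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.probe = probe;
        this.failureThreshold = failureThreshold;
    }

    /** Runs the probe once and returns the state after this observation. */
    State poll() {
        if (probe.getAsBoolean()) {
            consecutiveFailures = 0;
            return State.HEALTHY;
        }
        consecutiveFailures++;
        // Only ask for recovery after several failures in a row, so a single
        // slow response (e.g. one long GC pause) does not trigger a restart.
        return consecutiveFailures >= failureThreshold
                ? State.NEEDS_RECOVERY
                : State.DEGRADED;
    }

    /** Assumed probe: true if host:port accepts a TCP connection in time. */
    static BooleanSupplier tcpProbe(String host, int port, int timeoutMillis) {
        return () -> {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMillis);
                return true;
            } catch (IOException e) {
                return false;
            }
        };
    }
}
```

A caller could construct it as new HealthMonitor(HealthMonitor.tcpProbe(
"127.0.0.1", 9042, 1000), 3) and call poll() on a timer. Because the monitor 
lives outside the CassandraDaemon JVM, it keeps running even when that JVM is 
stuck, which is the property argued for above.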


What can go into the management process?   
   - Features that are non-essential for serving reads & writes, e.g. 
backup/restore or running health checks against the CassandraDaemon.

   - Features that do not make the management process critical for functioning 
of the serving process. In other words, if someone does not wish to use this 
management process, they are free to disable it.

We would like to initially build a minimal set of features, such as health 
checks and bulk commands, into the first iteration of the management process. We would 
use the same software stack that is used to build the current CassandraDaemon 
binary. This would be critical for sharing code between CassandraDaemon & 
management processes. The code should live in-tree to make this easy.
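
As a rough illustration of the "bulk commands" idea, the sketch below fans one 
nodetool command out to a list of hosts and collects the exit codes. This is 
an assumption-laden example, not a design: the class BulkCommand and its 
methods are hypothetical, it shells out to the nodetool binary with its -h 
host option purely for simplicity, and a real implementation might talk to the 
nodes in a different way and in parallel.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bulk-command helper; not an existing Cassandra class. */
final class BulkCommand {

    /** Builds the argument vector for one host: nodetool -h HOST COMMAND... */
    static List<String> nodetoolArgv(String host, List<String> command) {
        List<String> argv = new ArrayList<>();
        argv.add("nodetool");
        argv.add("-h");
        argv.add(host);
        argv.addAll(command);
        return argv;
    }

    /** Runs one argv as a local subprocess; returns -1 if it cannot be run. */
    static int runLocally(List<String> argv) {
        try {
            Process process = new ProcessBuilder(argv).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    /** Applies the command to each host in order; maps host to exit code. */
    static Map<String, Integer> runOnAll(List<String> hosts,
                                         List<String> command,
                                         Function<List<String>, Integer> runner) {
        Map<String, Integer> results = new LinkedHashMap<>();
        for (String host : hosts) {
            results.put(host, runner.apply(nodetoolArgv(host, command)));
        }
        return results;
    }
}
```

For example, BulkCommand.runOnAll(hosts, List.of("status"), 
BulkCommand::runLocally) would run nodetool status against every host in the 
list. Passing the runner as a parameter keeps the fan-out logic testable 
without a running cluster.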

With regard to more in-depth features like repair scheduling and discussions 
around compaction in or out of CassandraDaemon, while the management process 
may be a suitable host, it is not our goal to decide that at this time. The 
management process could be used in these cases, as they meet the criteria 
above, but other technical/architectural reasons may exist for why it should 
not be.

We are looking forward to your comments on our proposal,
Dinesh Joshi and Jordan West
