[
https://issues.apache.org/jira/browse/CASSSIDECAR-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18016891#comment-18016891
]
Andrés Beck-Ruiz edited comment on CASSSIDECAR-274 at 8/28/25 9:36 PM:
-----------------------------------------------------------------------
I agree that having a single centralized API for all operations is ideal from a
user experience perspective.
{quote}Is there a particular benefit (or limitation with the existing
framework) to keeping these operation types distinct at the API level that I
might be missing?
{quote}
These are the limitations I originally found with the existing framework,
which might need to be enhanced to address them:
* The framework does not persist jobs if a Sidecar instance crashes, which
CASSSIDECAR-341 would address
* Sidecar instances can't query the status of non-local jobs, which again
could be addressed by CASSSIDECAR-341
* There is no way to update the state of a currently running job. The
ability to pause or abort a restart would be important for ensuring operational
safety.
* The current
[OperationalJobResponse|https://github.com/apache/cassandra-sidecar/blob/trunk/client-common/src/main/java/org/apache/cassandra/sidecar/common/response/OperationalJobResponse.java]
is not detailed enough to give proper visibility into a restart job or other
cluster-wide operations. It would be important for an operator to understand,
for example, which individual nodes restarted successfully and which failed.
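To illustrate the last point, here is a minimal sketch of what per-node
visibility could look like. Everything in it is hypothetical: the class name
RestartJobStatusSketch, the NodeState values, and the methods are assumptions
made for illustration, and none of them are part of the existing
OperationalJobResponse or any Sidecar API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-node view of a cluster-wide restart job (illustration only). */
public final class RestartJobStatusSketch {

    /** Assumed node states; the real set would be defined by the eventual design. */
    public enum NodeState { PENDING, RESTARTING, SUCCEEDED, FAILED }

    private final String jobId;
    // LinkedHashMap keeps nodes in the order they were scheduled.
    private final Map<String, NodeState> nodeStates = new LinkedHashMap<>();

    public RestartJobStatusSketch(String jobId, List<String> nodes) {
        this.jobId = jobId;
        for (String node : nodes) {
            nodeStates.put(node, NodeState.PENDING);
        }
    }

    /** Records a state change for a node that belongs to this job. */
    public void update(String node, NodeState state) {
        if (!nodeStates.containsKey(node)) {
            throw new IllegalArgumentException("Unknown node: " + node);
        }
        nodeStates.put(node, state);
    }

    /** Returns the nodes currently in the given state, in scheduling order. */
    public List<String> nodesIn(NodeState state) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, NodeState> entry : nodeStates.entrySet()) {
            if (entry.getValue() == state) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** One-line summary an operator could read at a glance. */
    public String summary() {
        return jobId
            + ": succeeded=" + nodesIn(NodeState.SUCCEEDED)
            + ", failed=" + nodesIn(NodeState.FAILED)
            + ", pending=" + nodesIn(NodeState.PENDING);
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        nodes.add("node-1");
        nodes.add("node-2");
        nodes.add("node-3");
        RestartJobStatusSketch status = new RestartJobStatusSketch("restart-job", nodes);
        status.update("node-1", NodeState.SUCCEEDED);
        status.update("node-2", NodeState.FAILED);
        System.out.println(status.summary());
    }
}
```

A response shaped like this would let an operator see which nodes need
attention before deciding whether to resume or abort the job.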
I have a draft CEP ready for approaching rolling restarts via Sidecar. It
includes a design for durable, cluster-accessible operations that could
address CASSSIDECAR-341, as well as an extensible approach to cluster-wide
operations. I am planning to open it so that the larger community can give
feedback, and am open to further discussion about how this API could be
organized and whether we should extend the current job management framework.
> Enable rolling restarts of Cassandra clusters via Sidecar
> ---------------------------------------------------------
>
> Key: CASSSIDECAR-274
> URL: https://issues.apache.org/jira/browse/CASSSIDECAR-274
> Project: Sidecar for Apache Cassandra
> Issue Type: Improvement
> Reporter: Isaac Reath
> Priority: Major
> Attachments: Screenshot 2025-08-13 at 12.34.43 PM.png
>
>
> Rolling restarts are frequently used in Cassandra to apply changes to a
> cluster, such as configuration changes or version upgrades. In
> CASSSIDECAR-266, we are adding functionality to safely start and stop a
> single Cassandra node via Sidecar. This ticket will build on that work to
> implement a coordinated rolling restart.
> The scope of this effort includes:
> * Adding API endpoints to enable operators to start, monitor, pause and stop
> a rolling restart.
> * Updating Sidecar to orchestrate start and stop operations across the
> cluster, allowing for a configurable number of nodes to be offline
> simultaneously.
> * Creating safeguards to ensure that a rolling restart is safe to perform
> and does not interfere with other operations ongoing in the cluster such as
> node bootstraps or decommissions.