hubcio commented on code in PR #3233:
URL: https://github.com/apache/iggy/pull/3233#discussion_r3225762509


##########
core/server/src/shard/mod.rs:
##########
@@ -194,6 +199,10 @@ impl IggyShard {
         // Spawn shutdown handler
         compio::runtime::spawn(async move {
             let _ = stop_receiver.recv().await;
+            #[cfg(feature = "systemd")]
+            if shard_for_shutdown.id == 0 {
+                let _ = sd_notify::notify(&[sd_notify::NotifyState::Stopping]);

Review Comment:
   raw `sd_notify::notify(...)` is called inline here for `STOPPING=1`, and again in 
`core/server/src/shard/tasks/oneshot/config_writer.rs` for `READY=1`. meanwhile 
`core/ai/mcp/src/systemd.rs` wraps the same calls in a typed module 
(`notify_ready` / `notify_stopping` / `spawn_watchdog`). same project, two 
patterns -- a new contributor cannot grep for "where do systemd calls live in 
iggy" and get one answer.
   
   extract `core/server/src/shard/systemd.rs` mirroring the MCP module shape 
and route all server-side calls through it. ~30 LoC refactor. fine as a 
follow-up PR if you want to keep this one small.
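   
   a rough sketch of the extracted module, mirroring the MCP shape shown further 
down in this review (module path from the suggestion above; the single-argument 
`sd_notify::notify(&[...])` call shape is copied from this diff and depends on 
the crate version in use):
   
```rust
// core/server/src/shard/systemd.rs (sketch only)
use tracing::warn;

/// READY=1 -- listeners are bound and the shard accepts connections.
pub fn notify_ready() {
    if let Err(e) = sd_notify::notify(&[sd_notify::NotifyState::Ready]) {
        warn!("Failed to send systemd READY=1 notification: {e}");
    }
}

/// STOPPING=1 -- shutdown has started.
pub fn notify_stopping() {
    let _ = sd_notify::notify(&[sd_notify::NotifyState::Stopping]);
}
```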



##########
core/server/src/shard/tasks/oneshot/config_writer.rs:
##########
@@ -135,5 +135,10 @@ async fn write_config(
 
     info!("Current config written and synced to: {config_path} with all bound 
addresses",);
 
+    #[cfg(feature = "systemd")]
+    if let Err(e) = sd_notify::notify(&[sd_notify::NotifyState::Ready]) {

Review Comment:
   `READY=1` is silently lost on any failure of `write_config()` above this 
line. the task is registered with `.critical(false)` (around line 32), and `?` 
propagates errors from `open` / `write_all_at` / `sync_all` (around lines 
110-134) before reaching this notify. so any of {runtime dir missing, disk 
full, AppArmor block, read-only fs, custom path override} causes a healthy, 
fully-bound server to be killed by systemd's `TimeoutStartSec=` because 
`READY=1` never fires. `task_registry` silently drops non-critical task errors, 
so there is no log line that screams about it either.
   
   fix: move the notify to right after listener-bind completes (the `if 
tcp_ready && quic_ready && http_ready && websocket_ready { break; }` block 
earlier in this function), before any fallible disk i/o. readiness is 
"listeners are bound", not "config file is fsynced". alternative: set 
`.critical(true)` on the oneshot so the task tears down with an error rather 
than silently dropping `READY=1`.
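   
   roughly (sketch only -- the readiness condition is quoted from this function, 
the notify call shape is copied from this diff):
   
```rust
// inside the bind-wait loop, before any fallible disk i/o
if tcp_ready && quic_ready && http_ready && websocket_ready {
    // listeners are bound: this is the real readiness point, so a failed
    // config write can no longer swallow READY=1
    #[cfg(feature = "systemd")]
    if let Err(e) = sd_notify::notify(&[sd_notify::NotifyState::Ready]) {
        warn!("Failed to send systemd READY=1 notification: {e}");
    }
    break;
}
```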
   
   the MCP side seems fine -- `core/ai/mcp/src/main.rs:107` emits `READY=1` 
immediately after the synchronous `serve(stdio())` returns with no fallible i/o 
in between.



##########
core/ai/mcp/src/systemd.rs:
##########
@@ -0,0 +1,50 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use tracing::{info, warn};
+
+pub fn notify_ready() {
+    if let Err(e) = sd_notify::notify(&[sd_notify::NotifyState::Ready]) {
+        warn!("Failed to send systemd READY=1 notification: {e}");
+    }
+}
+
+pub fn notify_stopping() {
+    let _ = sd_notify::notify(&[sd_notify::NotifyState::Stopping]);
+}
+
+pub fn spawn_watchdog() {
+    let Some(timeout) = sd_notify::watchdog_enabled() else {
+        return;
+    };
+
+    let interval = timeout / 2;
+    info!(
+        "Systemd watchdog enabled, pinging every {}s (timeout: {}s).",
+        interval.as_secs(),
+        timeout.as_secs()
+    );
+
+    tokio::spawn(async move {

Review Comment:
   this spawns an infinite loop via `tokio::spawn`, drops the `JoinHandle`, and wires 
up no cancellation. not a true leak -- the process exits on 
`client_to_shutdown.shutdown().await` in `main.rs` and the task gets reaped -- but it 
is inconsistent with the server side at 
`core/server/src/shard/tasks/periodic/systemd_watchdog.rs`, which goes through 
`task_registry.periodic(...)` with cooperative `ShutdownToken` wiring.
   
   easy follow-up: accept a `CancellationToken` (from `tokio_util`) driven by 
the existing SIGTERM `select!` in `main.rs`, and replace `loop { sleep; ping }` 
with `select! { _ = token.cancelled() => break, _ = sleep => ping }`. not 
blocking.
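   
   roughly (sketch only -- the `watchdog_enabled()` / `notify()` call shapes are 
copied from this module, and `NotifyState::Watchdog` is assumed to exist in the 
crate version in use):
   
```rust
use tokio_util::sync::CancellationToken;

// cooperative variant: main.rs owns the token and cancels it from the
// existing SIGTERM select!, so the loop winds down instead of being
// reaped at process exit
pub fn spawn_watchdog(token: CancellationToken) {
    let Some(timeout) = sd_notify::watchdog_enabled() else {
        return;
    };
    let interval = timeout / 2;

    tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = token.cancelled() => break,
                _ = tokio::time::sleep(interval) => {
                    let _ = sd_notify::notify(&[sd_notify::NotifyState::Watchdog]);
                }
            }
        }
    });
}
```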



##########
core/server/src/shard/mod.rs:
##########
@@ -194,6 +199,10 @@ impl IggyShard {
         // Spawn shutdown handler
         compio::runtime::spawn(async move {
             let _ = stop_receiver.recv().await;
+            #[cfg(feature = "systemd")]
+            if shard_for_shutdown.id == 0 {
+                let _ = sd_notify::notify(&[sd_notify::NotifyState::Stopping]);
+            }
             shard_for_shutdown.trigger_shutdown().await;

Review Comment:
   `trigger_shutdown()`'s return value is discarded. 
`task_registry.graceful_shutdown(SHUTDOWN_TIMEOUT)` returns `false` if the 
timeout fires before all tasks drain (pending fsyncs, segment rotation, replica 
notifications), but systemd just sees a clean `STOPPING=1` + normal exit and 
journalctl logs "service stopped successfully" -- even when the 30s shutdown 
timeout dropped pending writes on the floor.
   
   minor observability gap: forward a `STATUS=shutdown timed out` (or 
`sd_notify::NotifyState::Errno(...)`) when graceful shutdown returns false. not 
blocking.
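   
   something like this would close it (sketch only -- assumes `trigger_shutdown()` 
surfaces the bool from `graceful_shutdown()`, and that the crate version in use 
has `NotifyState::Status`):
   
```rust
let clean = shard_for_shutdown.trigger_shutdown().await;
#[cfg(feature = "systemd")]
if shard_for_shutdown.id == 0 && !clean {
    // surface the truncated shutdown to systemd instead of reporting a
    // clean stop after dropping pending work
    let _ = sd_notify::notify(&[sd_notify::NotifyState::Status(
        "shutdown timed out before all tasks drained",
    )]);
}
```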



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
