guyinyou opened a new issue, #9701: URL: https://github.com/apache/rocketmq/issues/9701
### Before Creating the Enhancement Request

- [x] I have confirmed that this should be classified as an enhancement rather than a bug/feature.

### Summary

Add a synchronous blocking wait to the shutdown of the metrics components, to prevent JVM crashes caused by race conditions during broker shutdown.

### Motivation

Currently, the metrics shutdown process in `BrokerMetricsManager` uses asynchronous operations without proper synchronization. This creates race conditions where:

1. Dependencies (such as `periodicMetricReader` and `metricExporter`) may shut down before the services that depend on them.
2. Services continue to access already-shutdown dependencies, causing JVM crashes.
3. Data may be lost because flush operations do not complete during shutdown.

This enhancement is critical for production stability, as JVM crashes during broker shutdown can lead to:

- Data corruption
- Incomplete metrics export
- Service unavailability
- Difficult troubleshooting in production environments

The enhancement benefits the entire RocketMQ community by ensuring graceful and reliable broker shutdowns, especially in high-throughput production environments where metrics collection is heavily used.

### Describe the Solution You'd Like

Implement a synchronous blocking wait for all metrics-related shutdown operations in `BrokerMetricsManager.shutdown()`:

1. **Replace async calls with sync blocking**: Convert all shutdown operations to block on the returned result with `CompletableResultCode.join(timeout, unit)` and an appropriate timeout.
2. **Ensure proper shutdown order**: Force each component to complete shutdown before proceeding to the next.
3. **Add retry mechanism**: Use while loops to retry failed operations until they succeed.
4. **Apply to all exporter types**: Implement the fix for the OTLP_GRPC, PROM, and LOG metrics exporters.

**Implementation details:**

- Use `join(Integer.MAX_VALUE, TimeUnit.DAYS)` to ensure completion
- Add `isSuccess()` checks to verify operation completion
- Maintain the same shutdown sequence but with proper synchronization
- Ensure `forceFlush()` completes before `shutdown()` for each component

**Code changes:**

```java
// Before (async - causes race conditions)
periodicMetricReader.forceFlush();
periodicMetricReader.shutdown();

// After (sync - prevents race conditions)
// Each loop blocks on the result and retries until the operation reports success.
while (!periodicMetricReader.forceFlush().join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
while (!periodicMetricReader.shutdown().join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
```

### Describe Alternatives You've Considered

1. **Add shutdown hooks**: Considered using JVM shutdown hooks, but this doesn't solve the core race condition and may introduce additional complexity.
2. **Implement timeout-based shutdown**: Instead of an infinite wait, use configurable timeouts. However, this could lead to incomplete shutdowns in slow environments and doesn't address the fundamental synchronization issue.
3. **Add dependency tracking**: Track component dependencies and shut down in reverse dependency order. This would be more complex and doesn't guarantee that async operations complete before dependencies are accessed.
4. **Use CountDownLatch or similar synchronization primitives**: While this could work, `CompletableResultCode.join()` is more appropriate for this use case because the result object is already returned by each async operation.
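To make alternative 2 concrete, a bounded wait could look roughly like the sketch below. It is only an illustration: it uses the JDK's `CompletableFuture<Boolean>` as a stand-in for the metrics SDK's result type, and `BoundedShutdownSketch` and `awaitWithin` are hypothetical names that do not exist in RocketMQ.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedShutdownSketch {

    /**
     * Waits at most the given time for one async shutdown step (hypothetical helper).
     * Returns true only if the step finished in time and reported success.
     */
    static boolean awaitWithin(CompletableFuture<Boolean> step, long timeout, TimeUnit unit) {
        try {
            return Boolean.TRUE.equals(step.get(timeout, unit));
        } catch (TimeoutException | ExecutionException | CancellationException e) {
            // The step is still running, failed, or was cancelled.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> done = CompletableFuture.completedFuture(true);
        CompletableFuture<Boolean> stuck = new CompletableFuture<>(); // never completed

        System.out.println(awaitWithin(done, 1, TimeUnit.SECONDS));        // prints "true"
        System.out.println(awaitWithin(stuck, 50, TimeUnit.MILLISECONDS)); // prints "false"
    }
}
```

The `false` result for the stuck step is the weakness noted for this alternative: once the timeout expires, the caller has to move on while the step may still be running.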
The chosen solution (synchronous blocking wait) is the most straightforward and reliable approach that directly addresses the root cause of the race condition without introducing unnecessa

### Additional Context

**Current Issue:**

- The broker shutdown process has race conditions in the metrics components
- JVM crashes occur when services access already-shutdown dependencies
- All metrics exporter types are affected (OTLP_GRPC, PROM, LOG)

**Environment:**

- RocketMQ 5.3.2-SNAPSHOT
- Java 8+ environments
- Production environments with high metrics throughput

**Testing:**

- The fix has been implemented and tested locally
- Commit: 5cd58a537f - "fix: synchronize metrics shutdown to prevent JVM crash"
- No breaking changes to existing APIs
- Backward compatible with existing configurations

**Related Components:**

- `org.apache.rocketmq.broker.metrics.BrokerMetricsManager`
- Metrics exporters (OTLP, Prometheus, Logging)
- Periodic metric readers

This enhancement is essential for production stability and should be prioritized for the next release.
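The ordering requirement in the proposed solution (flush, then shut down, one component at a time) can be sketched with JDK types only. This is a minimal illustration under stated assumptions, not the actual patch: `Component`, `runBlocking`, `shutdownInOrder`, and `recording` are hypothetical names, `CompletableFuture<Boolean>` stands in for the SDK's result type, and the retry count is bounded here purely to keep the example terminating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

class OrderedShutdownSketch {

    /** Hypothetical stand-in for a metrics component; not a RocketMQ or OpenTelemetry type. */
    interface Component {
        CompletableFuture<Boolean> forceFlush();
        CompletableFuture<Boolean> shutdown();
    }

    /** Starts one async step and blocks until it finishes, retrying up to maxAttempts times. */
    static boolean runBlocking(Supplier<CompletableFuture<Boolean>> step, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (Boolean.TRUE.equals(step.get().join())) {
                    return true;
                }
            } catch (CompletionException | CancellationException e) {
                // Treat an exceptional completion as a failed attempt and try again.
            }
        }
        return false;
    }

    /** Flushes and then shuts down each component, in list order, one at a time. */
    static boolean shutdownInOrder(List<Component> components, int maxAttempts) {
        for (Component c : components) {
            if (!runBlocking(c::forceFlush, maxAttempts)) return false;
            if (!runBlocking(c::shutdown, maxAttempts)) return false;
        }
        return true;
    }

    /** Builds a demo component whose steps run on a background thread and record their name. */
    static Component recording(String name, List<String> events) {
        return new Component() {
            public CompletableFuture<Boolean> forceFlush() {
                return CompletableFuture.supplyAsync(() -> events.add(name + ".forceFlush"));
            }
            public CompletableFuture<Boolean> shutdown() {
                return CompletableFuture.supplyAsync(() -> events.add(name + ".shutdown"));
            }
        };
    }

    static List<String> demo() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        shutdownInOrder(Arrays.asList(recording("reader", events), recording("exporter", events)), 3);
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        // prints [reader.forceFlush, reader.shutdown, exporter.forceFlush, exporter.shutdown]
        System.out.println(demo());
    }
}
```

Because every step is joined before the next one starts, the recorded order is deterministic even though each step runs on another thread, which is the property the enhancement asks for.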
