stefan-egli commented on code in PR #12: URL: https://github.com/apache/sling-org-apache-sling-discovery-base/pull/12#discussion_r2056258606
##########
src/main/java/org/apache/sling/discovery/base/commons/BaseDiscoveryService.java:
##########
@@ -93,7 +99,37 @@ public TopologyView getTopology() {
                 .listInstances(localClusterView);
         topology.addInstances(attachedInstances);
+        // Check if topology changes should be delayed
+        if (topologyReadinessHandler != null && topologyReadinessHandler.shouldDelayTopologyChange(null)) {
+            logger.debug("getTopology: topology changes are delayed, returning old view");
+            return oldView;
+        }
+
         return topology;
     }

+    protected void handleTopologyEvent(TopologyEvent event) {

Review Comment:
   See the comment below about making TopologyReadinessHandler itself a TopologyEventListener. If that's not possible, then one open question is how to register this TopologyReadinessHandler as a TopologyEventListener. At the moment no one calls its handleTopologyEvent yet. This registration needs to be done carefully, as it sort of creates a circular dependency between DiscoveryService and TopologyReadinessHandler...

##########
src/main/java/org/apache/sling/discovery/base/commons/BaseDiscoveryService.java:
##########
@@ -93,7 +99,37 @@ public TopologyView getTopology() {
                 .listInstances(localClusterView);
         topology.addInstances(attachedInstances);
+        // Check if topology changes should be delayed
+        if (topologyReadinessHandler != null && topologyReadinessHandler.shouldDelayTopologyChange(null)) {
+            logger.debug("getTopology: topology changes are delayed, returning old view");
+            return oldView;
+        }
+
         return topology;
     }

+    protected void handleTopologyEvent(TopologyEvent event) {
+        if (event == null) {
+            return;
+        }
+
+        if (topologyReadinessHandler != null) {
+            if (topologyReadinessHandler.shouldDelayTopologyChange(event)) {
+                logger.debug("handleTopologyEvent: delaying topology event: {}", event);
+                return;
+            }
+
+            if (event.getType() == Type.TOPOLOGY_CHANGING) {
+                topologyReadinessHandler.startTopologyChange();
+            } else if (event.getType() == Type.TOPOLOGY_CHANGED) {
+                topologyReadinessHandler.endTopologyChange();
+            }
+        }
+
+        // Update old view when topology changes
+        if (event.getType() == Type.TOPOLOGY_CHANGED && event.getNewView() != null) {
+            setOldView((DefaultTopologyView) event.getNewView());
+        }

Review Comment:
   What is the intention of this part? It looks problematic: a TopologyEventListener is not expected to modify the DiscoveryService's state, and setOldView is actually already called by subclasses of BaseDiscoveryService. I think this is not needed (unless I'm missing something)...

##########
src/main/java/org/apache/sling/discovery/base/commons/TopologyReadinessHandler.java:
##########
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sling.discovery.base.commons;
+
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.atomic.AtomicReference;
+
+import org.apache.felix.hc.api.condition.SystemReady;
+import org.apache.sling.discovery.DiscoveryService;
+import org.apache.sling.discovery.TopologyEvent;
+import org.apache.sling.discovery.TopologyView;
+import org.osgi.service.component.annotations.Activate;
+import org.osgi.service.component.annotations.Component;
+import org.osgi.service.component.annotations.Deactivate;
+import org.osgi.service.component.annotations.Reference;
+import org.osgi.service.component.annotations.ReferenceCardinality;
+import org.osgi.service.component.annotations.ReferencePolicy;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.osgi.service.component.ComponentContext;
+
+/**
+ * Coordinates topology changes based on system readiness state.
+ * This component ensures that topology changes only occur when the system is in a stable state,
+ * both during startup and shutdown sequences.
+ *
+ * The handler manages three main states:
+ * 1. STARTUP: Initial state when the system is starting up
+ * 2. READY: System is ready for normal operation and topology changes
+ * 3. SHUTDOWN: System is in the process of shutting down
+ *
+ * State Transitions:
+ * - System starts in STARTUP state
+ * - Transitions to READY state only when SystemReady service is bound
+ * - Transitions to SHUTDOWN state when:
+ *   * SystemReady service is unbound
+ *   * Component is deactivated
+ *
+ * Note: This component requires the Felix SystemReady service to function properly.
+ * The system will remain in STARTUP state until SystemReady service is bound,
+ * and will transition to SHUTDOWN state when the service is unbound.
+ */
+@Component(service = TopologyReadinessHandler.class, immediate = true)
+public class TopologyReadinessHandler {
+
+    private final Logger logger = LoggerFactory.getLogger(this.getClass());
+
+    /**
+     * Represents the possible states of the system.
+     * State transitions are controlled by SystemReady service binding/unbinding
+     * and component lifecycle events.
+     */
+    private enum SystemState {
+        STARTUP,    // Initial state, waiting for SystemReady service
+        READY,      // System is ready for normal operation
+        SHUTDOWN    // System is shutting down
+    }
+
+    private final AtomicReference<SystemState> systemState = new AtomicReference<>(SystemState.STARTUP);
+    private final AtomicLong lastTopologyChangeTime = new AtomicLong(0);
+    private final AtomicBoolean topologyChangeInProgress = new AtomicBoolean(false);
+
+    private long delayDuration = 5000; // Default 5 second delay between topology changes
+    private long shutdownTimeout = 30000; // Default 30 second shutdown timeout
+
+    @Reference(cardinality = ReferenceCardinality.MANDATORY, policy = ReferencePolicy.STATIC)
+    private volatile SystemReady systemReady;
+
+    @Reference
+    private DiscoveryService discoveryService;
+
+    @Activate
+    protected void activate(ComponentContext context) {
+        logger.info("TopologyReadinessHandler activated - entering STARTUP state");
+        systemState.set(SystemState.STARTUP);
+    }
+
+    @Deactivate
+    protected void deactivate(ComponentContext context) {
+        logger.info("TopologyReadinessHandler deactivated");
+        initiateShutdown();
+    }
+
+    protected void bindSystemReady(SystemReady service) {
+        logger.debug("SystemReady service bound - transitioning to READY state");
+        if (systemState.compareAndSet(SystemState.STARTUP, SystemState.READY)) {
+            logger.info("System state changed to READY");
+        }
+    }
+
+    protected void unbindSystemReady(SystemReady service) {
+        logger.debug("SystemReady service unbound - initiating shutdown");
+        initiateShutdown();
+    }
+
+    /**
+     * Initiate the shutdown process
+     */
+    protected void initiateShutdown() {
+        if (systemState.compareAndSet(SystemState.READY, SystemState.SHUTDOWN) ||
+            systemState.compareAndSet(SystemState.STARTUP, SystemState.SHUTDOWN)) {
+            logger.info("Initiating shutdown process");
+
+            // Mark current view as not current
+            if (discoveryService != null) {
+                TopologyView currentView = discoveryService.getTopology();
+                if (currentView instanceof DefaultTopologyView) {
+                    logger.info("Marking current topology view as not current during shutdown");
+                    ((DefaultTopologyView) currentView).setNotCurrent();
+                }
+            }
+
+            // If shutdown timeout is disabled, don't wait
+            if (shutdownTimeout <= 0) {
+                return;
+            }
+
+            // Wait for running jobs to complete or timeout
+            long startTime = System.currentTimeMillis();
+            while (topologyChangeInProgress.get() &&
+                   (System.currentTimeMillis() - startTime) < shutdownTimeout) {
+                try {
+                    Thread.sleep(100);
+                } catch (InterruptedException e) {
+                    Thread.currentThread().interrupt();
+                    break;
+                }
+            }
+
+            if (topologyChangeInProgress.get()) {
+                logger.warn("Shutdown timeout reached while waiting for topology changes to complete");
+            } else {
+                logger.info("Shutdown completed successfully");
+            }
+        }
+    }
+
+    /**
+     * Set the shutdown timeout in milliseconds
+     * @param timeout the shutdown timeout in milliseconds
+     */
+    public void setShutdownTimeout(long timeout) {
+        this.shutdownTimeout = timeout;
+    }
+
+    /**
+     * Set the delay duration in milliseconds
+     * @param delayDuration the delay duration in milliseconds
+     */
+    public void setDelayDuration(long delayDuration) {
+        this.delayDuration = delayDuration;
+    }
+
+    /**
+     * Check if a topology change should be delayed based on system readiness
+     * @param event the topology event
+     * @return true if the change should be delayed, false otherwise
+     */
+    public boolean shouldDelayTopologyChange(TopologyEvent event) {

Review Comment:
   I didn't dig into this code too much, but one thing I'm wondering: the design of delaying TopologyEvents seems to rest on a very important assumption, namely that we go through the following phases:
   1. we are starting up and not ready yet => delay
   2. startup is finished and we are ready => no delay
   3. we are shutting down => delay

   We should only ever go from 1. to 2., then from 2. to 3. There should be no other way. It shouldn't be possible to go from 2. back to 1., or from 3. back to 2., etc. If that were required, then the design wouldn't work properly.

   So if we're saying that this is exactly the order we can go through, then I'd suggest actually writing code that reflects exactly that: a little state machine that can go through exactly those 3 states and no others. That can then also be unit-tested very well.

   On top of that little state machine would then come the hooks into activate/deactivate/bind/unbind etc. Those could then call into the state machine to make it do one of the only 2 possible transitions (and fail in any other case).

   I think such an intermediate state machine would make the code more robust and clearer. At the moment it seems to be more of a coincidence that only those 2 transitions are possible. It's not something that's clear from the code...

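To make the suggestion above concrete, here is a minimal sketch of such a state machine. It is not part of the PR: the class name `ReadinessStateMachine` and the methods `markReady`, `markShutdown`, `current` and `shouldDelay` are hypothetical names chosen for illustration, and the sketch assumes that STARTUP -> READY and READY -> SHUTDOWN are the only legal transitions, as described in the comment.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Hypothetical helper (not in the PR): a one-way state machine that only
 * allows STARTUP -> READY -> SHUTDOWN and rejects every other transition.
 */
final class ReadinessStateMachine {

    enum State { STARTUP, READY, SHUTDOWN }

    private final AtomicReference<State> state = new AtomicReference<>(State.STARTUP);

    /** STARTUP -> READY; fails from any other state. */
    void markReady() {
        transition(State.STARTUP, State.READY);
    }

    /** READY -> SHUTDOWN; fails from any other state. */
    void markShutdown() {
        transition(State.READY, State.SHUTDOWN);
    }

    State current() {
        return state.get();
    }

    /** Topology changes are delayed in every state except READY. */
    boolean shouldDelay() {
        return state.get() != State.READY;
    }

    // Atomically move from 'from' to 'to', or throw if the machine is not in 'from'.
    private void transition(State from, State to) {
        if (!state.compareAndSet(from, to)) {
            throw new IllegalStateException(
                    "Illegal transition " + state.get() + " -> " + to + " (expected " + from + ")");
        }
    }
}
```

The OSGi hooks (bind/unbind of SystemReady, deactivate) would then only call `markReady()` or `markShutdown()`. Note that the PR's `initiateShutdown()` also permits STARTUP -> SHUTDOWN, for example when the component is deactivated before SystemReady is bound; whether that should become a third explicit transition or be treated as an error is a decision this sketch leaves open.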
##########
src/main/java/org/apache/sling/discovery/base/commons/BaseDiscoveryService.java:
##########
@@ -93,7 +99,37 @@ public TopologyView getTopology() {
                 .listInstances(localClusterView);
         topology.addInstances(attachedInstances);
+        // Check if topology changes should be delayed
+        if (topologyReadinessHandler != null && topologyReadinessHandler.shouldDelayTopologyChange(null)) {
+            logger.debug("getTopology: topology changes are delayed, returning old view");
+            return oldView;
+        }
+
         return topology;
     }

+    protected void handleTopologyEvent(TopologyEvent event) {
+        if (event == null) {
+            return;
+        }
+
+        if (topologyReadinessHandler != null) {
+            if (topologyReadinessHandler.shouldDelayTopologyChange(event)) {
+                logger.debug("handleTopologyEvent: delaying topology event: {}", event);
+                return;
+            }
+
+            if (event.getType() == Type.TOPOLOGY_CHANGING) {
+                topologyReadinessHandler.startTopologyChange();
+            } else if (event.getType() == Type.TOPOLOGY_CHANGED) {
+                topologyReadinessHandler.endTopologyChange();
+            }

Review Comment:
   What about making TopologyReadinessHandler itself a TopologyEventListener? Then this indirection is not needed and things become a bit more straightforward.
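As a rough illustration of the suggestion above, the handler could track the change-in-progress flag directly from the events it receives. This is a sketch under assumptions, not the PR's code: `EventType` and `TopologyEventListenerSketch` are simplified stand-ins for `org.apache.sling.discovery.TopologyEvent.Type` and `TopologyEventListener` so that the example compiles on its own, `ReadinessHandlerSketch` is a hypothetical name, and it assumes that `startTopologyChange()` / `endTopologyChange()` (not shown in the quoted diff) simply set and clear `topologyChangeInProgress`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in (assumption) for org.apache.sling.discovery.TopologyEvent.Type.
enum EventType { TOPOLOGY_INIT, TOPOLOGY_CHANGING, TOPOLOGY_CHANGED, PROPERTIES_CHANGED }

// Stand-in (assumption) for org.apache.sling.discovery.TopologyEventListener;
// the real interface receives a TopologyEvent rather than a bare type.
interface TopologyEventListenerSketch {
    void handleTopologyEvent(EventType type);
}

/** Hypothetical handler that is itself the listener, so no forwarding from the DiscoveryService is needed. */
final class ReadinessHandlerSketch implements TopologyEventListenerSketch {

    private final AtomicBoolean topologyChangeInProgress = new AtomicBoolean(false);

    @Override
    public void handleTopologyEvent(EventType type) {
        if (type == EventType.TOPOLOGY_CHANGING) {
            // A change has started: remember it so shutdown can wait for it.
            topologyChangeInProgress.set(true);
        } else if (type == EventType.TOPOLOGY_CHANGED) {
            // The change has completed.
            topologyChangeInProgress.set(false);
        }
        // Other event types do not affect the in-progress flag in this sketch.
    }

    boolean isTopologyChangeInProgress() {
        return topologyChangeInProgress.get();
    }
}
```

In the real component this would presumably mean registering the handler as a `TopologyEventListener` service, assuming the usual whiteboard-style pickup of listeners by the discovery implementation, which would also avoid the explicit registration step raised in the first comment.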