gianm commented on code in PR #19139:
URL: https://github.com/apache/druid/pull/19139#discussion_r2955054003


##########
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/DefaultK8sApiClient.java:
##########
@@ -174,11 +201,54 @@ public boolean hasNext() throws SocketTimeoutException
               Watch.Response<V1Pod> item = watch.next();
               if (item != null && item.type != null && 
!item.type.equals(WatchResult.BOOKMARK)) {
                 DiscoveryDruidNodeAndResourceVersion result = null;
+                String effectiveType = item.type;
+
                 if (item.object != null) {
-                  result = new DiscoveryDruidNodeAndResourceVersion(
-                    item.object.getMetadata().getResourceVersion(),
-                    getDiscoveryDruidNodeFromPodDef(nodeRole, item.object)
-                  );
+                  if (!isPodReady(item.object)) {
+                    if (WatchResult.MODIFIED.equals(item.type)) {
+                      // Pod was previously ready but is now unready (e.g., 
OOM-killed container).
+                      // Remap to NOT_READY to ensure the host is removed from 
discovery cache if is cached
+                      LOGGER.info(
+                          "Pod[%s] for role[%s] notified that it was modified 
and is now showing as not ready, "
+                          + "treating as removed for discovery purposes.",
+                          item.object.getMetadata().getName(),
+                          nodeRole
+                      );
+                      effectiveType = WatchResult.NOT_READY;
+                    } else if (WatchResult.ADDED.equals(item.type)) {
+                      // Pod is not ready yet (e.g., still starting up). Skip 
this event entirely.
+                      // It will appear via a MODIFIED event that remaps to 
ADDED for dicovery, once it becomes ready.

Review Comment:
   discovery (spelling)



##########
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/WatchResult.java:
##########
@@ -26,9 +26,18 @@
 public interface WatchResult

Review Comment:
   This class used to be a relatively thin layer over k8s watches, now there's 
remapping and synthetic events going on. The javadoc of this class should make 
clear that it's not meant to be a thin layer over k8s watches, it's meant to be 
aligned with the needs of service discovery.



##########
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java:
##########
@@ -273,6 +273,11 @@ private void keepWatching(String labelSelector, String 
resourceVersion)
                   case WatchResult.DELETED:
                     baseNodeRoleWatcher.childRemoved(item.object.getNode());

Review Comment:
   Can `DELETED` fire after `NOT_READY` (or even before `ADDED`, if a pod is 
deleted before it ever becomes ready)? I think the logging in those cases will 
end up noisy, since `childRemoved` complains loudly if the service doesn't 
exist. You added a `skipIfUnknown` that is used when `NOT_READY` fires, but I 
wonder if now we should always act like that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to