cryptoe commented on code in PR #18254:
URL: https://github.com/apache/druid/pull/18254#discussion_r2214906895


##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java:
##########
@@ -1379,13 +1377,20 @@ private void contactWorkersForStage(
       final boolean retryOnFailure
   )
   {
-    // Sorted copy of target worker numbers to ensure consistent iteration 
order.
+    // Sorted copy of target worker numbers to ensure a consistent iteration 
order.
     final List<Integer> workersCopy = Ordering.natural().sortedCopy(workers);
     final List<String> workerIds = getWorkerIds();
     final List<ListenableFuture<Void>> workerFutures = new 
ArrayList<>(workersCopy.size());
 
     try {
-      workerManager.waitForWorkers(workers);
+      workerManager.waitForWorkers(
+          workers,
+          (workerTask, fault) -> {
+            throwIfNonRetriableFault(fault);
+            // no need to add it to the kernel manipulation queue since this 
is the main controller thread calling this function.
+            addToRetryQueue(queryKernel, workerTask.getWorkerNumber(), fault);

Review Comment:
   There are 2 blocking calls in the `workerManager` which is called by the 
main controller thread : 
   
   ```
   >  workerManager.launchWorkersIfNeeded(
   >  workerManager.waitForWorkers(
   ```
   
   In case the worker spawned fails for any reason , the worker manager thread 
calls the failure listener : 
https://github.com/apache/druid/blob/aa29d43eac515905836a9641eaa83f074e2cc8c0/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQWorkerTaskLauncher.java#L615
 which adds the retry event in the controller kernel queue. 
   
   Since we need access to work orders, we can only access that information via 
the controller thread due to the single threaded nature of the query kernel. 
   
   Since the main controller thread is blocked on `launchingTheWorker`,` 
waitingForWorker` to launch, we never really launch the worker since the retry 
event is never processed. 
   
   
   
   Let me try to find a way to pass the failure listener as part of the 
constructor
   
    
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to