gianm commented on code in PR #18254:
URL: https://github.com/apache/druid/pull/18254#discussion_r2216955916
##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java:
##########
@@ -1379,13 +1377,20 @@ private void contactWorkersForStage(
final boolean retryOnFailure
)
{
- // Sorted copy of target worker numbers to ensure consistent iteration
order.
+ // Sorted copy of target worker numbers to ensure a consistent iteration
order.
final List<Integer> workersCopy = Ordering.natural().sortedCopy(workers);
final List<String> workerIds = getWorkerIds();
final List<ListenableFuture<Void>> workerFutures = new
ArrayList<>(workersCopy.size());
try {
- workerManager.waitForWorkers(workers);
+ workerManager.waitForWorkers(
+ workers,
+ (workerTask, fault) -> {
+ throwIfNonRetriableFault(fault);
+ // no need to add it to the kernel manipulation queue since this
is the main controller thread calling this function.
+ addToRetryQueue(queryKernel, workerTask.getWorkerNumber(), fault);
Review Comment:
Is the scenario something like this?
- Worker fails at some point during a controller run (while `runInternal` is
in the midst of executing)
- `MSQWorkerTaskLauncher` notices the failure in its main loop, which runs
in another thread
- `MSQWorkerTaskLauncher` removes the worker from `fullyStartedTasks` and
calls `WorkerFailureListener#onFailure`. This queues up `addToRetryQueue` to
run in the controller thread
- Controller calls `workerManager.waitForWorkers` during that same
`runInternal` run
- The call to `workerManager.waitForWorkers` will never return because the
`onFailure` listener has not actually run yet, it's still just queued up. It is
waiting for a worker that has already failed and will never be relaunched
And I believe your fix makes it so the controller call to
`workerManager.waitForWorkers` actually calls into `addToRetryQueue` (through
`WorkerFailureListener#onFailure`). That call to `addToRetryQueue` causes a
relaunch of the failed worker, so `workerManager.waitForWorkers` can eventually
return normally.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]