westhide commented on code in PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#discussion_r2020084368
##########
ballista/scheduler/src/scheduler_server/grpc.rs:
##########
@@ -124,14 +128,36 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> SchedulerGrpc
             };

             let mut tasks = vec![];
+            let mut prepare_failed_jobs = HashMap::<String, Vec<TaskDescription>>::new();
             for (_, task) in schedulable_tasks {
-                match self.state.task_manager.prepare_task_definition(task) {
+                let job_id = task.partition.job_id.clone();
+                if prepare_failed_jobs.contains_key(&job_id) {
+                    prepare_failed_jobs.entry(job_id).or_default().push(task);
+                    continue;
+                }
+                match self
+                    .state
+                    .task_manager
+                    .prepare_task_definition(task.clone())
+                {
                     Ok(task_definition) => tasks.push(task_definition),
                     Err(e) => {
                         error!("Error preparing task definition: {:?}", e);
+                        prepare_failed_jobs.entry(job_id).or_default().push(task);
                     }
                 }
             }
+
+            unbind_prepare_failed_tasks(active_jobs, &prepare_failed_jobs).await;

Review Comment:
When the task manager executes `prepare_task_definition`, it sets `task_info` for the task in the running stage. If preparation fails, the task is never sent to the Executor, but its `task_info` stays set. Without `unbind_prepare_failed_tasks` resetting that `task_info` to `None`, a later attempt by the Scheduler to cancel the job sends a stop-task event to the Executor for that task, which causes a task-stop failure error in the Executor's log.
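
To make the intent of the change easier to follow, here is a minimal, self-contained sketch of the same idea. It is not the PR's code: `Task`, `prepare`, `split_prepared`, `unbind`, and the per-job `slots` map are hypothetical stand-ins for `TaskDescription`, `prepare_task_definition`, the loop in the diff, `unbind_prepare_failed_tasks`, and the running stage's `task_info` entries. The failure condition in `prepare` is invented purely for illustration.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for Ballista's `TaskDescription`.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    job_id: String,
    partition: usize,
}

// Hypothetical stand-in for `prepare_task_definition`.
// It fails for one arbitrary task so the error path can be exercised.
fn prepare(task: &Task) -> Result<String, String> {
    if task.job_id == "job-b" && task.partition == 0 {
        Err(format!("cannot prepare task for {}", task.job_id))
    } else {
        Ok(format!("{}/{}", task.job_id, task.partition))
    }
}

// Mirrors the loop in the diff: once one task of a job fails preparation,
// the remaining tasks of that job skip preparation and are collected too.
fn split_prepared(tasks: Vec<Task>) -> (Vec<String>, HashMap<String, Vec<Task>>) {
    let mut prepared = Vec::new();
    let mut failed: HashMap<String, Vec<Task>> = HashMap::new();
    for task in tasks {
        if let Some(bucket) = failed.get_mut(&task.job_id) {
            bucket.push(task);
            continue;
        }
        match prepare(&task) {
            Ok(definition) => prepared.push(definition),
            Err(_) => failed.entry(task.job_id.clone()).or_default().push(task),
        }
    }
    (prepared, failed)
}

// Hypothetical model of the unbind step: each job has one optional
// `task_info` slot per partition, and failed tasks get theirs cleared.
fn unbind(
    slots: &mut HashMap<String, Vec<Option<String>>>,
    failed: &HashMap<String, Vec<Task>>,
) {
    for (job_id, tasks) in failed {
        if let Some(stage) = slots.get_mut(job_id) {
            for task in tasks {
                if let Some(slot) = stage.get_mut(task.partition) {
                    *slot = None;
                }
            }
        }
    }
}
```

In this model, a later cancel that only stops tasks whose slot is `Some(..)` would skip the tasks that were never dispatched, which is the behavior the review comment is arguing for.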