milenkovicm commented on code in PR #1212: URL: https://github.com/apache/datafusion-ballista/pull/1212#discussion_r2020121044
##########
ballista/scheduler/src/scheduler_server/grpc.rs:
##########
@@ -124,14 +128,36 @@ impl<T: 'static + AsLogicalPlan, U: 'static + AsExecutionPlan> SchedulerGrpc
             };

             let mut tasks = vec![];
+            let mut prepare_failed_jobs = HashMap::<String, Vec<TaskDescription>>::new();
             for (_, task) in schedulable_tasks {
-                match self.state.task_manager.prepare_task_definition(task) {
+                let job_id = task.partition.job_id.clone();
+                if prepare_failed_jobs.contains_key(&job_id) {
+                    prepare_failed_jobs.entry(job_id).or_default().push(task);
+                    continue;
+                }
+                match self
+                    .state
+                    .task_manager
+                    .prepare_task_definition(task.clone())
+                {
                     Ok(task_definition) => tasks.push(task_definition),
                     Err(e) => {
                         error!("Error preparing task definition: {:?}", e);
+                        prepare_failed_jobs.entry(job_id).or_default().push(task);
                     }
                 }
             }
+
+            unbind_prepare_failed_tasks(active_jobs, &prepare_failed_jobs).await;

Review Comment:
   This issue captures a very rare corner case, which should not happen in a properly configured cluster. For the sake of simplicity and understanding, can we just cancel the job (as long as the cluster state is consistent at the end)? If the only consequence of canceling a failed task is an error log, it may not be too big of a problem. What do you think?
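
   To make the suggestion concrete, here is a rough, self-contained sketch of the "just cancel the job" shape. It is not Ballista code: `Task`, `prepare`, and `prepare_or_mark_for_cancel` are hypothetical stand-ins, and the actual cancellation call is left to the caller. The idea is only that the first preparation failure marks the whole job for cancellation, so no per-task unbinding is needed.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a schedulable task; only the job id matters here.
#[derive(Debug, Clone)]
struct Task {
    job_id: String,
    partition: usize,
}

/// Hypothetical stand-in for task preparation: it fails for every job
/// listed in `broken_jobs` and otherwise returns a dummy definition.
fn prepare(task: &Task, broken_jobs: &[&str]) -> Result<String, String> {
    if broken_jobs.contains(&task.job_id.as_str()) {
        Err(format!("cannot prepare task for job {}", task.job_id))
    } else {
        Ok(format!("{}/{}", task.job_id, task.partition))
    }
}

/// Prepares tasks and collects the ids of jobs that should be cancelled.
/// Returns `(job_id, definition)` pairs for healthy jobs only, plus the
/// set of job ids whose preparation failed at least once.
fn prepare_or_mark_for_cancel(
    tasks: Vec<Task>,
    broken_jobs: &[&str],
) -> (Vec<(String, String)>, BTreeSet<String>) {
    let mut prepared = Vec::new();
    let mut jobs_to_cancel = BTreeSet::new();

    for task in tasks {
        // Once a job is marked, skip its remaining tasks entirely.
        if jobs_to_cancel.contains(&task.job_id) {
            continue;
        }
        match prepare(&task, broken_jobs) {
            Ok(definition) => prepared.push((task.job_id.clone(), definition)),
            Err(_) => {
                jobs_to_cancel.insert(task.job_id);
            }
        }
    }

    // Drop definitions prepared before their job was marked for cancellation.
    prepared.retain(|(job_id, _)| !jobs_to_cancel.contains(job_id));
    (prepared, jobs_to_cancel)
}
```

   The caller would launch the returned definitions and then cancel each job in the returned set through whatever cancellation path the scheduler already has. Whether that leaves the cluster state consistent is exactly the open question above.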