Hi Olivier,

This seems like a GKE-specific issue? Have you tried it on other vendors? Also, on the kubelet nodes, did you notice any pressure on the DNS side?
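If it helps, on GKE something along these lines would give a quick first look at whether the cluster DNS pods are under pressure (this is just a sketch; the k8s-app=kube-dns label is the one GKE uses for its DNS pods, and "top" needs metrics-server enabled):

  # Are the DNS pods healthy / restarting?
  kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
  # CPU / memory usage of the DNS pods
  kubectl -n kube-system top pods -l k8s-app=kube-dns
  # Any errors or throttling in the DNS pod logs?
  kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100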
Li

On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
> and sometimes while running these jobs a pretty bad thing happens: the
> driver (in cluster mode) gets scheduled on Kubernetes and launches many
> executor pods.
> So far so good, but the k8s "Service" associated with the driver does not
> seem to be propagated in terms of DNS resolution, so all the executors fail
> with "spark-application-......cluster.svc.local does not exist".
>
> With all executors failing, the driver should be failing too, but it
> considers that it's a "pending" initial allocation and stays stuck forever
> in a loop of "Initial job has not accepted any resources, please check
> Cluster UI".
>
> Has anyone else observed this kind of behaviour?
> We had it on 2.3.1 and I upgraded to 2.4.1, but this issue still seems to
> exist even after the "big refactoring" in the Kubernetes cluster scheduler
> backend.
>
> I can work on a fix / workaround, but I'd like to check the proper way
> forward with you:
>
> - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
> before launching the dependent pods (that could be added to
> /opt/entrypoint.sh used in the Kubernetes packaging)
> - We can add a simple step to the init container trying to do the DNS
> resolution and failing after 60s if it did not work
>
> But these steps won't change the fact that the driver will stay stuck
> thinking we're still in the case of the initial allocation delay.
>
> Thoughts?
>
> --
> *Olivier Girardot*
> o.girar...@lateral-thoughts.com
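For the init-container / entrypoint DNS check mentioned above, something along these lines might be enough as a first cut. This is only a rough sketch, and DRIVER_SVC_HOST is a placeholder for however the executor pod gets the driver's service hostname (not an actual Spark variable):

  # Wait for the driver service name to resolve; fail after ~60s so the
  # pod exits instead of hanging. getent works in the Debian-based images;
  # nslookup would be the equivalent on busybox/Alpine.
  wait_for_dns() {
    host="$1"
    timeout=60
    waited=0
    until getent hosts "$host" > /dev/null 2>&1; do
      if [ "$waited" -ge "$timeout" ]; then
        echo "DNS for $host did not resolve within ${timeout}s" >&2
        return 1
      fi
      sleep 2
      waited=$((waited + 2))
    done
  }

  # DRIVER_SVC_HOST is a placeholder for the driver service hostname
  wait_for_dns "$DRIVER_SVC_HOST" || exit 1

As you say though, that only avoids the crash-loop on the executor side; it doesn't help the driver get out of the "initial allocation" state.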