Hi Olivier,

This seems like a GKE-specific issue. Have you tried it on other vendors? Also,
on the kubelet nodes, did you notice any pressure on the DNS side?

Li


On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
> and sometimes while running these jobs a pretty bad thing happens: the
> driver (in cluster mode) gets scheduled on Kubernetes and launches many
> executor pods.
> So far so good, but the k8s "Service" associated with the driver does not
> seem to be propagated in terms of DNS resolution, so all the executors fail
> with a "spark-application-......cluster.svc.local" does not exist error.
>
> With all executors failing, the driver should fail too, but it considers
> this a "pending" initial allocation and stays stuck forever in a loop of
> "Initial job has not accepted any resources, please check Cluster UI".
>
> Has anyone else observed this kind of behaviour?
> We had it on 2.3.1, and I upgraded to 2.4.1, but the issue still seems to
> exist even after the "big refactoring" in the Kubernetes cluster scheduler
> backend.
>
> I can work on a fix / workaround, but I'd like to check with you on the
> proper way forward:
>
>    - Some processes (like the Airflow Helm recipe) rely on a "sleep 30s"
>    before launching the dependent pods (that could be added to
>    /opt/entrypoint.sh used in the Kubernetes packaging)
>    - We could add a simple step to the init container that attempts the
>    DNS resolution and fails after 60s if it did not succeed (a sketch
>    follows this list)
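>
> For the second option, here is a minimal sketch in Scala of the check that
> init step would perform, assuming a hypothetical awaitDns helper (the real
> entrypoint change would likely be shell, but the logic is the same):
>
>    import java.net.InetAddress
>
>    // Hypothetical helper: block until the driver Service hostname
>    // resolves, giving up after a timeout.
>    def awaitDns(host: String, timeoutMs: Long = 60000L,
>                 intervalMs: Long = 2000L): Boolean = {
>      val deadline = System.currentTimeMillis() + timeoutMs
>      while (System.currentTimeMillis() < deadline) {
>        try {
>          // Throws UnknownHostException until DNS has propagated.
>          InetAddress.getByName(host)
>          return true
>        } catch {
>          case _: java.net.UnknownHostException => Thread.sleep(intervalMs)
>        }
>      }
>      false
>    }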
>
> But these steps won't change the fact that the driver will stay stuck,
> thinking we're still within the initial allocation delay.
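>
> As a stopgap on the driver side, a watchdog could abort instead of looping
> forever. This is just a sketch using the public listener API, not a proper
> fix in the scheduler backend itself:
>
>    import java.util.concurrent.atomic.AtomicBoolean
>    import org.apache.spark.SparkContext
>    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}
>
>    // Hypothetical watchdog: exit the driver if no executor ever
>    // registers (e.g. because the Service DNS never resolved).
>    def failFastIfNoExecutors(sc: SparkContext,
>                              timeoutMs: Long = 120000L): Unit = {
>      val registered = new AtomicBoolean(false)
>      sc.addSparkListener(new SparkListener {
>        override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
>          registered.set(true)
>      })
>      val watchdog = new Thread(new Runnable {
>        override def run(): Unit = {
>          Thread.sleep(timeoutMs)
>          if (!registered.get()) {
>            System.err.println("No executor registered in time, aborting")
>            sc.stop()
>            sys.exit(1)
>          }
>        }
>      })
>      watchdog.setDaemon(true)
>      watchdog.start()
>    }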
>
> Thoughts?
>
> --
> *Olivier Girardot*
> o.girar...@lateral-thoughts.com
>
