Sam Wheating created SPARK-52358: ------------------------------------ Summary: spark-submit returns non-zero exit code after successful driver pod creation Key: SPARK-52358 URL: https://issues.apache.org/jira/browse/SPARK-52358 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.1 Reporter: Sam Wheating
We encountered what looks like a rare failure mode today (first time we have seen this after submitting thousands of jobs to our kubernetes cluster per day) in which the spark-submit command returned a non-zero exit code, but the driver pod was successfully created. For context, we are using the kubeflow spark-operator which shells out to `spark-submit` and then watches the resulting driver pod for status. It looks like there was maybe a networking issue between the spark-submit process and the kube api-server, which caused an exception as spark-submit was watching the pod, which resulted in throwing an exception and returning a non-zero exit code. Full error log below: {code:java} 25/05/30 09:45:40 WARN WatchConnectionManager: Exec Failure java.net.SocketTimeoutException: timeout at okio.Okio$4.newTimeoutException(Okio.java:232) at okio.AsyncTimeout.exit(AsyncTimeout.java:285) at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355) at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227) at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215) at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: javax.net.ssl.SSLException: Socket closed at java.base/sun.security.ssl.Alert.createSSLException(Unknown Source) at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source) at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source) at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source) at java.base/sun.security.ssl.SSLSocketImpl.handleException(Unknown Source) at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source) at okio.Okio$2.read(Okio.java:140) at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ... 35 more Caused by: java.net.SocketException: Socket closed at java.base/java.net.SocketInputStream.read(Unknown Source) at java.base/java.net.SocketInputStream.read(Unknown Source) at java.base/sun.security.ssl.SSLSocketInputRecord.read(Unknown Source) at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source) at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source) at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source) ... 38 more Exception in thread \"main\" io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(WatchConnectionManager.java:208) at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:571) at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:221) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:211) at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Suppressed: java.lang.Throwable: waiting here at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:154) at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.waitUntilReady(WatchConnectionManager.java:341) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:818) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:791) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:84) at org.apache.spark.deploy.k8s.submit.Client.$anonfun$run$1(KubernetesClientApplication.scala:157) at scala.util.control.Breaks.breakable(Breaks.scala:42) at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:151) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207) at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207) at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.net.SocketTimeoutException: timeout at okio.Okio$4.newTimeoutException(Okio.java:232) at okio.AsyncTimeout.exit(AsyncTimeout.java:285) at okio.AsyncTimeout$2.read(AsyncTimeout.java:241) at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355) at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227) at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215) at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257) at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201) ... 4 more Caused by: javax.net.ssl.SSLException: Socket closed at java.base/sun.security.ssl.Alert.createSSLException(Unknown Source) at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source) at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source) at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source) at java.base/sun.security.ssl.SSLSocketImpl.handleException(Unknown Source) at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown Source) at okio.Okio$2.read(Okio.java:140) at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ... 35 more Caused by: java.net.SocketException: Socket closed at java.base/java.net.SocketInputStream.read(Unknown Source) at java.base/java.net.SocketInputStream.read(Unknown Source) at java.base/sun.security.ssl.SSLSocketInputRecord.read(Unknown Source) at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown Source) at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown Source) at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown Source) ... 38 more 25/05/30 09:45:40 INFO ShutdownHookManager: Shutdown hook called 25/05/30 09:45:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-c44f3062-d665-44b3-a36d-978abc375b83 {code} However, the driver pod started up correctly and the job ran to completion. This caused issues as the SparkOperator had identified the job as failed, so it was automatically re-submitted resulting in a duplicate run. I understand that this is a super old version of Spark, but I haven't seen anything in the issue trackers or changelogs for spark or fabric8 which indicate that this issue has been fixed, so I suspect that the issue might still occur on newer versions. I will follow up here with additional findings. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org