Sam Wheating created SPARK-52358:
------------------------------------

             Summary: spark-submit returns non-zero exit code after successful driver pod creation
                 Key: SPARK-52358
                 URL: https://issues.apache.org/jira/browse/SPARK-52358
             Project: Spark
          Issue Type: Bug
          Components: Kubernetes
    Affects Versions: 3.1.1
            Reporter: Sam Wheating


We encountered what looks like a rare failure mode today (the first time we 
have seen it, despite submitting thousands of jobs per day to our Kubernetes 
cluster): the spark-submit command returned a non-zero exit code, but the 
driver pod was successfully created.

For context, we are using the kubeflow spark-operator which shells out to 
`spark-submit` and then watches the resulting driver pod for status.

It looks like there may have been a networking issue between the spark-submit 
process and the Kubernetes API server. The connection timed out while 
spark-submit was setting up its watch on the driver pod, and the resulting 
exception caused spark-submit to exit with a non-zero exit code.
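
To make the suspected sequence concrete, here is a toy model in Python. It is only a sketch of the ordering described above, not Spark's actual code: `run_submission`, `WatchError`, and the in-memory `cluster_pods` list are hypothetical names made up for illustration. The point is that pod creation and the watch are separate steps, so a failure in the second step can produce a non-zero exit code even though the first step already took effect.

```python
class WatchError(Exception):
    """Hypothetical stand-in for the KubernetesClientException in the log."""


def run_submission(create_pod, start_watch):
    """Toy submit client: create the driver pod, then watch it.

    Returns a process-style exit code: 0 on success, 1 if the watch fails.
    """
    try:
        create_pod()   # step 1: the pod now exists on the cluster
        start_watch()  # step 2: can fail independently of step 1
    except WatchError:
        return 1
    return 0


# In-memory stand-in for the cluster state (assumption for illustration).
cluster_pods = []


def create_pod():
    cluster_pods.append("driver")


def failing_watch():
    raise WatchError("Failed to start websocket")


exit_code = run_submission(create_pod, failing_watch)
```

In this model `exit_code` is non-zero while `cluster_pods` still contains the driver pod, which matches what we observed.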

Full error log below:
{code:java}
25/05/30 09:45:40 WARN WatchConnectionManager: Exec Failure
java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:232)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
        at 
okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
        at 
okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
        at 
okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: javax.net.ssl.SSLException: Socket closed
    at java.base/sun.security.ssl.Alert.createSSLException(Unknown Source)
    at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
        at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
    at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketImpl.handleException(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown 
Source)
    at okio.Okio$2.read(Okio.java:140)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
        ... 35 more
Caused by: java.net.SocketException: Socket closed
    at java.base/java.net.SocketInputStream.read(Unknown Source)
    at java.base/java.net.SocketInputStream.read(Unknown Source)
        at java.base/sun.security.ssl.SSLSocketInputRecord.read(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown 
Source)
        at 
java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown 
Source)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown 
Source)
    ... 38 more
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket
    at 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onFailure(WatchConnectionManager.java:208)
        at 
okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:571)
    at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:221)
        at okhttp3.RealCall$AsyncCall.execute(RealCall.java:211)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
        at java.base/java.lang.Thread.run(Unknown Source)
    Suppressed: java.lang.Throwable: waiting here
        at 
io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:154)
            at 
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.waitUntilReady(WatchConnectionManager.java:341)
            at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:818)
            at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:791)
            at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.watch(BaseOperation.java:84)
            at 
org.apache.spark.deploy.k8s.submit.Client.$anonfun$run$1(KubernetesClientApplication.scala:157)
            at scala.util.control.Breaks.breakable(Breaks.scala:42)
        at 
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:151)
            at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
            at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
            at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
        at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
            at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
            at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
            at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
            at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
            at 
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:232)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355)
        at 
okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227)
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215)
        at 
okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
        at 
okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:135)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.OIDCTokenRefreshInterceptor.intercept(OIDCTokenRefreshInterceptor.java:41)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:151)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at 
okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201)
    ... 4 more
Caused by: javax.net.ssl.SSLException: Socket closed
    at java.base/sun.security.ssl.Alert.createSSLException(Unknown Source)
    at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
        at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
    at java.base/sun.security.ssl.TransportContext.fatal(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketImpl.handleException(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(Unknown 
Source)
    at okio.Okio$2.read(Okio.java:140)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
        ... 35 more
Caused by: java.net.SocketException: Socket closed
    at java.base/java.net.SocketInputStream.read(Unknown Source)
    at java.base/java.net.SocketInputStream.read(Unknown Source)
        at java.base/sun.security.ssl.SSLSocketInputRecord.read(Unknown Source)
    at java.base/sun.security.ssl.SSLSocketInputRecord.readHeader(Unknown 
Source)
        at 
java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(Unknown 
Source)
    at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(Unknown 
Source)
    ... 38 more
25/05/30 09:45:40 INFO ShutdownHookManager: Shutdown hook called
25/05/30 09:45:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-c44f3062-d665-44b3-a36d-978abc375b83
{code}
However, the driver pod started up correctly and the job ran to completion.

This caused issues because the spark-operator had marked the job as failed and 
automatically resubmitted it, resulting in a duplicate run.
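
As a possible caller-side mitigation, a wrapper around spark-submit could check whether the driver pod exists before treating a non-zero exit code as a failed submission. The sketch below is a minimal illustration under that assumption; `should_resubmit` and the `driver_pod_exists` callback are hypothetical names, not part of Spark or the spark-operator, and the callback would need to be backed by a real lookup against the API server (for example by driver pod name).

```python
from typing import Callable


def should_resubmit(exit_code: int, driver_pod_exists: Callable[[], bool]) -> bool:
    """Decide whether a submission should be retried.

    exit_code: exit code returned by the spark-submit process.
    driver_pod_exists: hypothetical callback that reports whether the
        driver pod for this application is present on the cluster.
    """
    if exit_code == 0:
        # Submission succeeded; nothing to retry.
        return False
    # Non-zero exit: only retry if the driver pod was never created.
    return not driver_pod_exists()


# Usage with stubbed lookups (no cluster access needed):
retry_when_pod_missing = should_resubmit(1, lambda: False)
retry_when_pod_present = should_resubmit(1, lambda: True)
```

With this check, the case reported here (non-zero exit code, driver pod present) would not trigger a resubmission. The lookup itself can also fail during a network blip, so a real implementation would likely need its own retry or a conservative default.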

I understand that 3.1.1 is a very old version of Spark, but I haven't seen 
anything in the issue trackers or changelogs for Spark or fabric8 indicating 
that this issue has been fixed, so I suspect that it might still occur on 
newer versions.

I will follow up here with additional findings.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
