[ 
https://issues.apache.org/jira/browse/FLINK-25316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459683#comment-17459683
 ] 

Robert Metzger commented on FLINK-25316:
----------------------------------------

I reverted the commit from FLINK-24156, but I'm still facing the same issue.

{code}
"AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #94 daemon prio=5 os_prio=0 tid=0x0000004017d57000 nid=0x2f1 in Object.wait() [0x000000402671e000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000d6bba0f8> (a org.apache.flink.runtime.blob.BlobServer)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x00000000d6bba0f8> (a org.apache.flink.runtime.blob.BlobServer)
        at java.lang.Thread.join(Thread.java:1326)
        at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:318)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406)
        - locked <0x00000000d5d27630> (a java.lang.Object)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505)
        at org.apache.flink.runtime.entrypoint.ClusterEntrypoint$$Lambda$1102/855423197.get(Unknown Source)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$composeAfterwards$20(FutureUtils.java:728)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1085/270874580.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$null$19(FutureUtils.java:739)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$1094/602149412.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent.lambda$closeAsyncInternal$2(DispatcherResourceManagerComponent.java:198)
        at org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent$$Lambda$1129/841631053.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture.completeFuture(FutureUtils.java:1000)
        - locked <0x00000000c1e39828> (a java.lang.Object)
        at org.apache.flink.util.concurrent.FutureUtils$CompletionConjunctFuture$$Lambda$528/169049466.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$575/1712666248.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1389)
        at org.apache.flink.util.concurrent.FutureUtils.lambda$forwardTo$24(FutureUtils.java:1372)
        at org.apache.flink.util.concurrent.FutureUtils$$Lambda$575/1712666248.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$571/1252843198.run(Unknown Source)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(ClassLoadingUtils.java:41)
        at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils$$Lambda$569/571928572.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

"BLOB Server listener at 6124" #29 daemon prio=5 os_prio=0 tid=0x000000401e30d800 nid=0x2b3 runnable [0x0000004025ff8000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
        at java.net.ServerSocket.implAccept(ServerSocket.java:560)
        at java.net.ServerSocket.accept(ServerSocket.java:528)
        at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:267)
{code}
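
For context on the interrupt question, here is a minimal, standalone Java sketch of the general JDK behaviour involved: a platform thread blocked in ServerSocket.accept() does not react to Thread.interrupt(), while closing the socket from another thread normally makes accept() throw a SocketException. This is not Flink's BlobServer code. The class AcceptInterruptDemo, its Result holder, and the sleep/join timeouts are assumptions made purely for illustration, and the sketch does not by itself explain why the listener in the dump above is still sitting in accept().

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical demo class for illustration only; not part of Flink.
final class AcceptInterruptDemo {

    /** Records whether the listener thread exited after each shutdown attempt. */
    static final class Result {
        final boolean exitedAfterInterrupt;
        final boolean exitedAfterClose;

        Result(boolean exitedAfterInterrupt, boolean exitedAfterClose) {
            this.exitedAfterInterrupt = exitedAfterInterrupt;
            this.exitedAfterClose = exitedAfterClose;
        }
    }

    static Result run() throws IOException, InterruptedException {
        // Port 0 asks the OS for any free port; no client ever connects.
        ServerSocket serverSocket = new ServerSocket(0);

        Thread listener = new Thread(() -> {
            try {
                // Blocks until a connection arrives or the socket is closed.
                serverSocket.accept().close();
            } catch (IOException e) {
                // Reached when the server socket is closed underneath accept().
            }
        }, "demo-listener");
        listener.setDaemon(true);
        listener.start();

        // Give the listener time to enter accept() (arbitrary delay).
        Thread.sleep(200);

        // Attempt 1: interrupt only, then wait a bounded time for the thread.
        listener.interrupt();
        listener.join(500);
        boolean exitedAfterInterrupt = !listener.isAlive();

        // Attempt 2: close the socket, then wait a bounded time again.
        serverSocket.close();
        listener.join(2000);
        boolean exitedAfterClose = !listener.isAlive();

        return new Result(exitedAfterInterrupt, exitedAfterClose);
    }

    public static void main(String[] args) throws Exception {
        Result r = run();
        System.out.println("exited after interrupt: " + r.exitedAfterInterrupt);
        System.out.println("exited after close:     " + r.exitedAfterClose);
    }
}
```

Both waits use a bounded join(timeout) so the demo itself cannot hang the way the unbounded Thread.join() in the first dump does.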


> BlobServer can get stuck during shutdown
> ----------------------------------------
>
>                 Key: FLINK-25316
>                 URL: https://issues.apache.org/jira/browse/FLINK-25316
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Robert Metzger
>            Priority: Critical
>             Fix For: 1.15.0
>
>
> The cluster shutdown can get stuck
> {code}
> "AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #89 daemon prio=5 os_prio=0 tid=0x0000004017d70000 nid=0x2ec in Object.wait() [0x000000402a9b5000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
>       at java.lang.Thread.join(Thread.java:1252)
>       - locked <0x00000000d6c48368> (a org.apache.flink.runtime.blob.BlobServer)
>       at java.lang.Thread.join(Thread.java:1326)
>       at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:319)
>       at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406)
>       - locked <0x00000000d5d27350> (a java.lang.Object)
>       at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505
> {code}
> because the BlobServer.run() method ignores interrupts:
> {code}
> "BLOB Server listener at 6124" #30 daemon prio=5 os_prio=0 tid=0x000000401c929800 nid=0x2b4 runnable [0x00000040263f9000]
>    java.lang.Thread.State: RUNNABLE
>       at java.net.PlainSocketImpl.socketAccept(Native Method)
>       at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
>       at java.net.ServerSocket.implAccept(ServerSocket.java:560)
>       at java.net.ServerSocket.accept(ServerSocket.java:528)
>       at org.apache.flink.util.NetUtils.acceptWithoutTimeout(NetUtils.java:143)
>       at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:268)
> {code}
> This issue was introduced in FLINK-24156 and first mentioned in 
> https://issues.apache.org/jira/browse/FLINK-24113?focusedCommentId=17459414&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17459414



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
