[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731822#comment-17731822 ]
Wencong Liu commented on FLINK-32319:
-------------------------------------

Hi [~1026688210], maybe you could try the config "taskmanager.network.request-backoff.max: 20000", because its default value is 10000.

> Flink can't request the network partition after restart
> --------------------------------------------------
>
>                 Key: FLINK-32319
>                 URL: https://issues.apache.org/jira/browse/FLINK-32319
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.1
>         Environment: CentOS 7, JDK 8, Flink 1.17.1 in application mode on YARN.
> Flink configuration:
> ```
> $internal.application.program-args sql2
> $internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout 100s
> blob.server.port 15402
> classloader.check-leaked-classloader false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000
> execution.attached true
> execution.checkpointing.aligned-checkpoint-timeout 10 min
> execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
> execution.checkpointing.interval 10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode NO_CLAIM
> execution.savepoint.ignore-unclaimed-state false
> execution.shutdown-on-attached-exit false
> execution.target embedded
> high-availability zookeeper
> high-availability.cluster-id application_1684133071014_7202676
> high-availability.storageDir hdfs://xxxx/user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> internal.cluster.execution-mode NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategy region
> jobmanager.memory.heap.size 9261023232b
> jobmanager.memory.jvm-metaspace.size 268435456b
> jobmanager.memory.jvm-overhead.max 1073741824b
> jobmanager.memory.jvm-overhead.min 1073741824b
> jobmanager.memory.off-heap.size 134217728b
> jobmanager.memory.process.size 10240m
> jobmanager.rpc.address xxxx
> jobmanager.rpc.port 31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl xxxx:9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix false
> parallelism.default 128
> pipeline.classpaths
> pipeline.jars file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
> rest.address xxxx
> rest.bind-address xxxxx
> rest.bind-port 50000-50500
> rest.flamegraph.enabled true
> restart-strategy.failure-rate.delay 10 s
> restart-strategy.failure-rate.failure-rate-interval 1 min
> restart-strategy.failure-rate.max-failures-per-interval 6
> restart-strategy.type exponential-delay
> state.backend.type filesystem
> state.checkpoints.dir hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained 3
> taskmanager.memory.managed.fraction 0
> taskmanager.memory.network.max 600mb
> taskmanager.memory.process.size 10240m
> taskmanager.memory.segment-size 128kb
> taskmanager.network.memory.buffers-per-channel 8
> taskmanager.network.memory.floating-buffers-per-gate 800
> taskmanager.numberOfTaskSlots 2
> web.port 0
> web.tmpdir /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
> yarn.application-attempt-failures-validity-interval 60000
> yarn.application-attempts 3
> yarn.application.name join_phase3_v7
> yarn.heartbeat.container-request-interval 700
> ```
>            Reporter: wgcn
>            Priority: Major
>         Attachments: image-2023-06-13-07-14-48-958.png
>
> Flink can't request the network partition after restart, which prevents the job from restoring.
> !image-2023-06-13-07-14-48-958.png!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
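For context, a rough sketch of why raising `taskmanager.network.request-backoff.max` can help: assuming Flink-style capped exponential backoff for partition requests (delay doubles from an initial value and retries stop once the delay would exceed the configured maximum; this model, like the function name below, is an illustration rather than Flink's actual implementation), a higher cap buys extra retry rounds and a longer total window for the upstream partition to become available again:

```python
def partition_request_delays(initial_ms: int = 100, max_ms: int = 10_000) -> list[int]:
    """Sketch of a capped exponential backoff schedule (illustrative only).

    Assumes the delay doubles from `initial_ms` on each failed partition
    request and that retries stop once the delay would exceed `max_ms`.
    """
    delays = []
    delay = initial_ms
    while delay <= max_ms:
        delays.append(delay)
        delay *= 2  # exponential growth until the cap is exceeded
    return delays

# With the default cap of 10000 ms the schedule ends at 6400 ms;
# raising the cap to 20000 ms (as suggested in the comment) adds one
# more retry round and roughly doubles the total waiting window.
print(partition_request_delays(100, 10_000))
print(partition_request_delays(100, 20_000))
```

Under this model, the suggested change from 10000 to 20000 gives the restarted producer noticeably more time before the consumer gives up with a partition-not-found failure.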