[ https://issues.apache.org/jira/browse/FLINK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731822#comment-17731822 ]
Wencong Liu commented on FLINK-32319:
-------------------------------------

Hi [~1026688210], maybe you could try the config "taskmanager.network.request-backoff.max: 20000", because its default value is 10000.

> Flink can't request the network partition after restart
> --------------------------------------------------
>
>                 Key: FLINK-32319
>                 URL: https://issues.apache.org/jira/browse/FLINK-32319
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.17.1
>         Environment: CentOS 7, JDK 8, Flink 1.17.1 in application mode on YARN.
> Flink configuration:
> ```
> $internal.application.program-args sql2
> $internal.deployment.config-dir /data/home/flink/wgcn/flink-1.17.1/conf
> $internal.yarn.log-config-file /data/home/flink/wgcn/flink-1.17.1/conf/log4j.properties
> akka.ask.timeout 100s
> blob.server.port 15402
> classloader.check-leaked-classloader false
> classloader.resolve-order parent-first
> env.java.opts.taskmanager -XX:+UseG1GC -XX:MaxGCPauseMillis=1000
> execution.attached true
> execution.checkpointing.aligned-checkpoint-timeout 10 min
> execution.checkpointing.externalized-checkpoint-retention RETAIN_ON_CANCELLATION
> execution.checkpointing.interval 10 min
> execution.checkpointing.min-pause 10 min
> execution.savepoint-restore-mode NO_CLAIM
> execution.savepoint.ignore-unclaimed-state false
> execution.shutdown-on-attached-exit false
> execution.target embedded
> high-availability zookeeper
> high-availability.cluster-id application_1684133071014_7202676
> high-availability.storageDir hdfs://xxxx/user/flink/recovery
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> internal.cluster.execution-mode NORMAL
> internal.io.tmpdirs.use-local-default true
> io.tmp.dirs /data1/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data3/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data4/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data5/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data6/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data7/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data8/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data9/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data10/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data11/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676,/data12/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676
> jobmanager.execution.failover-strategy region
> jobmanager.memory.heap.size 9261023232b
> jobmanager.memory.jvm-metaspace.size 268435456b
> jobmanager.memory.jvm-overhead.max 1073741824b
> jobmanager.memory.jvm-overhead.min 1073741824b
> jobmanager.memory.off-heap.size 134217728b
> jobmanager.memory.process.size 10240m
> jobmanager.rpc.address xxxx
> jobmanager.rpc.port 31332
> metrics.reporter.promgateway.deleteOnShutdown true
> metrics.reporter.promgateway.factory.class org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory
> metrics.reporter.promgateway.hostUrl xxxx:9091
> metrics.reporter.promgateway.interval 60 SECONDS
> metrics.reporter.promgateway.jobName join_phase3_v7
> metrics.reporter.promgateway.randomJobNameSuffix false
> parallelism.default 128
> pipeline.classpaths
> pipeline.jars file:/data2/nm-local-dir/usercache/flink/appcache/application_1684133071014_7202676/container_e16_1684133071014_7202676_01_000002/frauddetection-0.1.jar
> rest.address xxxx
> rest.bind-address xxxxx
> rest.bind-port 50000-50500
> rest.flamegraph.enabled true
> restart-strategy.failure-rate.delay 10 s
> restart-strategy.failure-rate.failure-rate-interval 1 min
> restart-strategy.failure-rate.max-failures-per-interval 6
> restart-strategy.type exponential-delay
> state.backend.type filesystem
> state.checkpoints.dir hdfs://xxxxxx/user/flink/checkpoints-data/wgcn
> state.checkpoints.num-retained 3
> taskmanager.memory.managed.fraction 0
> taskmanager.memory.network.max 600mb
> taskmanager.memory.process.size 10240m
> taskmanager.memory.segment-size 128kb
> taskmanager.network.memory.buffers-per-channel 8
> taskmanager.network.memory.floating-buffers-per-gate 800
> taskmanager.numberOfTaskSlots 2
> web.port 0
> web.tmpdir /tmp/flink-web-1b87445e-2761-4f16-97a1-8d4fc6fa8534
> yarn.application-attempt-failures-validity-interval 60000
> yarn.application-attempts 3
> yarn.application.name join_phase3_v7
> yarn.heartbeat.container-request-interval 700
> ```
>            Reporter: wgcn
>            Priority: Major
>         Attachments: image-2023-06-13-07-14-48-958.png
>
> Flink can't request the network partition after restart, which prevents the job from restoring.
> !image-2023-06-13-07-14-48-958.png!

-- This message was sent by Atlassian Jira (v8.20.10#820010)
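For context, a rough sketch of why raising `taskmanager.network.request-backoff.max` can help: assuming Flink-style capped exponential backoff for partition requests (delay doubles from an initial value and retries stop once the delay would exceed the configured maximum; this model, like the function name below, is an illustration rather than Flink's actual implementation), a higher cap buys extra retry rounds and a longer total window for the upstream partition to become available again:

```python
def partition_request_delays(initial_ms: int = 100, max_ms: int = 10_000) -> list[int]:
    """Sketch of a capped exponential backoff schedule (illustrative only).

    Assumes the delay doubles from `initial_ms` on each failed partition
    request and that retries stop once the delay would exceed `max_ms`.
    """
    delays = []
    delay = initial_ms
    while delay <= max_ms:
        delays.append(delay)
        delay *= 2  # exponential growth until the cap is exceeded
    return delays

# With the default cap of 10000 ms the schedule ends at 6400 ms;
# raising the cap to 20000 ms (as suggested in the comment) adds one
# more retry round and roughly doubles the total waiting window.
print(partition_request_delays(100, 10_000))
print(partition_request_delays(100, 20_000))
```

Under this model, the suggested change from 10000 to 20000 gives the restarted producer noticeably more time before the consumer gives up with a partition-not-found failure.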