Thanks Piotr, I've created a JIRA issue to track it: https://issues.apache.org/jira/browse/FLINK-9143
Alex

On Thu, Apr 5, 2018 at 11:28 PM Piotr Nowojski <pi...@data-artisans.com> wrote:

> Hi,
>
> Thanks for the details! I can confirm this behaviour. The flink-conf.yaml
> restart-strategy value is being completely ignored (regardless of its
> value) when the user enables checkpointing:
>
> env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
>
> I suspect this is a bug, but I have to confirm it.
>
> Thanks, Piotrek
>
> On 5 Apr 2018, at 12:40, Alexander Smirnov <alexander.smirn...@gmail.com>
> wrote:
>
> jobmanager.log:
>
> *2018-04-05 22:37:28,348 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: restart-strategy, none*
> 2018-04-05 22:37:28,353 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
> 2018-04-05 22:37:28,506 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager without high-availability
> 2018-04-05 22:37:28,510 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on localhost:6123 with execution mode CLUSTER
> 2018-04-05 22:37:28,517 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
> 2018-04-05 22:37:28,546 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
> 2018-04-05 22:37:28,591 INFO org.apache.flink.runtime.jobmanager.JobManager - Trying to start actor system at localhost:6123
> 2018-04-05 22:37:28,981 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
> 2018-04-05 22:37:29,027 INFO akka.remote.Remoting - Starting remoting
> 2018-04-05 22:37:29,129 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@localhost:6123]
> 2018-04-05 22:37:29,135 INFO org.apache.flink.runtime.jobmanager.JobManager - Actor system started at akka.tcp://flink@localhost:6123
> 2018-04-05 22:37:29,148 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
> 2018-04-05 22:37:29,152 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
> 2018-04-05 22:37:29,161 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager log file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.log
> 2018-04-05 22:37:29,161 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of JobManager stdout file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.out
> 2018-04-05 22:37:29,162 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-901a3fb7-d366-4f90-b75c-1e1f8038ed37 for the web interface files
> 2018-04-05 22:37:29,162 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Created directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-21e5d8a8-7967-40f0-97d7-a803d9bd5913 for web frontend JAR file uploads.
> 2018-04-05 22:37:29,447 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at localhost:8081
> 2018-04-05 22:37:29,447 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
> 2018-04-05 22:37:29,452 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/blobStore-6777e862-0c2c-4679-a42f-b1921baa5236
> 2018-04-05 22:37:29,453 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:60697 - max concurrent requests: 50 - max backlog: 1000
> 2018-04-05 22:37:29,533 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
> 2018-04-05 22:37:29,533 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager.
> 2018-04-05 22:37:29,544 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2018-04-05 22:37:29,545 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@localhost:6123/user/jobmanager
> 2018-04-05 22:37:29,552 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-853250886] - leader session 00000000-0000-0000-0000-000000000000
> 2018-04-05 22:37:30,495 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - TaskManager f0b0370186ab3c865db63fe60ca68e08 has started.
> 2018-04-05 22:37:30,497 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.0.26 (akka.tcp://flink@mb-sr-asmirnov.local:60696/user/taskmanager) as 2972a72a7223e63bb5a4fedd159c0b78. Current number of registered hosts is 1. Current number of alive task slots is 1.
> 2018-04-05 22:38:29,355 INFO org.apache.flink.runtime.client.JobClient - Checking and uploading JAR files
> 2018-04-05 22:38:29,639 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job 43ecfe9cb258b7f624aad9868d306edb (Failed job).
> *2018-04-05 22:38:29,643 INFO org.apache.flink.runtime.jobmanager.JobManager - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=10000) for 43ecfe9cb258b7f624aad9868d306edb.*
> 2018-04-05 22:38:29,656 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart
>
> On Thu, Apr 5, 2018 at 10:35 PM Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>
>> Hi Piotr,
>>
>> I'm using Flink 1.4.2
>>
>> it's a standard flink distribution downloaded and unpacked.
>>
>> added the following lines to conf/flink-conf.yaml:
>> restart-strategy: none
>> state.backend: rocksdb
>> state.backend.fs.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-metadata
>> state.backend.rocksdb.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-rocksdb
>>
>> created new java project as described at
>> https://ci.apache.org/projects/flink/flink-docs-release-1.4/quickstart/java_api_quickstart.html
>>
>> here's the code:
>>
>> import java.util.Arrays;
>>
>> import org.apache.flink.api.common.functions.MapFunction;
>> import org.apache.flink.streaming.api.CheckpointingMode;
>> import org.apache.flink.streaming.api.datastream.DataStream;
>> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> public class FailedJob
>> {
>>     static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);
>>
>>     public static void main( String[] args ) throws Exception
>>     {
>>         final StreamExecutionEnvironment env =
>>                 StreamExecutionEnvironment.getExecutionEnvironment();
>>
>>         // Checkpoint every 5000 ms with exactly-once guarantees.
>>         env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
>>
>>         DataStream<String> stream =
>>                 env.fromCollection(Arrays.asList("test"));
>>
>>         // Fail on the first record so the restart behaviour can be observed.
>>         stream.map(new MapFunction<String, String>(){
>>             @Override
>>             public String map(String obj) {
>>                 throw new NullPointerException("NPE");
>>             }
>>         });
>>
>>         env.execute("Failed job");
>>     }
>> }
>>
>> attaching screenshots, please let me know if more info is needed
>>
>> Alex
>>
>> On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>
>>> Hi,
>>>
>>> Can you provide more details, like post your configuration/log
>>> files/screen shots from web UI and Flink version being used?
>>>
>>> Piotrek
>>>
>>> > On 5 Apr 2018, at 06:07, Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>>> >
>>> > Hello,
>>> >
>>> > I've defined restart strategy in flink-conf.yaml as none. WebUI / Job
>>> > Manager section confirms that.
>>> > But looks like this setting is disregarded.
>>> >
>>> > When I go into job's configuration in the WebUI, in the Execution
>>> > Configuration section I can see:
>>> > Max. number of execution retries: Restart with fixed delay (10000 ms). #2147483647 restart attempts.
>>> >
>>> > Do you think it is a bug?
>>> >
>>> > Alex
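
For readers following the thread, the behaviour reported above can be summarised as a small decision rule. The sketch below is a simplified model in plain Java, not Flink code: the class `RestartStrategyModel`, the method `effectiveStrategy`, and the string labels are all hypothetical names made up for illustration. It assumes (a) what the log excerpt shows, namely that with checkpointing enabled the job ends up with a fixed-delay strategy of 2147483647 attempts and a 10000 ms delay regardless of flink-conf.yaml, and (b) that a strategy set explicitly on the job takes precedence over both, which is how the Flink documentation describes job-level settings.

```java
import java.util.Optional;

/**
 * Simplified, hypothetical model of the restart-strategy resolution
 * observed in this thread. This is NOT Flink's implementation.
 */
public class RestartStrategyModel {

    /** Matches the values printed in the jobmanager.log excerpt. */
    static final String CHECKPOINTING_DEFAULT =
            "fixed-delay(attempts=" + Integer.MAX_VALUE + ", delayMs=10000)";

    /**
     * @param clusterConfigValue   value of restart-strategy in flink-conf.yaml
     * @param checkpointingEnabled whether the job called enableCheckpointing
     * @param jobLevelStrategy     strategy set explicitly in the job, if any
     */
    static String effectiveStrategy(String clusterConfigValue,
                                    boolean checkpointingEnabled,
                                    Optional<String> jobLevelStrategy) {
        // Assumption: an explicit job-level setting always wins.
        if (jobLevelStrategy.isPresent()) {
            return jobLevelStrategy.get();
        }
        // Observed in the thread: checkpointing replaces the cluster default.
        if (checkpointingEnabled) {
            return CHECKPOINTING_DEFAULT;
        }
        // Otherwise the cluster-wide configuration applies.
        return clusterConfigValue;
    }

    public static void main(String[] args) {
        // The reported case: "none" in flink-conf.yaml, checkpointing on.
        System.out.println(effectiveStrategy("none", true, Optional.empty()));
        // Same configuration without checkpointing.
        System.out.println(effectiveStrategy("none", false, Optional.empty()));
        // Explicit job-level override.
        System.out.println(effectiveStrategy("none", true, Optional.of("none")));
    }
}
```

If the model holds, a possible workaround until FLINK-9143 is resolved would be to set the strategy in the job itself, for example with `env.setRestartStrategy(RestartStrategies.noRestart())`, so the job no longer depends on the cluster default. This has not been verified in the thread.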