rmetzger edited a comment on pull request #14239: URL: https://github.com/apache/flink/pull/14239#issuecomment-734461109
Thanks a lot for addressing the issues I've reported. While testing this PR, I noticed that the job got stuck while submitting it:
```
2020-11-26 20:46:43,453 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Received JobGraph submission 362e7f6a2a9901e5d6d2ea69d40a69d4 (State machine job).
2020-11-26 20:46:43,453 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Submitting job 362e7f6a2a9901e5d6d2ea69d40a69d4 (State machine job).
2020-11-26 20:46:43,455 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_4 .
2020-11-26 20:46:43,455 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Initializing job State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4).
2020-11-26 20:46:43,456 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, backoffTimeMS=1000) for State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4).
2020-11-26 20:46:43,457 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Running initialization on master for job State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4).
2020-11-26 20:46:43,457 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Successfully ran initialization on master in 0 ms.
2020-11-26 20:46:43,492 INFO org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 0 ms
2020-11-26 20:46:43,492 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - No state backend has been configured, using default (Memory / JobManager) MemoryStateBackend (data in heap memory / checkpoints to JobManager) (checkpoints: 'null', savepoints: 'null', asynchronous: TRUE, maxStateSize: 5242880)
2020-11-26 20:46:43,492 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No checkpoint found during restore.
2020-11-26 20:46:43,493 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@530fdf01 for State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4).
2020-11-26 20:46:43,493 INFO org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl [] - JobManager runner for job State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@localhost:6123/user/rpc/jobmanager_4.
2020-11-26 20:46:43,493 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Starting execution of job State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4) under job master id 00000000000000000000000000000000.
2020-11-26 20:46:43,493 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Starting split enumerator for source Source: Kafka Source.
2020-11-26 20:46:43,501 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Closing SourceCoordinator for source Source: Kafka Source.
2020-11-26 20:46:43,502 INFO org.apache.flink.runtime.source.coordinator.SourceCoordinator [] - Source coordinator for source Source: Kafka Source closed.
2020-11-26 20:46:43,502 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Connecting to ResourceManager akka.tcp://flink@localhost:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-11-26 20:46:43,503 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Resolved ResourceManager address, beginning registration
2020-11-26 20:46:43,503 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_4 for job 362e7f6a2a9901e5d6d2ea69d40a69d4.
2020-11-26 20:46:43,504 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_4 for job 362e7f6a2a9901e5d6d2ea69d40a69d4.
2020-11-26 20:46:43,506 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
---- manual cancellation of the job -----
2020-11-26 20:59:12,933 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job State machine job (362e7f6a2a9901e5d6d2ea69d40a69d4) switched from state CREATED to CANCELLING.
2020-11-26 20:59:12,933 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Kafka Source (1/4) (f08af83bdc18ed82549cafcc97b747a4) switched from CREATED to CANCELING.
```
I'm not sure whether this problem is related to the scheduler or to your changes, but it looks odd that the source coordinator was closed again right away, about 8 ms after it started the split enumerator. The job then stayed in CREATED until I cancelled it manually. The problem is reproducible.
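For anyone comparing timings across repro runs, the gap between two log entries can be computed directly from their timestamps. This is only a minimal Python sketch for reading the log above; the helper name `log_gap_ms` is my own and is not part of Flink, and the timestamp layout is assumed from the excerpt.

```python
from datetime import datetime, timedelta

# Timestamp layout assumed from the log excerpt, e.g. "2020-11-26 20:46:43,493".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_gap_ms(start: str, end: str) -> float:
    """Return the elapsed time in milliseconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    # Dividing two timedeltas yields a plain float ratio.
    return (t1 - t0) / timedelta(milliseconds=1)


# "Starting split enumerator" vs. "Closing SourceCoordinator" in the log above.
coordinator_gap = log_gap_ms("2020-11-26 20:46:43,493", "2020-11-26 20:46:43,501")

# Registration at the ResourceManager vs. the manual cancellation.
stuck_for = log_gap_ms("2020-11-26 20:46:43,506", "2020-11-26 20:59:12,933")

print(f"coordinator closed after {coordinator_gap} ms")
print(f"job sat in CREATED for {stuck_for / 1000:.1f} s before cancellation")
```

The same helper can be pointed at the corresponding lines of a second run to check whether the coordinator is closed equally quickly there.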