Hi Pavel, Sorry for bringing this thread up so late. I was digging into the usage of the Flink history server and I found one situation where there would be no logs and no failure/success message from the cluster: In very rare case in our Flink-YARN session cluster: if an application master (JobManager running container) fails and being restarted as a YARN 2nd attempt (we haven't enable HA) - then there will be no logs of archiving being logged whatsoever. However in this case the there would be a completely new AM container brought up running the JM again (e.g. new log files)
I am not exactly sure whether this suites your scenarios. Could you describe a bit more on how your cluster was configured? Thanks, Rong On Mon, Nov 25, 2019 at 10:49 AM Chesnay Schepler <ches...@apache.org> wrote: > I'm afraid I can't think of a solution. I don't see a way how this > operation can succeed or fail without anything being logged. > > Is the cluster behaving normally afterwards? Could you check whether the > numRunningJobs > ticks down properly after the job was canceled? > > > On 22/11/2019 13:27, Pavel Potseluev wrote: > > Hi Chesnay, > > We archive jobs on s3 file system. We don't configure a throttling for > write operations and afaik it isn't possible now and will be implemented in > FLINK-13251 <https://issues.apache.org/jira/browse/FLINK-13251>. And > other write operations (like checkpoints saving) work fine. But I don't see > archived job or message about archiving failure at all. It looks like Flink > just didn't try to save job to archive. > > 21.11.2019, 17:17, "Chesnay Schepler" <ches...@apache.org> > <ches...@apache.org>: > > If the archiving fails there should be some log message, like "Failed to > archive job" or "Could not archive completed job..." . > If nothing of the sort is logged my first instinct would be that the > operation is being slowed down, _a lot_. > > Where are you archiving them to? Could it be the write operation is being > throttled heavily? > > On 21/11/2019 13:48, Pavel Potseluev wrote: > > Hi Vino, > > Usually Flink archives jobs correctly and the problem is rarely > reproduced. So I think it isn't a problem with configuration. > > Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled: > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint > Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - > Triggering checkpoint 1872 @ 1574092333218 for job > 5ec264a20bb5005cdbd8e23a5e59f136. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260 > [flink-akka.actor.default-dispatcher-30] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed > checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes > in 23541 ms). > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint > Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - > Triggering checkpoint 1873 @ 1574092393218 for job > 5ec264a20bb5005cdbd8e23a5e59f136. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > bureau-user-offers-statistics-AUTORU-USERS_AUTORU > (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to > CANCELLING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom > File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING > to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) > (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) > (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > user-offers-statistics-init-from-file -> Map (1/2) > (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > user-offers-statistics-init-from-file -> Map (2/2) > (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Keyed Reduce -> > Map -> Sink: user-offers-statistics-autoru (1/2) > (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280 > [flink-akka.actor.default-dispatcher-40] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Keyed Reduce -> > Map -> Sink: user-offers-statistics-autoru (2/2) > (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom > File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from > CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2) > (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2) > (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > user-offers-statistics-init-from-file -> Map (2/2) > (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - > user-offers-statistics-init-from-file -> Map (1/2) > (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Keyed Reduce -> > Map -> Sink: user-offers-statistics-autoru (1/2) > (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Keyed Reduce -> > Map -> Sink: user-offers-statistics-autoru (2/2) > (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job > bureau-user-offers-statistics-AUTORU-USERS_AUTORU > (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to > CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping > checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714 > [flink-akka.actor.default-dispatcher-2] INFO > o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore - Removing > /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from > ZooKeeper > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966 > [cluster-io-thread-6] INFO > org.apache.flink.runtime.checkpoint.CompletedCheckpoint - Checkpoint with > ID 1872 at > 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872' > not discarded. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044 > [flink-akka.actor.default-dispatcher-2] INFO > o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045 > [flink-akka.actor.default-dispatcher-2] INFO > o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing > /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259 > [flink-akka.actor.default-dispatcher-2] INFO > org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Job > 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 > [flink-akka.actor.default-dispatcher-31] INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Releasing idle > slot [0553df66161f5d78f4b41d8c8c32c21f]. > > 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085 > [flink-akka.actor.default-dispatcher-31] INFO > org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Releasing idle > slot [498b9bf0c0f2188ff739d72e6df288dc]. > > > 21.11.2019, 06:07, "vino yang" <yanghua1...@gmail.com> > <yanghua1...@gmail.com>: > > If everything is OK(your config options about archive dir and history > server is correct), Flink should archive the completed job. > > You said you did not find any exceptions in the log about failing to > archive. But any other exceptions? Can you share the logs about your scene? > > Best, > Vino > > Pavel Potseluev <potsel...@yandex-team.ru> 于2019年11月21日周四 上午2:25写道: > > Hi all, > > We see occasionally that flink doesn't save information about canceled job > to archive directory (configured by jobmanager.archive.fs.dir property). And > there are no exceptions in the log about failing archiving. It's a problem > in our use case because our script for deploying jobs relies on flink > history server to find latest checkpoint for some job. Does flink guarantee > saving data to archive? If so, any ideas why it doesn't work > sometimes? Flink version is 1.8.0. > > -- > Best regards, > Pavel Potseluev > Software developer, Yandex.Classifieds LLC > > > > > -- > Best regards, > Pavel Potseluev > Software developer, Yandex.Classifieds LLC > > > > > > > -- > Best regards, > Pavel Potseluev > Software developer, Yandex.Classifieds LLC > > > >