I'm afraid I can't think of a solution. I don't see how this operation could either succeed or fail without anything being logged.

Is the cluster behaving normally afterwards? Could you check whether the numRunningJobs ticks down properly after the job was canceled?
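
One way to check this could look like the sketch below. It assumes the JobManager REST endpoint is reachable and that numRunningJobs is exposed under /jobmanager/metrics; the base URL and the helper names parse_metric and fetch_num_running_jobs are made up for illustration, so adjust them to your setup.

```python
import json
from urllib.request import urlopen


def parse_metric(payload, metric_id):
    """Return the value of metric_id from a metrics JSON payload, or None.

    Assumes the payload is a JSON list of {"id": ..., "value": ...} objects.
    """
    for entry in json.loads(payload):
        if entry.get("id") == metric_id:
            return entry.get("value")
    return None


def fetch_num_running_jobs(base_url):
    """Query the JobManager REST endpoint for numRunningJobs.

    base_url is a placeholder, e.g. "http://jobmanager-host:8081".
    """
    url = f"{base_url}/jobmanager/metrics?get=numRunningJobs"
    with urlopen(url, timeout=10) as resp:
        return parse_metric(resp.read().decode("utf-8"), "numRunningJobs")
```

Calling fetch_num_running_jobs once before and once after canceling a job should show whether the counter drops by one.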


On 22/11/2019 13:27, Pavel Potseluev wrote:
Hi Chesnay,
We archive jobs to an S3 file system. We don't configure throttling for write operations; as far as I know that isn't possible yet and will be implemented in FLINK-13251 <https://issues.apache.org/jira/browse/FLINK-13251>. Other write operations (such as saving checkpoints) work fine. But I don't see an archived job or any message about an archiving failure at all. It looks like Flink simply didn't try to save the job to the archive.
21.11.2019, 17:17, "Chesnay Schepler" <ches...@apache.org>:

    If the archiving fails there should be some log message, like
    "Failed to archive job" or "Could not archive completed job...".
    If nothing of the sort is logged my first instinct would be that
    the operation is being slowed down, _a lot_.
    Where are you archiving them to? Could it be the write operation
    is being throttled heavily?
    On 21/11/2019 13:48, Pavel Potseluev wrote:

        Hi Vino,
        Usually Flink archives jobs correctly, and the problem
        reproduces only rarely. So I don't think it is a problem with
        the configuration.
        Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was
        canceled:

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:52:13.294 [Checkpoint Timer] INFO
            org.apache.flink.runtime.checkpoint.CheckpointCoordinator
            - Triggering checkpoint 1872 @ 1574092333218 for job
            5ec264a20bb5005cdbd8e23a5e59f136.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:52:37.260 [flink-akka.actor.default-dispatcher-30] INFO
            org.apache.flink.runtime.checkpoint.CheckpointCoordinator
            - Completed checkpoint 1872 for job
            5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes in 23541
            ms).

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:13.314 [Checkpoint Timer] INFO
            org.apache.flink.runtime.checkpoint.CheckpointCoordinator
            - Triggering checkpoint 1873 @ 1574092393218 for job
            5ec264a20bb5005cdbd8e23a5e59f136.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU
            (5ec264a20bb5005cdbd8e23a5e59f136) switched from state
            RUNNING to CANCELLING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.279 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Source: Custom File Source (1/1)
            (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING
            to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Source: kafka-source-moderation-update-journal-autoru ->
            Filter -> Flat Map (1/2)
            (47656a3c4fc70e19622acca31267e41f) switched from RUNNING
            to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Source: kafka-source-moderation-update-journal-autoru ->
            Filter -> Flat Map (2/2)
            (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING
            to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            user-offers-statistics-init-from-file -> Map (1/2)
            (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING
            to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            user-offers-statistics-init-from-file -> Map (2/2)
            (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING
            to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru
            (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from
            RUNNING to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.280 [flink-akka.actor.default-dispatcher-40] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru
            (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from
            RUNNING to CANCELING.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.299 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Source: Custom File Source (1/1)
            (934d89cf3d7999b40225dd8009b5493c) switched from CANCELING
            to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Source: kafka-source-moderation-update-journal-autoru ->
            Filter -> Flat Map (1/2)
            (47656a3c4fc70e19622acca31267e41f) switched from CANCELING
            to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.300 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Source: kafka-source-moderation-update-journal-autoru ->
            Filter -> Flat Map (2/2)
            (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING
            to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.344 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            user-offers-statistics-init-from-file -> Map (2/2)
            (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING
            to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.345 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            user-offers-statistics-init-from-file -> Map (1/2)
            (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING
            to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.706 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru
            (1/2) (cfb291033df3f19c9745a6f2fd25e037) switched from
            CANCELING to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Keyed Reduce -> Map -> Sink: user-offers-statistics-autoru
            (2/2) (9ce7cd66199513fa97ac44d7617f0c83) switched from
            CANCELING to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.executiongraph.ExecutionGraph -
            Job bureau-user-offers-statistics-AUTORU-USERS_AUTORU
            (5ec264a20bb5005cdbd8e23a5e59f136) switched from state
            CANCELLING to CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.checkpoint.CheckpointCoordinator
            - Stopping checkpoint coordinator for job
            5ec264a20bb5005cdbd8e23a5e59f136.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.714 [flink-akka.actor.default-dispatcher-2] INFO
            o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
            - Shutting down

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.966 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore
            - Removing
            /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136
            from ZooKeeper

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:19.966 [cluster-io-thread-6] INFO
            org.apache.flink.runtime.checkpoint.CompletedCheckpoint -
            Checkpoint with ID 1872 at
            's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872'
            not discarded.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:20.044 [flink-akka.actor.default-dispatcher-2] INFO
            o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
            - Shutting down.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:20.045 [flink-akka.actor.default-dispatcher-2] INFO
            o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
            - Removing
            /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from
            ZooKeeper

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:53:20.259 [flink-akka.actor.default-dispatcher-2] INFO
            org.apache.flink.runtime.dispatcher.StandaloneDispatcher -
            Job 5ec264a20bb5005cdbd8e23a5e59f136 reached globally
            terminal state CANCELED.

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO
            org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl -
            Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

            771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18
            18:54:24.085 [flink-akka.actor.default-dispatcher-31] INFO
            org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl -
            Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

        21.11.2019, 06:07, "vino yang" <yanghua1...@gmail.com>
        <mailto:yanghua1...@gmail.com>:

            If everything is OK (your config options for the archive
            dir and history server are correct), Flink should archive
            the completed job.
            You said you did not find any exceptions in the log about
            failing to archive. Are there any other exceptions? Can
            you share the logs for your scenario?
            Best,
            Vino
            Pavel Potseluev <potsel...@yandex-team.ru
            <mailto:potsel...@yandex-team.ru>> wrote on Thu, Nov 21,
            2019 at 2:25 AM:

                Hi all,
                We occasionally see that Flink doesn't save
                information about a canceled job to the archive
                directory (configured by the jobmanager.archive.fs.dir
                property), and there are no exceptions in the log
                about archiving failing. This is a problem for our use
                case because our script for deploying jobs relies on
                the Flink history server to find the latest checkpoint
                for a job. Does Flink guarantee saving data to the
                archive? If so, any ideas why it sometimes doesn't
                work? The Flink version is 1.8.0.
                --
                Best regards,
                Pavel Potseluev
                Software developer, Yandex.Classifieds LLC


