[GitHub] [flink] edu05 commented on pull request #16487: [FLINK-22483][runtime][coordination] Recover checkpoints when JobMaster gains leadership

GitBox Sun, 25 Jul 2021 09:38:24 -0700


edu05 commented on pull request #16487:
URL: https://github.com/apache/flink/pull/16487#issuecomment-886227234

Hi @dmvk

> This task is really complex and requires lot of context, so I hope you
won't have any hard feelings if we do this.

For the more complex tasks, is there a way the open source community can
collaborate more closely with you? I feel like some diagrams would have been
useful. Building in-house/private software is hard enough that it requires
closely knit collaboration between team members; could we replicate that
approach in open source?

> Since the 1.14 release is getting closer, and there is still lot of tasks
to be done, I'd timebox this effort until Monday 26th (inclusive). If we're not
able to make this work until then, I'd take this over, so we can move on to the
next task.

There doesn't seem to be much choice, no hard feelings.

I've pushed the latest progress, mainly going back to your suggestion of
putting both test cases into the same file where the RPC flow is executed and
changing `started` from a `boolean` to an `int`. This is only because the
`recover()` method is now called twice before the `FailingMapper` code is even
executed.

At the moment the test gets stuck in `jobClient.getJobStatus()`. The reason
is that the thread that should serve the RPC is blocked at the `recover()`
stage, so it never even gets to try to fulfill the call. I understand the idea
is precisely to prove that the RPC thread is free to make calls, but that isn't
possible if it is the very thread that is stuck in
`finishedRecovering.await()`. Images attached.

![bug2](https://user-images.githubusercontent.com/1392421/126906296-0ab39e4a-8d20-4371-ae4c-080e69da6359.png)
The thread getting blocked is `flink-akka.actor.default-dispatcher-8` and it
is that same thread that is expected to be invoking the RPC.

![debug3](https://user-images.githubusercontent.com/1392421/126906357-e2dea016-cb04-43cf-801b-8f765984ee42.png)
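
To make the blocking described above concrete, here is a minimal, self-contained sketch. It does not use Flink or Akka: a single-threaded executor stands in for the RPC main thread, a second executor stands in for a hypothetical separate thread, and `statusCallCompletes` is an illustrative name rather than anything from the PR. It only models the symptom: if the latch wait runs on the one thread that must also answer the status call, the call times out.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class RpcThreadStarvationSketch {

    /**
     * Returns true if a "status" call submitted to the single RPC-like thread
     * completes within 500 ms while a recovery step is waiting on a latch.
     * All names here are hypothetical stand-ins, not Flink APIs.
     */
    static boolean statusCallCompletes(boolean recoverOnRpcThread) throws Exception {
        ExecutorService rpcThread = Executors.newSingleThreadExecutor();
        ExecutorService otherThread = Executors.newSingleThreadExecutor();
        CountDownLatch finishedRecovering = new CountDownLatch(1);
        try {
            // Stand-in for a recover() that blocks until the test releases it.
            Runnable recover = () -> {
                try {
                    finishedRecovering.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };
            (recoverOnRpcThread ? rpcThread : otherThread).submit(recover);

            // Stand-in for jobClient.getJobStatus(): it must run on the RPC thread.
            Future<String> status = rpcThread.submit(() -> "RUNNING");
            try {
                status.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // the RPC thread never got to serve the call
            }
        } finally {
            // Release the latch and stop both executors so the JVM can exit.
            finishedRecovering.countDown();
            rpcThread.shutdownNow();
            otherThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("recover() on RPC thread, status answered: "
                + statusCallCompletes(true));
        System.out.println("recover() on separate thread, status answered: "
                + statusCallCompletes(false));
    }
}
```

In this model the status call is only answered when the blocking wait happens somewhere other than the single RPC-like thread, which matches what the screenshots show for `flink-akka.actor.default-dispatcher-8`.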

I tried switching a few things, to no avail. If the idea is for the new call
to `recover()` to solve this problem, then I'd expect
`finishedRecovering.await()` to happen on the thread that invokes the new
call to `recover()`. But that only happens once, at the beginning of the test,
and it cannot get stuck there, otherwise the test just loops indefinitely at
that point. Should it get called a second time during recovery?

> Good job so far! ;)

Thanks, let's see if we can get this one in the bag on Monday, or today if
you're available :)

