edu05 commented on pull request #16487: URL: https://github.com/apache/flink/pull/16487#issuecomment-886227234
Hi @dmvk

> This task is really complex and requires lot of context, so I hope you won't have any hard feelings if we do this.

For the more complex tasks, is there a way the open source community can collaborate more closely with you? I feel like some diagrams would have been useful. Building in-house/private software is hard enough to require close-knit collaboration between team members; could we replicate that approach in open source?

> Since the 1.14 release is getting closer, and there is still lot of tasks to be done, I'd timebox this effort until Monday 26th (inclusive). If we're not able to make this work until then, I'd take this over, so we can move on to the next task.

There doesn't seem to be much choice, no hard feelings.

I've pushed the latest progress, mainly going back to your suggestion of putting both test cases into the same file where the RPC flow is executed, and changing `started` from a `boolean` to an `int`. The `int` is needed only because the `recover()` method is now called twice before the `FailingMapper` code is even executed.

At the moment the test gets stuck in `jobClient.getJobStatus()`. The reason is that the thread invoking the RPC is blocked at the `recover()` stage, so it never even gets to try to fulfill the call. I get that the idea is precisely to prove the RPC thread is free to make calls, but that isn't possible if it's the one that is stuck invoking `finishedRecovering.await()`. Images attached.

![bug2](https://user-images.githubusercontent.com/1392421/126906296-0ab39e4a-8d20-4371-ae4c-080e69da6359.png)

The thread getting blocked is `flink-akka.actor.default-dispatcher-8`, and it is that same thread that is expected to be invoking the RPC.

![debug3](https://user-images.githubusercontent.com/1392421/126906357-e2dea016-cb04-43cf-801b-8f765984ee42.png)

I tried switching a few things to no avail.
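
As a minimal sketch of the situation described above, using plain `java.util.concurrent` rather than Flink code: a single-threaded executor stands in for the one dispatcher thread, the first task stands in for `recover()` blocking on `finishedRecovering.await()`, and the second task stands in for serving `getJobStatus()`. The class name, the method `statusCallCompletes`, the 500 ms timeout, and the `"RUNNING"` return value are all hypothetical and only there for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockedRpcThreadSketch {

    // Returns true if the status call is served within the timeout.
    static boolean statusCallCompletes(boolean releaseLatchFirst) throws Exception {
        // Assumption: one thread models the single thread that serves the RPCs.
        ExecutorService rpcThread = Executors.newSingleThreadExecutor();
        CountDownLatch finishedRecovering = new CountDownLatch(1);
        try {
            // Stand-in for recover(): parks the only thread on the latch.
            rpcThread.submit(() -> {
                finishedRecovering.await();
                return null;
            });

            if (releaseLatchFirst) {
                finishedRecovering.countDown();
            }

            // Stand-in for the status RPC: queued behind the recover() task.
            Future<String> status = rpcThread.submit(() -> "RUNNING");
            try {
                status.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // The only thread is still parked in await(), so nothing ran.
                return false;
            }
        } finally {
            finishedRecovering.countDown(); // unblock so the executor can stop
            rpcThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("latch held:     " + statusCallCompletes(false));
        System.out.println("latch released: " + statusCallCompletes(true));
    }
}
```

With the latch held, the status call should time out; once the latch is released it should complete. That matches the symptom above: the call can only be served if `await()` happens on a thread other than the one expected to serve it.
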
If the idea is for the new call to `recover()` to solve said problem, then I'd expect `finishedRecovering.await()` to happen on the thread that's invoking the new call to `recover()`. But that only happens once, at the beginning of the test, and it cannot get stuck there, otherwise the test just loops indefinitely at that point. Should it get called a second time during recovery?

> Good job so far! ;)

Thanks, let's see if we can get this one in the bag on Monday, or today if you're available :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org