edu05 commented on pull request #16487:
URL: https://github.com/apache/flink/pull/16487#issuecomment-886227234


   Hi @dmvk 
   
   > This task is really complex and requires lot of context, so I hope you 
won't have any hard feelings if we do this.
   
   For the more complex tasks, is there a way the open source community can 
collaborate more closely with you? I feel like some diagrams would have been 
useful. Building in-house/private software is hard enough to require close-knit 
collaboration between team members; could we replicate that approach in 
open source?
   
   
   > Since the 1.14 release is getting closer, and there is still lot of tasks 
to be done, I'd timebox this effort until Monday 26th (inclusive). If we're not 
able to make this work until then, I'd take this over, so we can move on to the 
next task.
   
   There doesn't seem to be much choice, no hard feelings.
   
   
   I've pushed the latest progress, mainly going back to your suggestion of 
putting both test cases into the same file where the RPC flow is executed, and 
changing `started` from a `boolean` to an `int`. The `int` is needed only 
because the `recover()` method is now called twice before the `FailingMapper` 
code is even executed.
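
   A minimal sketch of why a counter helps when `recover()` runs more than once. The class and method names here are hypothetical stand-ins, not the code in this PR, and `AtomicInteger` is used in place of a plain `int` only to keep the sketch thread-safe:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RecoverCallCounterSketch {
    // Hypothetical counter standing in for the former boolean `started` flag.
    static final AtomicInteger started = new AtomicInteger();

    /** Hypothetical stand-in for recover(); returns the number of calls so far. */
    static int recover() {
        return started.incrementAndGet();
    }

    /** True once recover() has been called at least expectedCalls times. */
    static boolean recoveredAtLeast(int expectedCalls) {
        return started.get() >= expectedCalls;
    }

    public static void main(String[] args) {
        recover();
        // A boolean flag would already read "started" here; the counter does not.
        System.out.println("after first call: " + recoveredAtLeast(2));
        recover();
        System.out.println("after second call: " + recoveredAtLeast(2));
    }
}
```

   With a counter, the test can wait for the second call specifically instead of reacting to the first one.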
   
   At the moment the test gets stuck in `jobClient.getJobStatus()`. The reason 
is that the thread invoking the RPC is blocked at the `recover()` stage, so 
it never even gets to try to fulfill the call. I understand the idea is 
precisely to prove that the RPC thread is free to make calls, but that isn't 
possible if it is the one stuck in `finishedRecovering.await()`. 
Images attached.
   
   
![bug2](https://user-images.githubusercontent.com/1392421/126906296-0ab39e4a-8d20-4371-ae4c-080e69da6359.png)
   The thread getting blocked is `flink-akka.actor.default-dispatcher-8` and it 
is that same thread that is expected to be invoking the RPC.
   
   
![debug3](https://user-images.githubusercontent.com/1392421/126906357-e2dea016-cb04-43cf-801b-8f765984ee42.png)
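
   The blocking pattern described above can be reproduced outside Flink. This is a sketch under assumptions: a single-threaded `ExecutorService` stands in for the RPC thread, a `CountDownLatch` stands in for `finishedRecovering`, and `statusCallCompletes` is a hypothetical helper, not part of the PR:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockedMainThreadSketch {

    /** Returns true if the status call finished within the timeout, false if it was starved. */
    static boolean statusCallCompletes(boolean blockInRecover) throws Exception {
        // Stand-in for the single RPC thread (an assumption, not Flink's actual executor).
        ExecutorService rpcThread = Executors.newSingleThreadExecutor();
        CountDownLatch finishedRecovering = new CountDownLatch(1);
        try {
            if (blockInRecover) {
                // "recover()" parks the only thread until the latch is released.
                rpcThread.submit(() -> {
                    finishedRecovering.await();
                    return null;
                });
            }
            // "getJobStatus()" is queued on the same thread, behind recover().
            Future<String> status = rpcThread.submit(() -> "RUNNING");
            try {
                status.get(200, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            // Release the latch and stop the thread so the sketch always terminates.
            finishedRecovering.countDown();
            rpcThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("blocking recover: " + statusCallCompletes(true));
        System.out.println("non-blocking recover: " + statusCallCompletes(false));
    }
}
```

   In this sketch the status call can only complete if the wait on the latch happens on some thread other than the one serving the call.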
   
   I tried switching a few things, to no avail. If the idea is for the new call 
to `recover()` to solve this problem, then I'd expect 
`finishedRecovering.await()` to happen on the thread that's invoking the new 
call to `recover()`. But that only happens once, at the beginning of the test, 
and it cannot get stuck there, otherwise the test just loops indefinitely at 
that point. Should it get called a second time during recovery?
   
   
   > Good job so far! ;)
   
   Thanks, let's see if we can get this one in the bag on Monday, or today if 
you're available :)
   
   
   

