zhuzhurk opened a new pull request #10867: [FLINK-14701][runtime] Fix 
MultiTaskSlot to not remove slots which are not its children
URL: https://github.com/apache/flink/pull/10867
 
 
   ## What is the purpose of the change
   
   If a SharedSlotOversubscribedException occurs, the MultiTaskSlot releases 
some of its child SingleTaskSlots. That release triggers a re-allocation of 
the task slot from inside SingleTaskSlot#release(...). Because the new 
allocation shares the same slotRequestId, it replaces the previous allocation 
in SlotSharingManager#allTaskSlots.
   SingleTaskSlot#release(...) then invokes MultiTaskSlot#releaseChild to 
release the previous allocation by its slotRequestId, which unexpectedly 
removes the new allocation from the SlotSharingManager.
   As a result, a slot leaks: the pending slot request is no longer tracked by 
the SlotSharingManager and cannot be released when its payload terminates.
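
   The sequence above can be illustrated with a minimal, self-contained sketch. 
This is not the Flink code or the actual patch: the class `ReleaseChildSketch`, 
the `Slot` type, and both `releaseChild...` methods are hypothetical stand-ins, 
and a plain `HashMap` plays the role of `allTaskSlots`. It only shows why 
removing by slotRequestId alone drops the replacement entry, and how checking 
that the entry still maps to the slot being released avoids that.

```java
import java.util.HashMap;
import java.util.Map;

public class ReleaseChildSketch {

    /** Hypothetical stand-in for a task slot; only object identity matters here. */
    static final class Slot {
        final String label;

        Slot(String label) {
            this.label = label;
        }
    }

    /** Hypothetical stand-in for the slotRequestId -> slot registry. */
    static final Map<String, Slot> allTaskSlots = new HashMap<>();

    /** Problematic variant: removes whatever is currently registered under the id. */
    static void releaseChildByIdOnly(String slotRequestId) {
        allTaskSlots.remove(slotRequestId);
    }

    /** Guarded variant: removes the entry only if it still maps to the released slot. */
    static void releaseChildIfOwned(String slotRequestId, Slot child) {
        // Slot does not override equals, so this comparison is by identity.
        allTaskSlots.remove(slotRequestId, child);
    }

    /** Replays the sequence and reports whether the new allocation is still tracked. */
    public static boolean newAllocationSurvives(boolean guarded) {
        allTaskSlots.clear();
        String slotRequestId = "request-1";

        Slot previous = new Slot("previous");
        allTaskSlots.put(slotRequestId, previous);

        // The re-allocation triggered during release reuses the same id
        // and therefore replaces the previous entry.
        Slot replacement = new Slot("replacement");
        allTaskSlots.put(slotRequestId, replacement);

        // The release of the previous allocation then continues.
        if (guarded) {
            releaseChildIfOwned(slotRequestId, previous);
        } else {
            releaseChildByIdOnly(slotRequestId);
        }
        return allTaskSlots.get(slotRequestId) == replacement;
    }

    public static void main(String[] args) {
        System.out.println("remove by id only, new allocation tracked: "
                + newAllocationSurvives(false));
        System.out.println("remove if owned, new allocation tracked: "
                + newAllocationSurvives(true));
    }
}
```

   In the unguarded variant the replacement entry disappears from the map, 
which corresponds to the untracked pending request described above. In the 
guarded variant it stays registered.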
   
   Note that this is not a problem in 1.10/master, since 
SharedSlotOversubscribedException was removed in FLINK-14314. However, it is 
still an issue in 1.9.
   The fix is still useful in master to prevent similar issues in the future.
   
   A test case testNoSlotLeakOnSharedSlotOversubscribedException which exhibits 
this issue can be found at 
https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14.
   
   ## Brief change log
   
     - *See the code change.*
   
   
   ## Verifying this change
   
   The change is verified on 1.9 with the test in 
https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
