ankitsultana commented on issue #12390:
URL: https://github.com/apache/pinot/issues/12390#issuecomment-1945473584

   Took a deep dive today and found the root cause. Here's the sequence of 
events that triggers it:
   
   * We have a consuming segment on a server. At commit time, it receives a 
DISCARD from the controller.
   * There's a huge GC pause.
   * Instead of receiving a CONSUMING to ONLINE transition, the segment 
receives an OFFLINE to ONLINE transition.
   * Since the TDM already has a Segment Data Manager (SDM) for the consuming 
segment, we skip the `addSegment` call and mark the transition successful 
anyway: 
[RealtimeTableDataManager.java#L387](https://github.com/apache/pinot/blob/38d86b0a6432e9a7249f1692ace36b6e34171b0a/pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeTableDataManager.java#L387)
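   The skipped-`addSegment` path above can be sketched roughly like this (a simplified, hypothetical model, not the actual Pinot code; the class, map, and segment names are illustrative):

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical sketch: if a data manager is already registered for the
   // segment name, the OFFLINE -> ONLINE transition returns early without
   // replacing it, so the stale consuming-segment manager survives.
   public class AddSegmentGuardSketch {
     static final Map<String, String> SEGMENT_DATA_MANAGERS = new ConcurrentHashMap<>();

     // Returns true if a new immutable segment was registered, false if the
     // call was skipped because a manager for that name already exists.
     static boolean addSegment(String segmentName) {
       if (SEGMENT_DATA_MANAGERS.containsKey(segmentName)) {
         // Early return: the consuming segment's SDM (and whatever resources
         // it holds) is never destroyed, yet the transition is marked done.
         return false;
       }
       SEGMENT_DATA_MANAGERS.put(segmentName, "ImmutableSegmentDataManager");
       return true;
     }

     public static void main(String[] args) {
       // The consuming segment registered its SDM before the GC pause.
       SEGMENT_DATA_MANAGERS.put("table__0__5__20240101T0000Z", "RealtimeSegmentDataManager");
       // OFFLINE -> ONLINE arrives afterwards: addSegment is skipped.
       System.out.println(addSegment("table__0__5__20240101T0000Z")); // prints false
     }
   }
   ```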
   
   So the `RealtimeSegmentDataManager` is never destroyed and the 
`_partitionGroupConsumerSemaphore` is never released.
   
   As more and more segments for this partition are committed on other servers, 
this server receives many OFFLINE to CONSUMING transitions, which pile up 
waiting to acquire the semaphore.
   
   When there's another big GC pause and, say, Helix reconnects, the 
`HelixTaskExecutor` is shut down, interrupting all the pending semaphore 
acquire calls and producing the original error message above (note 
`onBecomeConsumingFromOffline` in the stack trace).
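   A minimal sketch of the failure mode, assuming a one-permit partition-level semaphore in the spirit of `_partitionGroupConsumerSemaphore` (all other names here are illustrative): the permit is acquired by the consuming segment and, because cleanup is skipped, never released, so later transitions block in `acquire()` until a shutdown interrupts them.

   ```java
   import java.util.concurrent.Semaphore;

   public class SemaphoreLeakSketch {
     // One permit per partition group, modeling _partitionGroupConsumerSemaphore.
     static final Semaphore PARTITION_PERMIT = new Semaphore(1);

     public static void main(String[] args) throws Exception {
       PARTITION_PERMIT.acquire();        // consuming segment holds the permit
       // addSegment was skipped and the SDM never destroyed: no release().

       Thread nextConsumer = new Thread(() -> {
         try {
           PARTITION_PERMIT.acquire();    // OFFLINE -> CONSUMING piles up here
           PARTITION_PERMIT.release();
         } catch (InterruptedException e) {
           // This is the error surfaced when the executor is shut down.
           System.out.println("interrupted while acquiring partition permit");
         }
       });
       nextConsumer.start();
       Thread.sleep(200);                 // let it block on acquire()
       nextConsumer.interrupt();          // models the HelixTaskExecutor shutdown
       nextConsumer.join();
     }
   }
   ```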
   
   @Jackie-Jiang : this is an additional case where 
`RealtimeTableDataManager#addSegment` can be called. Do you have any 
recommendations on how to handle this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
