Hey Andras,

I had a quick look at HADOOP-18143. The methods in question in ProtobufRpcEngine2 are identical to the ones in ProtobufRpcEngine, so I am not sure why the race condition doesn't also happen in ProtobufRpcEngine. I need to debug and spend some more time on this, so I have reverted HADOOP-18082 for now to unblock YARN. As you said, the underlying issue will still be there, but the revert gives us some time to analyse it.

Thanks
-Ayush

On Mon, 28 Feb 2022 at 21:26, Gyori Andras <gand...@cloudera.com.invalid> wrote:

> Hey everyone!
>
> We have started seeing test failures in YARN PRs for a while. We have
> identified the problematic commit, which is HADOOP-18082
> <https://issues.apache.org/jira/browse/HADOOP-18082>, however, this change
> just revealed the race condition lying in ProtobufRpcEngine2 introduced in
> HADOOP-17046 <https://issues.apache.org/jira/browse/HADOOP-17046>. We have
> also fixed the underlying issue via a locking mechanism, presented in
> HADOOP-18143 <https://issues.apache.org/jira/browse/HADOOP-18143>, but
> since it is out of our area of expertise, we can neither verify nor
> guarantee that it will not cause some subtle issues in the RPC system.
> As we think it is a core part of Hadoop, we would use feedback from someone
> who is proficient in this part.
>
> Regards:
> Andras