Nice to see this feature brought up. I tried to implement it in our internal clusters and know that it is very complicated, so I am CCing hdfs-dev to bring in more discussion. By the way, I'm not sure whether virtual threads in newer JDK versions would help in this case.
On Mon, May 20, 2024 at 10:10 AM zhangjian <1361320...@qq.com.invalid> wrote:

> Hello everyone, the RPC of the HDFS router currently has some
> shortcomings:
>
> The router's handler threads are synchronous. After a *handler* thread
> adds a call to connection.calls, it has to wait until the *connection*
> thread notifies it that the call has completed. Only after the response
> has been put into the response queue can the handler take a new call from
> the call queue and process it. The concurrency of the router is therefore
> limited by the number of handlers. A simple example: if the number of
> handlers is 1 and the maximum number of calls in the connection thread is
> 10, then even though the connection thread can send 10 requests to the
> downstream ns, the router can only process one request after another,
> because there is only 1 handler.
>
> Since the performance of router RPC is mainly limited by the number of
> handlers, the most effective way to improve RPC performance today is to
> increase the number of handlers. However, creating a large number of
> handler threads also increases the number of thread context switches and
> does not make full use of the machine.
>
> There are usually multiple ns downstream of the router. If a handler
> forwards a request to an ns with poor performance, the handler waits for a
> long time. Because fewer handlers are then available, the router's ability
> to handle requests for ns that are performing normally is reduced. From
> the client's perspective, it looks as if the performance of the router's
> downstream ns has deteriorated. We often find that the call queue of the
> downstream ns is not high, but the call queue of the router is very high.
>
> Therefore, although the main function of the router is to federate and
> handle requests from multiple ns, the current synchronous RPC cannot
> satisfy the scenario where there are many ns downstream of the router.
> Even if the concurrency of the router can be improved by increasing the
> number of handlers, it is still relatively slow. More threads increase
> CPU context-switching time, and in practice many of the handler threads
> are blocked, which is a waste of thread resources. When a request enters
> the router, there is no guarantee that a handler is free to process it.
>
> Therefore, I am proposing asynchronous router RPC. Please see the issue
> https://issues.apache.org/jira/browse/HDFS-17531 for the complete
> solution.
>
> You can also look at this PR: https://github.com/apache/hadoop/pull/6838,
> which is just a demo, but it completes the core asynchronous RPC function.
> If you think asynchronous routing is feasible, we can consider splitting
> this PR for easier review in the future.
>
> The PDF is attached and can also be viewed on the issue.
>
> Everyone is welcome to join the discussion!
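
To make the "1 handler, 10 calls per connection" example in the quoted message concrete, here is a minimal Java sketch. It is not Hadoop code and does not reflect the design in HDFS-17531 or the demo PR. The class `RouterHandlerSketch`, the `Downstream` stand-in for an ns, and the latches are all hypothetical names invented for illustration. It only compares how many calls reach the downstream at once when the single handler waits for each response versus when it hands calls off without waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model, not Hadoop code: one handler thread, a "connection"
// pool that can carry 10 calls, and a fake downstream ns that records the
// peak number of calls it sees in flight at the same time.
public class RouterHandlerSketch {
    static final int CALLS = 10;

    static final class Downstream {
        private final AtomicInteger inFlight = new AtomicInteger();
        final AtomicInteger peak = new AtomicInteger();

        // Records the call, then holds it open until `release` reaches zero.
        String invoke(int id, CountDownLatch started, CountDownLatch release) {
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            started.countDown();
            awaitLatch(release);
            inFlight.decrementAndGet();
            return "response-" + id;
        }
    }

    static void awaitLatch(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Synchronous handler: waits for each response before taking the next call.
    static int syncPeak() {
        Downstream ns = new Downstream();
        ExecutorService connection = Executors.newFixedThreadPool(CALLS);
        ExecutorService handler = Executors.newSingleThreadExecutor();
        CountDownLatch open = new CountDownLatch(0); // count 0: never blocks
        try {
            CompletableFuture.runAsync(() -> {
                for (int i = 0; i < CALLS; i++) {
                    final int id = i;
                    CompletableFuture
                        .supplyAsync(() -> ns.invoke(id, open, open), connection)
                        .join(); // the handler thread blocks here
                }
            }, handler).join();
            return ns.peak.get();
        } finally {
            handler.shutdown();
            connection.shutdown();
        }
    }

    // Asynchronous handler: hands each call to the connection pool and moves on.
    static int asyncPeak() {
        Downstream ns = new Downstream();
        ExecutorService connection = Executors.newFixedThreadPool(CALLS);
        ExecutorService handler = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(CALLS);
        CountDownLatch release = new CountDownLatch(1);
        List<CompletableFuture<String>> responses = new ArrayList<>();
        try {
            CompletableFuture.runAsync(() -> {
                for (int i = 0; i < CALLS; i++) {
                    final int id = i;
                    responses.add(CompletableFuture.supplyAsync(
                        () -> ns.invoke(id, started, release), connection));
                }
            }, handler).join();
            awaitLatch(started);     // every call has reached the downstream
            release.countDown();     // let the downstream answer all of them
            CompletableFuture.allOf(responses.toArray(new CompletableFuture[0])).join();
            return ns.peak.get();
        } finally {
            handler.shutdown();
            connection.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sync peak in-flight calls: " + syncPeak());
        System.out.println("async peak in-flight calls: " + asyncPeak());
    }
}
```

In this toy model the synchronous variant never has more than 1 call in flight, while the asynchronous variant reaches all 10 with the same single handler thread. A real router would also have to deal with response ordering, error propagation, and back-pressure, which this sketch leaves out entirely.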