Re: [Discuss] RBF: Asynchronous router RPC.

zhangjian Tue, 21 May 2024 05:15:19 -0700

Hi Xiaoqiao He, thank you for your reply.

1. Currently, the server and client protocols within the router can be 
implemented by extending the existing protocols and adding asynchronous 
functionality, so the existing synchronous protocols are not affected. The 
classes involved are:
RouterClientNamenodeProtocolServerSideTranslatorPB
RouterClientProtocolTranslatorPB
RouterGetUserMappingsProtocolServerSideTranslatorPB
RouterGetUserMappingsProtocolTranslatorPB
RouterNamenodeProtocolServerSideTranslatorPB
RouterNamenodeProtocolTranslatorPB
RouterRefreshUserMappingsProtocolServerSideTranslatorPB
RouterRefreshUserMappingsProtocolTranslatorPB


The following issues have already implemented asynchronous callbacks for the 
RPC server, but I have not found any other module that uses them:
Server: HADOOP-11552, HADOOP-17046
The asynchronous RPC client builds directly on this issue:
Client: HADOOP-13226
Therefore, I believe the asynchronous router's changes to the RPC protocol, 
RPC server, and client are safe for other modules.
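
As a rough illustration of "extend the existing protocol and add asynchronous 
functionality", here is a minimal sketch. It does not use any Hadoop class: 
FileInfoProtocol, SyncTranslator, AsyncTranslator, and takePending() are 
hypothetical names, and handing the result over through a thread-local 
CompletableFuture is only one possible way to keep the synchronous method 
signature unchanged. It is not necessarily what the PR does.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: none of these types are Hadoop classes.
class AsyncProtocolSketch {

  // Stand-in for an existing synchronous protocol interface.
  interface FileInfoProtocol {
    String getFileInfo(String path);
  }

  // Stand-in for an existing synchronous translator; it is left untouched.
  static class SyncTranslator implements FileInfoProtocol {
    @Override
    public String getFileInfo(String path) {
      return lookup(path);
    }

    // Simulates the downstream call.
    protected String lookup(String path) {
      return "info:" + path;
    }
  }

  // Asynchronous variant added next to the synchronous one by extension.
  static class AsyncTranslator extends SyncTranslator {
    private static final ThreadLocal<CompletableFuture<String>> PENDING =
        new ThreadLocal<>();

    @Override
    public String getFileInfo(String path) {
      // Start the call and return at once; the return value is a placeholder.
      PENDING.set(CompletableFuture.supplyAsync(() -> lookup(path)));
      return null;
    }

    // The caller picks up the future on the same thread that made the call.
    static CompletableFuture<String> takePending() {
      CompletableFuture<String> future = PENDING.get();
      PENDING.remove();
      return future;
    }
  }

  public static void main(String[] args) {
    FileInfoProtocol sync = new SyncTranslator();
    System.out.println(sync.getFileInfo("/a"));   // blocks until the answer is ready

    FileInfoProtocol async = new AsyncTranslator();
    async.getFileInfo("/a");                      // returns immediately
    System.out.println(AsyncTranslator.takePending().join());
  }
}
```

Code that keeps using SyncTranslator sees no change in behavior, which is the 
property the paragraph above relies on.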

2. When forwarding a request to multiple downstream ns, the synchronous router 
handler submits one request per downstream ns to the thread pool 
(RouterRpcClient.executorService) and then waits for the responses from all 
downstream ns before returning. Since the threads in that pool also process 
RPC requests synchronously, just like a handler, the number of threads in the 
pool directly affects the performance of invokeConcurrent, which in turn 
affects the performance of the handler.
In the asynchronous router implementation, the handler calls invokeConcurrent 
only to convert one request into multiple requests and add them to the async 
handler thread pool; the handler can then process the next request in the call 
queue. When a connection thread of a downstream ns receives a response, it 
hands it over to an async response thread for processing. The async response 
thread checks whether all responses from the downstream ns have arrived. If 
so, it continues to process the response; otherwise, it moves on to the next 
response. The asynchronous router uses CompletableFuture.allOf() to implement 
the asynchronous invokeConcurrent, so the handler, async handler, async 
response, and connection threads never need to wait synchronously.
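
The fan-out described above can be sketched roughly as follows. This is a 
minimal illustration, not the code in the PR: AsyncFanOutSketch, 
invokeConcurrentAsync, and the downstreamCall function are hypothetical, and 
the downstream RPC is simulated by a plain function. Only 
CompletableFuture.allOf() comes from the description above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical sketch: the names here are illustrative, not router classes.
class AsyncFanOutSketch {

  // Fans one request out to every namespace and returns immediately.
  // downstreamCall stands in for the real RPC to a downstream ns.
  static CompletableFuture<Map<String, String>> invokeConcurrentAsync(
      List<String> namespaces,
      Function<String, String> downstreamCall,
      ExecutorService asyncHandlers) {
    Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
    for (String ns : namespaces) {
      pending.put(ns,
          CompletableFuture.supplyAsync(() -> downstreamCall.apply(ns), asyncHandlers));
    }
    CompletableFuture<Void> all =
        CompletableFuture.allOf(pending.values().toArray(new CompletableFuture[0]));
    // This stage runs only after the last response has arrived,
    // so join() on each future returns without blocking.
    return all.thenApply(ignored -> {
      Map<String, String> merged = new LinkedHashMap<>();
      pending.forEach((ns, future) -> merged.put(ns, future.join()));
      return merged;
    });
  }

  public static void main(String[] args) {
    ExecutorService asyncHandlers = Executors.newFixedThreadPool(2);
    try {
      CompletableFuture<Map<String, String>> merged = invokeConcurrentAsync(
          Arrays.asList("ns0", "ns1", "ns2"),
          ns -> "listing from " + ns,
          asyncHandlers);
      // Only this demo blocks here; a handler would return right after the call above.
      System.out.println(merged.join());
    } finally {
      asyncHandlers.shutdown();
    }
  }
}
```

The thread that calls invokeConcurrentAsync never waits for a response; the 
merge step is attached as a continuation and runs on whichever thread 
completes the last downstream call.
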
In addition, the synchronous router has drawbacks not only in multi-ns 
environments but also with a single downstream ns: it is often difficult to 
decide how many handlers to configure for the router. Too many handlers waste 
thread resources, and too few cannot put enough pressure on the downstream ns. 
The asynchronous router can push requests to the downstream ns without this 
handler tuning. It can also connect more easily to other downstream storage 
services that support the HDFS protocol, which gives better scalability.

3. Since I have not yet deployed the asynchronous router to our own cluster, I 
have no performance comparison. In theory, I expect the asynchronous router to 
use more memory than the synchronous router, but not much more, especially 
since we can cap the maximum number of requests entering the router, and 
CompletableFuture is stable and widely used. In other respects, it should be 
far superior to the synchronous router, especially in scenarios with many 
downstream ns. If anyone is interested, you are welcome to help with a 
performance comparison.
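
One way to picture "control the maximum number of requests entering the 
router" is a permit counter around each asynchronous call. The sketch below 
is an assumption for illustration only: InFlightLimiterSketch and its methods 
are hypothetical, and the router may bound requests through a different 
mechanism. It shows that memory held by pending CompletableFutures stays 
bounded because a request over the cap is rejected instead of queued.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: not a router class.
class InFlightLimiterSketch {
  private final Semaphore permits;

  InFlightLimiterSketch(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  // Starts the async call if a permit is free; otherwise fails fast.
  <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> asyncCall) {
    if (!permits.tryAcquire()) {
      CompletableFuture<T> rejected = new CompletableFuture<>();
      rejected.completeExceptionally(
          new IllegalStateException("too many in-flight requests"));
      return rejected;
    }
    CompletableFuture<T> call;
    try {
      call = asyncCall.get();
    } catch (RuntimeException e) {
      permits.release();  // the call never started, so give the permit back
      throw e;
    }
    // Release the permit when the call finishes, whether it succeeded or failed.
    return call.whenComplete((result, error) -> permits.release());
  }

  int available() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    InFlightLimiterSketch limiter = new InFlightLimiterSketch(1);
    CompletableFuture<String> downstream = new CompletableFuture<>();

    CompletableFuture<String> first = limiter.submit(() -> downstream);
    CompletableFuture<String> second =
        limiter.submit(() -> CompletableFuture.completedFuture("unused"));
    System.out.println("second rejected: " + second.isCompletedExceptionally());

    downstream.complete("response");  // simulated downstream response
    System.out.println("first result: " + first.join());
    System.out.println("free permits: " + limiter.available());
  }
}
```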

> On May 21, 2024, at 11:39, Xiaoqiao He <[email protected]> wrote:
> 
> Thanks for this great proposal!
> 
> Some questions after reviewing the design doc (sorry, I didn't review the
> PR carefully, as it is too large):
> 1. This solution involves an RPC framework update. Will it affect other
> modules, and how do we keep other modules isolated from these changes?
> 2. Some RPC requests should be forwarded concurrently to all downstream NS.
> Does this solution cover that case?
> 3. Considering there is an initial implementation, did you collect any
> benchmarks against the current synchronous model of DFSRouter?
> Thanks again.
> 
> Best Regards,
> - He Xiaoqiao
> 
> On Tue, May 21, 2024 at 11:21 AM zhangjian <[email protected]>
> wrote:
> 
>> Thank you for your positive attitude towards this feature. You can debug
>> the UTs provided in PR to better understand the current asynchronous
>> calling function.
>> 
>>> On May 21, 2024, at 02:04, Simbarashe Dzinamarira <[email protected]> wrote:
>>> 
>>> Excited to see this feature as well. I'll spend more time understanding
>>> the proposal and implementation.
>>> 
>>> On Mon, May 20, 2024 at 7:55 AM zhangjian <[email protected]> wrote:
>>> 
>>>> Hi Yuanbo Liu, thank you for your interest in this feature. I think the
>>>> difficulty of an asynchronous router lies not only in implementing the
>>>> asynchronous functions, but also in keeping the code readable and
>>>> reusable, so that the community can keep developing it. At the beginning
>>>> I also planned to use the virtual threads you mentioned. Virtual threads
>>>> can make the code asynchronous elegantly, but the biggest problem is that
>>>> upgrading the JDK version is not easy, either in the community or in real
>>>> production environments. Therefore, I later used CompletableFuture, which
>>>> is already supported by JDK 8, to make the router asynchronous. The router
>>>> is stateless, and the router RPC process is very clear. Therefore, even if
>>>> CompletableFuture itself is not as readable as virtual threads, a good
>>>> design can still make the asynchronous process look very clear.
>>>> 
>>>> 
>>>>> On May 20, 2024, at 10:56, Yuanbo Liu <[email protected]> wrote:
>>>>> 
>>>>> Nice to see this feature brought up. I tried to implement this feature
>>>>> in our internal clusters, and know that it's a very complicated feature.
>>>>> CC hdfs-dev to bring more discussion.
>>>>> By the way, I'm not sure whether the virtual threads of a higher JDK
>>>>> will help in this case.
>>>>> 
>>>>> On Mon, May 20, 2024 at 10:10 AM zhangjian <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hello everyone, currently there are some shortcomings in the RPC of the
>>>>>> HDFS router:
>>>>>> 
>>>>>> Currently the router's handler thread is synchronous: when the handler
>>>>>> thread adds a call to connection.calls, it has to wait until the
>>>>>> connection thread notifies it that the call is complete. Only after the
>>>>>> response is put into the response queue can the handler take a new call
>>>>>> from the call queue and process it. Therefore, the concurrency of the
>>>>>> router is limited by the number of handlers. A simple example: if the
>>>>>> number of handlers is 1 and the maximum number of calls in the
>>>>>> connection thread is 10, then even though the connection thread can
>>>>>> send 10 requests to the downstream ns, the router can only process one
>>>>>> request after another because there is only 1 handler.
>>>>>> 
>>>>>> Since the performance of router RPC is mainly limited by the number of
>>>>>> handlers, the most effective way to improve RPC performance currently
>>>>>> is to increase the number of handlers. However, letting the router
>>>>>> create a large number of handler threads also increases the number of
>>>>>> thread switches and cannot make full use of the machine.
>>>>>> 
>>>>>> There are usually multiple ns downstream of the router. If a handler
>>>>>> forwards a request to an ns with poor performance, the handler waits
>>>>>> for a long time. With fewer handlers available, the router's ability to
>>>>>> handle requests for ns with normal performance is reduced. From the
>>>>>> client's perspective, the performance of the router's downstream ns
>>>>>> appears to have deteriorated at this time. We often find that the call
>>>>>> queue of the downstream ns is not high, but the call queue of the
>>>>>> router is very high.
>>>>>> 
>>>>>> Therefore, although the main function of the router is to federate and
>>>>>> handle requests from multiple NSs, the current synchronous RPC
>>>>>> performance cannot satisfy scenarios with many NSs downstream of the
>>>>>> router. Even if the router's concurrency can be improved by increasing
>>>>>> the number of handlers, it is still relatively slow. More threads
>>>>>> increase CPU context switching time, and in fact many of the handler
>>>>>> threads are blocked, which wastes thread resources. When a request
>>>>>> enters the router, there is no guarantee that a handler is available
>>>>>> to run it at that time.
>>>>>> 
>>>>>> 
>>>>>> Therefore, I am considering an asynchronous router RPC. Please see this
>>>>>> issue for the complete solution:
>>>>>> https://issues.apache.org/jira/browse/HDFS-17531
>>>>>> 
>>>>>> You can also view this PR: https://github.com/apache/hadoop/pull/6838,
>>>>>> which is just a demo, but it completes the core asynchronous RPC
>>>>>> function. If you think asynchronous routing is feasible, we can
>>>>>> consider splitting this PR for easier review in the future.
>>>>>> 
>>>>>> The PDF is attached and can also be viewed through the issue.
>>>>>> 
>>>>>> Welcome everyone to exchange and discuss!
>>>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>>> 


