Re: [Discuss] RBF: Asynchronous router RPC.

zhangjian Tue, 21 May 2024 19:03:35 -0700

Hi, Sangjin Lee, thank you for your attention. I will use my free time to run a
performance comparison soon.


> On May 22, 2024, at 03:42, Sangjin Lee <sj...@apache.org> wrote:
> 
> Thanks for the great proposal, Zhangjian. On point #3, I suspect it should
> be fairly straightforward to create a small isolated synthetic test to
> prove (or disprove) the benefits of this approach. By driving a controlled
> amount of requests per second, you could see latency, memory, CPU, etc.
> Ideally, it should show meaningful improvements without much degradation in
> other metrics. Would you be able to spend some time doing that?
> 
> Thanks,
> Sangjin
> 
> On Tue, May 21, 2024 at 5:13 AM zhangjian <1361320...@qq.com.invalid> wrote:
> 
>> Hi, xiaoqiao he, thank you for your reply.
>> 
>> 1. Currently, the server and client protocols within the router can be
>> implemented by extending the existing protocols and adding asynchronous
>> functionality, so the existing synchronous protocols are not affected:
>> RouterClientNamenodeProtocolServerSideTranslatorPB
>> RouterClientProtocolTranslatorPB
>> RouterGetUserMappingsProtocolServerSideTranslatorPB
>> RouterGetUserMappingsProtocolTranslatorPB
>> RouterNamenodeProtocolServerSideTranslatorPB
>> RouterNamenodeProtocolTranslatorPB
>> RouterRefreshUserMappingsProtocolServerSideTranslatorPB
>> RouterRefreshUserMappingsProtocolTranslatorPB
>> 
>> The following issues implemented asynchronous callbacks for the RPC
>> server, but I have not found any other modules that use them:
>> Server: HADOOP-11552, HADOOP-17046
>> The asynchronous RPC client implementation directly uses this issue:
>> Client: HADOOP-13226
>> Therefore, I believe the asynchronous router's changes to the RPC
>> protocol, RPC server, and client are safe for other modules.
>> 
>> 2. When a request is forwarded to multiple downstream ns, the synchronous
>> router handler adds the per-ns requests to a thread pool
>> (RouterRpcClient.executorService) and then waits for the responses from all
>> downstream ns before returning. Since the threads in that pool also
>> process rpc requests synchronously, just like a handler, the number of
>> threads in the pool directly affects the performance of
>> invokeConcurrent, which in turn affects the performance of the handler.
>> In the asynchronous router implementation, the handler calls
>> invokeConcurrent only to split one request into multiple requests and add
>> them to the async handler thread pool; the handler can then process the
>> next request in the call queue. When a connection thread of a downstream
>> ns receives a response, it hands the response over to an async response
>> thread. The async response thread checks whether responses from all
>> downstream ns have arrived. If so, it continues processing the combined
>> result; otherwise, it moves on to the next response. The asynchronous
>> router uses CompletableFuture.allOf() to implement the asynchronous
>> invokeConcurrent, so the handler, async handler, async response, and
>> connection threads never need to wait synchronously.
>> In addition, the drawbacks of the synchronous router are not limited to
>> multi-ns environments. Even with a single downstream ns, it is often
>> difficult to decide how many handlers to configure for the router: too
>> many wastes thread resources, and too few cannot put enough pressure on
>> the downstream ns. The asynchronous router can push requests to the
>> downstream ns without this handler tuning. It can also connect more
>> easily to other downstream storage services that support the HDFS
>> protocol, which gives better scalability.
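To make point 2 more concrete, here is a minimal sketch of the CompletableFuture.allOf() idea. It is not the code from the PR: the class AsyncFanOutSketch and the methods callNamespace and invokeConcurrentAsync are hypothetical names used only for illustration, and the downstream RPC is replaced by a stand-in that simply returns a string.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; not the RouterRpcClient code from the PR. */
final class AsyncFanOutSketch {

  /** Stand-in for an RPC to one downstream ns (assumption, no real RPC). */
  static CompletableFuture<String> callNamespace(String ns, ExecutorService pool) {
    return CompletableFuture.supplyAsync(() -> "result-from-" + ns, pool);
  }

  /**
   * Fans one request out to every ns and returns without waiting.
   * The returned future completes once all downstream futures complete.
   */
  static CompletableFuture<Map<String, String>> invokeConcurrentAsync(
      List<String> nameservices, ExecutorService pool) {
    Map<String, CompletableFuture<String>> pending = new LinkedHashMap<>();
    for (String ns : nameservices) {
      pending.put(ns, callNamespace(ns, pool));
    }
    CompletableFuture<Void> all = CompletableFuture.allOf(
        pending.values().toArray(new CompletableFuture<?>[0]));
    // Runs only after every response has arrived, on whichever thread
    // completed the last one; no thread is parked waiting in between.
    return all.thenApply(ignored -> {
      Map<String, String> results = new LinkedHashMap<>();
      for (Map.Entry<String, CompletableFuture<String>> e : pending.entrySet()) {
        results.put(e.getKey(), e.getValue().join()); // already complete
      }
      return results;
    });
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      // join() is used here only so the demo can print the result.
      Map<String, String> results =
          invokeConcurrentAsync(Arrays.asList("ns0", "ns1", "ns2"), pool).join();
      System.out.println(results);
    } finally {
      pool.shutdown();
    }
  }
}
```

In a real handler the caller would not call join(); it would attach the response-sending step to the returned future and go back to the call queue.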
>> 
>> 3. Since I have not yet deployed the asynchronous router to our own
>> cluster, there is no performance comparison yet. In theory, I expect the
>> asynchronous router to use more memory than the synchronous router, but
>> not much more, especially since we can cap the maximum number of requests
>> entering the router, and CompletableFuture is stable and widely used. In
>> other respects, it should be far superior to the synchronous router,
>> especially with many downstream ns. If anyone is interested, you are
>> welcome to help with a performance comparison.
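One possible way to cap the number of requests inside an asynchronous router, and with it the memory held by pending futures, is a permit counter. The sketch below is only an assumption about how such a cap could look; InFlightLimiterSketch and its methods are hypothetical and do not come from the PR or from any Hadoop API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical sketch of bounding in-flight asynchronous requests. */
final class InFlightLimiterSketch {
  private final Semaphore permits;

  InFlightLimiterSketch(int maxInFlight) {
    this.permits = new Semaphore(maxInFlight);
  }

  /**
   * Starts the call if a permit is free; otherwise returns a future that
   * is already failed. Never blocks the calling thread.
   */
  <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> call) {
    if (!permits.tryAcquire()) {
      CompletableFuture<T> rejected = new CompletableFuture<>();
      rejected.completeExceptionally(
          new IllegalStateException("too many in-flight requests"));
      return rejected;
    }
    CompletableFuture<T> started;
    try {
      started = call.get();
    } catch (RuntimeException e) {
      permits.release(); // the call never started, give the permit back
      throw e;
    }
    // Release the permit when the call finishes, normally or with an error.
    return started.whenComplete((result, error) -> permits.release());
  }

  int available() {
    return permits.availablePermits();
  }

  public static void main(String[] args) {
    InFlightLimiterSketch limiter = new InFlightLimiterSketch(2);
    CompletableFuture<String> slow = new CompletableFuture<>();
    limiter.submit(() -> slow);
    limiter.submit(() -> slow);
    CompletableFuture<String> third = limiter.submit(() -> slow);
    System.out.println("third rejected: " + third.isCompletedExceptionally());
    slow.complete("ok");
    System.out.println("free permits: " + limiter.available());
  }
}
```

Rejecting instead of queueing is just one choice; a real router might rather apply backpressure through the existing call queue.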
>> 
>>> On May 21, 2024, at 11:39, Xiaoqiao He <hexiaoq...@apache.org> wrote:
>>> 
>>> Thanks for this great proposal!
>>> 
>>> Some questions after reviewing the design doc (sorry, I didn't review the
>>> PR carefully; it is too large):
>>> 1. This solution involves an RPC framework update. Will it affect other
>>> modules, and how do we keep other modules isolated from these changes?
>>> 2. Some RPC requests should be forwarded concurrently to all downstream
>>> NS. Does this solution cover that case?
>>> 3. Considering there is an initial implementation, did you collect any
>>> benchmarks against the current synchronous model of DFSRouter?
>>> Thanks again.
>>> 
>>> Best Regards,
>>> - He Xiaoqiao
>>> 
>>> On Tue, May 21, 2024 at 11:21 AM zhangjian <1361320...@qq.com.invalid>
>>> wrote:
>>> 
>>>> Thank you for your positive attitude towards this feature. You can debug
>>>> the UTs provided in PR to better understand the current asynchronous
>>>> calling function.
>>>> 
>>>>> On May 21, 2024, at 02:04, Simbarashe Dzinamarira <simbadz...@apache.org> wrote:
>>>>> 
>>>>> Excited to see this feature as well. I'll spend more time understanding
>>>>> the proposal and implementation.
>>>>> 
>>>>> On Mon, May 20, 2024 at 7:55 AM zhangjian <1361320...@qq.com.invalid> wrote:
>>>>> 
>>>>>> Hi, Yuanbo liu, thank you for your interest in this feature. I think
>>>>>> the difficulty of an asynchronous router lies not only in implementing
>>>>>> the asynchronous functions, but also in keeping the code readable and
>>>>>> reusable, so that the community can continue to develop it. At the
>>>>>> beginning I also planned to use the virtual threads you mentioned.
>>>>>> Virtual threads can achieve asynchrony elegantly at the code level,
>>>>>> but the biggest problem is that upgrading the jdk version is not easy,
>>>>>> either in the community or in actual production environments.
>>>>>> Therefore, I later used CompletableFuture, which is available in
>>>>>> jdk 8, to achieve asynchrony. The router is stateless, and the router
>>>>>> rpc process is very clear. Therefore, even if CompletableFuture itself
>>>>>> is not as readable as virtual threads, a good design can still make
>>>>>> the asynchronous process look very clear.
>>>>>> 
>>>>>> 
>>>>>>> On May 20, 2024, at 10:56, Yuanbo Liu <liuyuanb...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Nice to see this feature brought up. I tried to implement this
>>>>>>> feature in our internal clusters, and know that it's a very
>>>>>>> complicated feature. CC hdfs-dev to bring more discussion.
>>>>>>> By the way, I'm not sure whether the virtual threads of a higher jdk
>>>>>>> will help in this case.
>>>>>>> 
>>>>>>> On Mon, May 20, 2024 at 10:10 AM zhangjian <1361320...@qq.com.invalid> wrote:
>>>>>>> 
>>>>>>>> Hello everyone, currently there are some shortcomings in the RPC of
>>>>>>>> the HDFS router:
>>>>>>>> 
>>>>>>>> Currently the router's handler thread is synchronous. When the
>>>>>>>> *handler* thread adds a call to connection.calls, it must wait until
>>>>>>>> the *connection* thread notifies it that the call has completed. Only
>>>>>>>> after the response is put into the response queue can the handler
>>>>>>>> take a new call from the call queue and process it. Therefore, the
>>>>>>>> concurrency of the router is limited by the number of handlers. A
>>>>>>>> simple example: if the number of handlers is 1 and the maximum number
>>>>>>>> of calls in the connection thread is 10, then even though the
>>>>>>>> connection thread could send 10 requests to the downstream ns, the
>>>>>>>> router can only process one request at a time because there is only
>>>>>>>> one handler.
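The one-handler example above can be reproduced with a small simulation. This is only a sketch under stated assumptions: HandlerBlockingSketch is a hypothetical class, the downstream ns is simulated by a scheduled executor that answers after a fixed delay, and no Hadoop classes are involved. It measures how many downstream calls are outstanding at the same time when a single handler either blocks on each response or hands the response off to a callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical simulation of one handler thread in front of a slow ns. */
final class HandlerBlockingSketch {
  private final ScheduledExecutorService downstream =
      Executors.newScheduledThreadPool(1);
  private final AtomicInteger inFlight = new AtomicInteger();
  private final AtomicInteger maxInFlight = new AtomicInteger();

  /** Simulated downstream call that answers after delayMs. */
  private CompletableFuture<Integer> call(int id, long delayMs) {
    int now = inFlight.incrementAndGet();
    maxInFlight.accumulateAndGet(now, Math::max);
    CompletableFuture<Integer> response = new CompletableFuture<>();
    downstream.schedule(() -> {
      inFlight.decrementAndGet();
      response.complete(id);
    }, delayMs, TimeUnit.MILLISECONDS);
    return response;
  }

  /** Sends all requests through ONE handler; returns the peak in-flight count. */
  static int peakInFlight(boolean async, int requests, long delayMs) {
    HandlerBlockingSketch s = new HandlerBlockingSketch();
    ExecutorService handler = Executors.newSingleThreadExecutor();
    List<CompletableFuture<Integer>> responses = new ArrayList<>();
    try {
      for (int i = 0; i < requests; i++) {
        final int id = i;
        CompletableFuture<Integer> done = new CompletableFuture<>();
        responses.add(done);
        handler.execute(() -> {
          CompletableFuture<Integer> f = s.call(id, delayMs);
          if (async) {
            f.thenAccept(done::complete); // handler returns immediately
          } else {
            done.complete(f.join());      // handler blocks until the response
          }
        });
      }
      CompletableFuture.allOf(
          responses.toArray(new CompletableFuture<?>[0])).join();
      return s.maxInFlight.get();
    } finally {
      handler.shutdown();
      s.downstream.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("sync peak in-flight:  " + peakInFlight(false, 10, 20));
    System.out.println("async peak in-flight: " + peakInFlight(true, 10, 20));
  }
}
```

With the blocking handler the peak is always 1, because the next call cannot start until the previous response has arrived. With the callback variant the single handler can have several calls outstanding at once, up to the number of requests submitted.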
>>>>>>>> 
>>>>>>>> Since the performance of router rpc is mainly limited by the number
>>>>>>>> of handlers, the most effective way to improve rpc performance
>>>>>>>> currently is to increase the number of handlers. However, letting the
>>>>>>>> router create a large number of handler threads also increases the
>>>>>>>> number of thread switches and cannot make full use of the machine.
>>>>>>>> 
>>>>>>>> There are usually multiple ns downstream of the router. If a handler
>>>>>>>> forwards a request to an ns with poor performance, the handler waits
>>>>>>>> for a long time. With fewer handlers available, the router's ability
>>>>>>>> to handle requests for the ns with normal performance is reduced.
>>>>>>>> From the perspective of the client, it looks as if the performance of
>>>>>>>> the downstream ns of the router has deteriorated. We often find that
>>>>>>>> the call queue of the downstream ns is not long, but the call queue
>>>>>>>> of the router is very long.
>>>>>>>> 
>>>>>>>> Therefore, although the main function of the router is to federate
>>>>>>>> and handle requests for multiple NSs, the current synchronous RPC
>>>>>>>> performance cannot satisfy scenarios with many NSs downstream of the
>>>>>>>> router. Even if the concurrency of the router can be improved by
>>>>>>>> increasing the number of handlers, it is still relatively slow. More
>>>>>>>> threads increase CPU context-switching time, and in fact many of the
>>>>>>>> handler threads are blocked, which is a waste of thread resources.
>>>>>>>> When a request enters the router, there is no guarantee that a
>>>>>>>> handler is available to run it at that time.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Therefore, I propose an asynchronous router rpc. Please see the issue
>>>>>>>> https://issues.apache.org/jira/browse/HDFS-17531 for the complete
>>>>>>>> solution.
>>>>>>>> 
>>>>>>>> You can also view this PR:
>>>>>>>> https://github.com/apache/hadoop/pull/6838,
>>>>>>>> which is just a demo, but it completes the core asynchronous RPC
>>>>>>>> function. If you think asynchronous routing is feasible, we can
>>>>>>>> consider splitting this PR for easier review in the future.
>>>>>>>> 
>>>>>>>> The PDF is attached and can also be viewed through the issue.
>>>>>>>> 
>>>>>>>> Everyone is welcome to join the discussion!
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
>>>>>> For additional commands, e-mail: common-dev-h...@hadoop.apache.org
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
>>>> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>>>> 
>>>> 
>> 
>> 
>> 
>> 
> 


