zhangyue19921010 commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1183759671

   Hi @alexeykudinkin, thanks a lot for your attention! Glad to discuss this 
further :)
   
   1. The Disruptor performs well not only in the multi-producer 
multi-consumer model, but also in single-producer single-consumer scenarios, 
thanks to its lock-free design (see 
https://github.com/LMAX-Exchange/disruptor/wiki/Performance-Results). In fact, 
the Disruptor documentation officially recommends the single-producer model: 
https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction 
   > One of the best ways to improve performance in concurrent systems is to 
adhere to the [Single Writer 
Principle](https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html),
 this applies to the Disruptor. If you are in the situation where there will 
only ever be a single thread producing events into the Disruptor, then you can 
take advantage of this to gain additional performance.
   
   2. I am not sure why it would lead to an OOM `when reader is reading too 
fast and writing is not able to keep up`. To my understanding, when consumers 
drain data quickly and producers produce relatively slowly, consumers simply 
keep waiting, which hurts the application's throughput. What we want is to 
deliver the producer's data to the consumer side as soon as possible, thereby 
improving CPU utilization and throughput. That is why we want to use the 
Disruptor queue: its lock-free design makes the data hand-off more efficient.
   3. I also fully agree with your point that the Disruptor cannot solve 
every problem, nor deliver satisfactory gains in every scenario. It depends on 
where the performance bottleneck of the user's Hudi ingestion lies. For 
example, in the scenario simulated in the benchmark, where the user's schema 
is relatively simple, downstream consumption is fast, and the bottleneck is 
production (i.e. a lot of time is spent waiting for data to become ready), the 
Disruptor can deliver its greatest value.
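   To make the single-writer point above concrete, here is a minimal sketch 
(not the Disruptor itself, just a hypothetical illustration) of a 
single-producer/single-consumer ring buffer. Because each cursor has exactly 
one writer thread, no locks are needed; visibility is established by the 
volatile semantics of `AtomicLong`:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the single-writer principle: the producer thread alone advances
// writeCursor, the consumer thread alone advances readCursor, so neither
// cursor is ever contended and no lock is required.
final class SpscRingBuffer {
    private final long[] buffer;
    private final int mask;                       // capacity must be a power of two
    private final AtomicLong writeCursor = new AtomicLong(0);
    private final AtomicLong readCursor  = new AtomicLong(0);

    SpscRingBuffer(int capacity) {
        buffer = new long[capacity];
        mask = capacity - 1;
    }

    /** Single producer: spin until a slot is free, then publish. */
    void publish(long value) {
        long seq = writeCursor.get();
        while (seq - readCursor.get() >= buffer.length) {
            Thread.onSpinWait();                  // stand-in for a wait strategy
        }
        buffer[(int) (seq & mask)] = value;
        writeCursor.set(seq + 1);                 // volatile write publishes the slot
    }

    /** Single consumer: spin until data is available, then consume. */
    long take() {
        long seq = readCursor.get();
        while (seq >= writeCursor.get()) {
            Thread.onSpinWait();
        }
        long value = buffer[(int) (seq & mask)];
        readCursor.set(seq + 1);                  // frees the slot for the producer
        return value;
    }
}

public class SpscDemo {
    public static void main(String[] args) throws InterruptedException {
        SpscRingBuffer q = new SpscRingBuffer(1024);
        final int n = 100_000;
        long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < n; i++) sum[0] += q.take();
        });
        consumer.start();
        for (long i = 1; i <= n; i++) q.publish(i); // single writer thread
        consumer.join();
        System.out.println(sum[0]);               // prints 5000050000
    }
}
```

   The real Disruptor adds pluggable wait strategies and cache-line padding 
on top of this basic idea, which is what its `ProducerType.SINGLE` mode 
exploits.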
   
   
   As for the conclusion `avoiding locks in that path will be able to reduce 
our compute footprint by about ~10%`, I'm glad this will have a positive 
impact; at least it won't get worse :) Maybe we can run tests on a variety of 
scenarios, especially ones where the bottleneck is the data production speed, 
to see how the optimization performs.
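   As a hypothetical skeleton for such tests (names and workload are 
assumptions, not Hudi code): a producer with a simulated per-record cost 
feeding a fast consumer through a lock-based `ArrayBlockingQueue`. Swapping 
in a lock-free queue and inverting the producer/consumer cost ratio would 
give the comparison suggested above:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Micro-benchmark sketch: slow producer, fast consumer, lock-based queue.
// The throughput printed here is the baseline a lock-free variant would be
// measured against under different producer/consumer cost ratios.
public class QueueBench {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1024);
        final int n = 50_000;
        long[] consumed = new long[1];

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) { queue.take(); consumed[0]++; }
            } catch (InterruptedException ignored) {}
        });

        long start = System.nanoTime();
        consumer.start();
        for (long i = 0; i < n; i++) {
            Thread.onSpinWait();      // stand-in for record construction cost
            queue.put(i);
        }
        consumer.join();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(consumed[0] + " records in " + elapsedMs + " ms");
    }
}
```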

