zhangyue19921010 commented on PR #5416:
URL: https://github.com/apache/hudi/pull/5416#issuecomment-1183759671
Hi @alexeykudinkin, thanks a lot for your attention! Glad to have more discussion :)

1. The Disruptor not only performs well in the multi-producer/multi-consumer model, but also performs well in single-producer/single-consumer scenarios thanks to its lock-free design, based on https://github.com/LMAX-Exchange/disruptor/wiki/Performance-Results. The Disruptor documentation also officially recommends the single-producer model: https://lmax-exchange.github.io/disruptor/user-guide/index.html#_introduction

   > One of the best ways to improve performance in concurrent systems is to adhere to the [Single Writer Principle](https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html), this applies to the Disruptor. If you are in the situation where there will only ever be a single thread producing events into the Disruptor, then you can take advantage of this to gain additional performance.

2. I am not sure why it would lead to an OOM `when reader is reading too fast and writing is not able to keep up`. To my limited knowledge, when consumers consume data quickly and producers produce data relatively slowly, consumers simply keep waiting, which hurts the throughput of the application rather than growing memory. What we want is to deliver the producer's data to the consumer side as soon as possible, thereby improving CPU utilization and throughput. This is why we want to use the Disruptor's queue: its lock-free design makes the data handoff more efficient.

3. I also fully agree with your point that the Disruptor cannot solve all problems, nor can it achieve satisfactory optimization results in every scenario. It depends on where the performance bottleneck of the user's Hudi ingestion lies.
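As an aside on the single-writer point above: the core idea can be sketched with plain JDK primitives. This is only an illustrative analogue of an SPSC lock-free queue, not the Disruptor's actual implementation (which adds cache-line padding, wait strategies, batching, etc.) and not Hudi's code; all names here are made up:

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal single-producer/single-consumer ring buffer illustrating the
 * Single Writer Principle: each sequence counter has exactly one writing
 * thread, so no locks or CAS retry loops are needed -- only ordered stores.
 */
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;                                    // capacity must be a power of two
    private final AtomicLong producerSeq = new AtomicLong(-1); // last published slot
    private final AtomicLong consumerSeq = new AtomicLong(-1); // last consumed slot

    SpscRingBuffer(int capacityPowerOfTwo) {
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Called from the single producer thread only; false when the ring is full. */
    boolean offer(T value) {
        long next = producerSeq.get() + 1;
        if (next - consumerSeq.get() > slots.length) {
            return false;                        // would overwrite an unread slot
        }
        slots[(int) (next & mask)] = value;
        producerSeq.lazySet(next);               // ordered store publishes the slot
        return true;
    }

    /** Called from the single consumer thread only; null when the ring is empty. */
    @SuppressWarnings("unchecked")
    T poll() {
        long next = consumerSeq.get() + 1;
        if (next > producerSeq.get()) {
            return null;                         // nothing published yet
        }
        T value = (T) slots[(int) (next & mask)];
        consumerSeq.lazySet(next);               // frees the slot for the producer
        return value;
    }
}
```

Because each `AtomicLong` is only ever advanced by one thread, the hot path is a volatile read plus an ordered write, with no contended lock between producer and consumer.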
For example, in the scenario simulated in the benchmark, if the user's schema is relatively simple, downstream consumption is fast, and the bottleneck is production (or a lot of time is spent waiting for data to become ready), then the Disruptor can deliver the most value. As for the conclusion `avoiding locks in that path will be able to reduce our compute footprint by about ~10%`: glad this will have a positive impact, at least it won't get worse :) Maybe we can run tests across a variety of scenarios, especially ones where the bottleneck is the data production speed, to see how the optimization works.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
