geserdugarov commented on PR #18276: URL: https://github.com/apache/hudi/pull/18276#issuecomment-4093989774
@vinothchandar, @yihua, the first benchmarks for DSv2 reads on a COW table are ready: https://github.com/apache/hudi/pull/18351.

```text
Data: 800 parquet files with 30 mln rows, 300 column, 100 GB in total.
============================================================
DSv2 vs DSv1 PERFORMANCE COMPARISON
============================================================
Full scan (COW)    : DSv1 avg 273.3s, DSv2 avg 278.0s, speedup 0.98x (DSv1 FASTER)
Projected (COW)    : DSv1 avg 7.3s, DSv2 avg 5.9s, speedup 1.24x (DSv2 FASTER)
Filter (COW)       : DSv1 avg 7.2s, DSv2 avg 6.0s, speedup 1.20x (DSv2 FASTER)
Limit (COW)        : DSv1 avg 56.6s, DSv2 avg 59.5s, speedup 0.95x (DSv1 FASTER)
Aggregate COUNT(*) : DSv1 avg 3.6s, DSv2 avg 0.2s, speedup 18.43x (DSv2 FASTER)
Aggregate MIN/MAX  : DSv1 avg 3.8s, DSv2 avg 0.2s, speedup 20.95x (DSv2 FASTER)
```

The DSv2 read implementation for COW is ready for review: https://github.com/apache/hudi/pull/18277.

I will investigate the causes of the 2% slowdown in the full scan and the 5% slowdown for limit queries.
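
For anyone reading the numbers, here is a minimal sketch of how the speedup column appears to be derived: DSv1 average time divided by DSv2 average time, so values below 1.0 mean DSv1 is faster. The helper names `speedup` and `slowdown_percent` are hypothetical and are not taken from the benchmark code in the PR. The printed averages are rounded, so ratios recomputed for the two aggregate rows (18.0x and 19.0x) differ from the reported 18.43x and 20.95x, which were presumably computed from unrounded timings.

```python
# Average timings in seconds, copied from the benchmark output above:
# query name -> (DSv1 avg, DSv2 avg).
timings = {
    "Full scan (COW)": (273.3, 278.0),
    "Projected (COW)": (7.3, 5.9),
    "Filter (COW)": (7.2, 6.0),
    "Limit (COW)": (56.6, 59.5),
}


def speedup(dsv1_avg: float, dsv2_avg: float) -> float:
    """Ratio of DSv1 time to DSv2 time; > 1.0 means DSv2 is faster."""
    return dsv1_avg / dsv2_avg


def slowdown_percent(dsv1_avg: float, dsv2_avg: float) -> float:
    """Percent by which DSv2 is slower than DSv1 (negative if DSv2 is faster)."""
    return (dsv2_avg / dsv1_avg - 1.0) * 100.0


for name, (v1, v2) in timings.items():
    winner = "DSv2" if v2 < v1 else "DSv1"
    print(f"{name:<16}: speedup {speedup(v1, v2):.2f}x ({winner} FASTER), "
          f"DSv2 time change {slowdown_percent(v1, v2):+.1f}%")
```

With these inputs, the full scan and limit rows come out roughly 2% and 5% slower on DSv2, matching the figures quoted above.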
