Dandandan commented on PR #20820: URL: https://github.com/apache/datafusion/pull/20820#issuecomment-4120554070
> I have implemented the work stealing scheduler idea and while it seems to show promise it still clearly is not ready (given the results above)
>
> [Screenshot 2026-03-24 at 7 24 53 AM]
> [Screenshot 2026-03-24 at 7 26 13 AM]
> [Screenshot 2026-03-24 at 7 30 16 AM]
>
> I spent quite a long time messing with Q23 and saw widely varying results. I think this is due to the fact that Q23 is very sensitive to the order in which the files are processed (e.g. top-k / dynamic filtering)
>
> However, I have also observed some flakiness in running tests and I think that is because some plans require a certain partitioning (e.g. to ensure data is passed across streams) and so having the FileStream process data across multiple partitions in this case causes incorrectness errors.
>
> My plan is to ensure we don't enable work stealing for plans that require data not to cross partitions

Looking at the screenshot I can see that DuckDB seems to do the IO on the same thread as where the processing happens (if your machine has 16 cores). DataFusion starts 50+ tokio blocking threads based on the concurrent IO, as each call to `spawn_blocking` will spawn a new one if all threads are busy.

In my understanding, this is not fully ideal:

* each thread will consume some memory
* (more importantly) IO is probably loaded on a different CPU core than where needed.
Probably it is not that big of a deal for Parquet (as decompressing / decoding Parquet might be ~1-2 GB/s and memory bandwidth is way higher).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
