davidhewitt opened a new issue, #10709: URL: https://github.com/apache/datafusion/issues/10709
### Describe the bug

I'm experimenting with DataFusion 38 (as a library, not the CLI) to query Parquet files in Google Cloud Storage. For now I'm doing a basic test, just `select distinct(some_column) from my_table`. I wrapped the `ObjectStore` and `TableProvider` traits to add tracing to their execution, and this is what I see:

I'd like to bring the query time well below my current 30s, and I think one dial in my control is to resize the Parquet files so that there are fewer requests.

Another observation is that the requests to Google Cloud Storage during query execution appear to be done *in series*. The table scan, on the other hand, is successfully done in parallel, so it takes a lot less time even though the number of requests and the duration per request are on a similar scale. This serial pattern on the requests seems like a performance bug to me.

### To Reproduce

Run `select distinct(some_column) from my_table`, where the table provider is Parquet in Google Cloud Storage.

### Expected behavior

I would hope we can parallelise the requests to Google Cloud Storage! Best case is that I'm told it's user error and there's a simple fix; otherwise, if work is needed in DataFusion, perhaps I can contribute a fix.

### Additional context

_No response_
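To illustrate why the serial request pattern described above matters, here is a minimal standalone sketch. It is not DataFusion or `object_store` code: `fake_get_range`, `fetch_serial`, and `fetch_concurrent` are hypothetical names, the latency is simulated with a sleep, and OS threads stand in for the async tasks a real implementation would likely use.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for one object-store range request: the sleep
// simulates network latency and the return value simulates the payload.
fn fake_get_range(id: usize, latency: Duration) -> usize {
    thread::sleep(latency);
    id * 10
}

// Issue the requests one after another: total time is roughly n * latency.
fn fetch_serial(n: usize, latency: Duration) -> (Vec<usize>, Duration) {
    let start = Instant::now();
    let results: Vec<usize> = (0..n).map(|id| fake_get_range(id, latency)).collect();
    (results, start.elapsed())
}

// Issue all requests at once, one thread each: total time is roughly one latency.
fn fetch_concurrent(n: usize, latency: Duration) -> (Vec<usize>, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|id| thread::spawn(move || fake_get_range(id, latency)))
        .collect();
    // Joining in spawn order keeps the results in request order.
    let results: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (results, start.elapsed())
}

fn main() {
    let latency = Duration::from_millis(50);
    let (serial, serial_time) = fetch_serial(8, latency);
    let (concurrent, concurrent_time) = fetch_concurrent(8, latency);
    assert_eq!(serial, concurrent);
    println!("serial: {:?}, concurrent: {:?}", serial_time, concurrent_time);
}
```

With per-request latency dominating, the serial variant scales linearly with the number of requests while the concurrent one stays close to a single round trip, which matches the gap I see between the query-time requests and the table scan.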
