davidhewitt opened a new issue, #10709:
URL: https://github.com/apache/datafusion/issues/10709

   ### Describe the bug
   
   I'm experimenting with DataFusion 38 (used as a library, not the CLI) to 
query Parquet files in Google Cloud Storage. For now I'm running a basic 
test, just `select distinct(some_column) from my_table`. 
   
   
   
   I wrapped my `ObjectStore` and `TableProvider` implementations to add 
tracing to their execution, and this is what I see:
   
   
![image](https://github.com/apache/datafusion/assets/1939362/4e2e061d-cf09-46c9-842a-58a4ad5bb14e)
   
   I'd like to make the query much faster than its current 30s, and I think 
one dial in my control is to resize the Parquet files so that there are fewer 
requests.
   
   Another observation is that the requests to Google Cloud Storage during 
query execution appear to be made *in series*. The table scan, on the other 
hand, is successfully done in parallel, so it takes a lot less time even 
though the number of requests and the duration per request are on a similar 
scale.
   
   This serial request pattern seems like a performance bug to me.
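   
   To illustrate the difference I mean, here is a toy sketch using only the 
Rust standard library. It is not DataFusion or `object_store` code: 
`fake_request` is a hypothetical stand-in that simulates one request with a 
fixed latency, and the function names are made up for this example.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one request to object storage (simulated latency).
fn fake_request(id: usize, latency: Duration) -> usize {
    thread::sleep(latency);
    id
}

/// Issue the requests one after another; total time is roughly n * latency.
fn fetch_serial(n: usize, latency: Duration) -> (Vec<usize>, Duration) {
    let start = Instant::now();
    let results = (0..n).map(|id| fake_request(id, latency)).collect();
    (results, start.elapsed())
}

/// Issue all requests at once; total time is roughly one latency.
fn fetch_parallel(n: usize, latency: Duration) -> (Vec<usize>, Duration) {
    let start = Instant::now();
    // Spawn one thread per simulated request so they all wait concurrently.
    let handles: Vec<_> = (0..n)
        .map(|id| thread::spawn(move || fake_request(id, latency)))
        .collect();
    // Join in spawn order so the results line up with the serial version.
    let results = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (results, start.elapsed())
}

fn main() {
    let latency = Duration::from_millis(50);
    let (serial, serial_time) = fetch_serial(8, latency);
    let (parallel, parallel_time) = fetch_parallel(8, latency);
    assert_eq!(serial, parallel);
    println!("serial: {serial_time:?}, parallel: {parallel_time:?}");
}
```

   With similar per-request latency, the serial version's wall time grows 
with the number of requests while the parallel version stays close to a 
single request, which is the gap I think I'm seeing between query execution 
and the table scan.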
   
   ### To Reproduce
   
   `select distinct(some_column) from my_table`, where the table provider 
reads Parquet files from Google Cloud Storage.
   
   ### Expected behavior
   
   I would hope we can parallelise the requests to Google Cloud Storage! Best 
case, I'm told it's user error and there's a simple fix; otherwise, if work is 
needed in DataFusion, perhaps I can contribute a fix.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

