[I] Google Cloud Storage requests during query execution being performed in series [datafusion]

via GitHub Wed, 29 May 2024 03:07:53 -0700


davidhewitt opened a new issue, #10709:
URL: https://github.com/apache/datafusion/issues/10709

### Describe the bug

I'm experimenting with DataFusion 38 (as a library, not the CLI) to query
Parquet files in Google Cloud Storage. For now I'm running a basic test, just
`select distinct(some_column) from my_table`.

I wrapped the `ObjectStore` and `TableProvider` traits to add tracing to
their execution, and this is what I see:

![image](https://github.com/apache/datafusion/assets/1939362/4e2e061d-cf09-46c9-842a-58a4ad5bb14e)

I'd like to improve the query performance to be much faster than my current
30s, and I think one dial in my control is to resize the parquet files so that
there are fewer requests.

Another observation is that the requests to Google Cloud Storage during
query execution appear to be made *in series*. The table scan, on the other
hand, is successfully parallelised, so it takes a lot less time even though
the number of requests and the duration per request are on a similar scale.

This serial pattern on the requests seems like a performance bug to me.
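
To illustrate why the serial pattern matters, here is a minimal stand-alone sketch. It is not DataFusion or `object_store` code: `fake_request` is a hypothetical stand-in that just sleeps for a fixed latency, and the request count and latency are made-up numbers. It only shows that N requests issued one after another cost roughly N times the per-request latency, while the same requests issued concurrently cost roughly one latency.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one object store request: it only sleeps.
fn fake_request(latency: Duration) -> usize {
    thread::sleep(latency);
    1 // one completed request
}

/// Issue `n` requests one after another; return (completed, elapsed).
fn run_serial(n: usize, latency: Duration) -> (usize, Duration) {
    let start = Instant::now();
    let done: usize = (0..n).map(|_| fake_request(latency)).sum();
    (done, start.elapsed())
}

/// Issue `n` requests at the same time, one thread each; return (completed, elapsed).
fn run_parallel(n: usize, latency: Duration) -> (usize, Duration) {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| thread::spawn(move || fake_request(latency)))
        .collect();
    let done: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (done, start.elapsed())
}

fn main() {
    // Assumed numbers, chosen only for illustration.
    let latency = Duration::from_millis(50);
    let (n_serial, t_serial) = run_serial(8, latency);
    let (n_parallel, t_parallel) = run_parallel(8, latency);
    println!("serial:   {n_serial} requests in {t_serial:?}");
    println!("parallel: {n_parallel} requests in {t_parallel:?}");
}
```

In my trace the query-execution requests look like the serial case, while the table scan looks like the parallel one.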

### To Reproduce

Run `select distinct(some_column) from my_table`, where the table provider
reads Parquet files from Google Cloud Storage.

### Expected behavior

I would hope we can parallelise the requests to Google Cloud Storage! The
best case is that I'm told it's user error and there's a simple fix;
otherwise, if work is needed in DataFusion, perhaps I can contribute a fix.

### Additional context

_No response_
