alamb opened a new issue, #16306: URL: https://github.com/apache/datafusion/issues/16306
### Is your feature request related to a problem or challenge?

- Part of https://github.com/apache/datafusion/issues/13456

I would like to make it as easy as possible to use datafusion-cli to query files on S3.

For example, after https://github.com/apache/datafusion/issues/16299 is merged I would like to be able to read from the ClickBench example datasets:

```sql
CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
```

However, when I run this I get the following error:

```sql
> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
Object Store error: Generic S3 error: Error performing HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet in 499.73175ms - Received redirect without LOCATION, this normally indicates an incorrectly configured region
```

This does give me the hint that the region is incorrectly configured, which is good. However, it doesn't tell me "WHAT" region I need.

If I provide the correct region (`eu-central-1`) it works great:

```sql
> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet' OPTIONS ('aws.region' 'eu-central-1');
0 row(s) fetched.
Elapsed 1.182 seconds.

> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.780 seconds.
```

I noticed that DuckDB and ClickHouse do not require the region to be set:

```sql
v1.2.2 7c039464e4
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from read_parquet('s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet');
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1000000     │
│ (1.00 million) │
└────────────────┘
```

### Describe the solution you'd like

I would like `datafusion-cli` to automatically find the region as well.

I did some investigation and the correct region is returned via a response header, which you can see via

```shell
curl -v -X HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
...
...
> HEAD /clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet HTTP/1.1
> Host: s3.us-east-1.amazonaws.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 301 Moved Permanently
< x-amz-bucket-region: eu-central-1
< x-amz-request-id: Q44G0APVQH5JHHC4
< x-amz-id-2: cubLiiba/Q138g5SbNNlSoGtARMxobuq7GhA+3t39il+Wj50HNPBUh4bOGVS2Bwlc6k4f0lp6r0=
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Fri, 06 Jun 2025 14:19:57 GMT
< Server: AmazonS3
```

Note the `x-amz-bucket-region` in the response:

```
< x-amz-bucket-region: eu-central-1
```

I suspect this will need some change upstream in the object_store crate, and I will work on filing an upstream ticket now.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
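
For illustration only, here is a minimal Python sketch of the lookup described above: send a `HEAD` request and read the `x-amz-bucket-region` response header, even when S3 answers with a 301 that has no `Location`. The function names `region_from_headers` and `discover_bucket_region` are hypothetical, and this is not how `datafusion-cli` or the object_store crate implements (or will implement) region discovery.

```python
import urllib.error
import urllib.request
from typing import Mapping, Optional

# Header seen in the curl output above.
REGION_HEADER = "x-amz-bucket-region"


def region_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the region named in the response headers, or None if absent.

    Header names are compared case-insensitively, so this works for a plain
    dict as well as for the header objects returned by urllib.
    """
    for name, value in headers.items():
        if name.lower() == REGION_HEADER:
            return value.strip()
    return None


def discover_bucket_region(url: str, timeout: float = 10.0) -> Optional[str]:
    """Hypothetical helper: issue a HEAD request and extract the region hint."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return region_from_headers(response.headers)
    except urllib.error.HTTPError as err:
        # A 301 without a Location header cannot be followed, so urllib
        # raises HTTPError; the response headers are still available on it.
        return region_from_headers(err.headers)
```

Calling `discover_bucket_region` with the `https://s3.us-east-1.amazonaws.com/...` URL from the curl example would be expected to return the same region that curl shows, which could then be passed as the `'aws.region'` option.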