alamb opened a new issue, #16306:
URL: https://github.com/apache/datafusion/issues/16306

   ### Is your feature request related to a problem or challenge?
   
   - Part of https://github.com/apache/datafusion/issues/13456
   
   I would like to make it as easy as possible to use datafusion-cli to query 
files on S3.
   
   For example, after https://github.com/apache/datafusion/issues/16299 is 
merged I would like to be able to read from the ClickBench example datasets:
   
   ```sql
   CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
   ```
   
   However, when I run this I get the following error:
   
   ```sql
   > CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
   Object Store error: Generic S3 error: Error performing HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet in 499.73175ms - Received redirect without LOCATION, this normally indicates an incorrectly configured region
   ```
   
   
   This does hint that the region is incorrectly configured, which is good, 
but it doesn't tell me *which* region I need.
   
   
   If I provide the correct region (`eu-central-1`) it works great:
   ```sql
   > CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet'
   OPTIONS ('aws.region' 'eu-central-1');
   0 row(s) fetched.
   Elapsed 1.182 seconds.
   
   > select count(*) from hits;
   +----------+
   | count(*) |
   +----------+
   | 1000000  |
   +----------+
   1 row(s) fetched.
   Elapsed 0.780 seconds.
   ```
   
   I noticed that DuckDB and ClickHouse do not require the region to be set:
   
   ```sql
   v1.2.2 7c039464e4
   Enter ".help" for usage hints.
   Connected to a transient in-memory database.
   Use ".open FILENAME" to reopen on a persistent database.
   D select count(*) from read_parquet('s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet');
   ┌────────────────┐
   │  count_star()  │
   │     int64      │
   ├────────────────┤
   │    1000000     │
   │ (1.00 million) │
   └────────────────┘
   ```
   
   ### Describe the solution you'd like
   
   I would like `datafusion-cli` to automatically find the region as well.
   
   I did some investigation, and the correct region is returned in a response 
header, which you can see via:
   
   ```shell
   curl -v -X HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
   ...
   ...
   > HEAD /clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet HTTP/1.1
   > Host: s3.us-east-1.amazonaws.com
   > User-Agent: curl/8.7.1
   > Accept: */*
   >
   * Request completely sent off
   < HTTP/1.1 301 Moved Permanently
   < x-amz-bucket-region: eu-central-1
   < x-amz-request-id: Q44G0APVQH5JHHC4
   < x-amz-id-2: cubLiiba/Q138g5SbNNlSoGtARMxobuq7GhA+3t39il+Wj50HNPBUh4bOGVS2Bwlc6k4f0lp6r0=
   < Content-Type: application/xml
   < Transfer-Encoding: chunked
   < Date: Fri, 06 Jun 2025 14:19:57 GMT
   < Server: AmazonS3
   ```
   
   Note the `x-amz-bucket-region` in the response:
   ```
   < x-amz-bucket-region: eu-central-1
   ```
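   
   As a minimal sketch of how that header could be picked out of a response, 
the function below scans raw header text (such as the `curl -v` output above) 
for `x-amz-bucket-region`. The function name `bucket_region_from_headers` is 
hypothetical and is not part of `datafusion-cli` or `object_store`; a real 
implementation would presumably read the header from the HTTP client's 
response object rather than from text.
   
   ```rust
   /// Hypothetical helper: extract the `x-amz-bucket-region` value from raw
   /// HTTP response header text. Returns `None` if the header is absent or empty.
   fn bucket_region_from_headers(raw: &str) -> Option<String> {
       raw.lines()
           // `curl -v` prefixes response header lines with "< "; strip it if present.
           .map(|line| line.trim().trim_start_matches('<').trim())
           // Keep only lines of the form "name: value"; the status line has no colon.
           .filter_map(|line| line.split_once(':'))
           // HTTP header names are case-insensitive.
           .find(|(name, _)| name.trim().eq_ignore_ascii_case("x-amz-bucket-region"))
           .map(|(_, value)| value.trim().to_string())
           .filter(|region| !region.is_empty())
   }
   ```
   
   Matching the header name case-insensitively keeps the helper working 
whether the headers arrive as `x-amz-bucket-region` or `X-Amz-Bucket-Region`.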
   
   I suspect this will need a change upstream in the `object_store` crate; I 
will file an upstream ticket now.
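   
   To make the desired behavior concrete, here is a rough sketch of a 
"retry once with the discovered region" flow. Everything in it is an 
assumption for illustration: the `Attempt` type and `with_region_fallback` 
function are hypothetical and do not correspond to any existing 
`object_store` API, and the closure stands in for whatever actually issues 
the request.
   
   ```rust
   /// Hypothetical outcome of one request attempt against a given region.
   enum Attempt<T> {
       /// The request succeeded.
       Ok(T),
       /// The server redirected and reported the bucket's real region
       /// (for example via `x-amz-bucket-region`).
       WrongRegion { actual: String },
       /// Any other failure.
       Failed(String),
   }
   
   /// Try `configured` first; if the server reports a different region,
   /// retry exactly once with it. Returns the value and the region that worked.
   fn with_region_fallback<T>(
       configured: &str,
       mut attempt: impl FnMut(&str) -> Attempt<T>,
   ) -> Result<(T, String), String> {
       match attempt(configured) {
           Attempt::Ok(value) => Ok((value, configured.to_string())),
           Attempt::Failed(err) => Err(err),
           Attempt::WrongRegion { actual } => match attempt(&actual) {
               Attempt::Ok(value) => Ok((value, actual)),
               Attempt::Failed(err) => Err(err),
               // Do not loop: a second redirect is reported as an error.
               Attempt::WrongRegion { actual: other } => Err(format!(
                   "redirected again after retrying in {}: server now reports {}",
                   actual, other
               )),
           },
       }
   }
   ```
   
   Limiting the fallback to a single retry avoids redirect loops, and 
returning the region that worked would let the caller name it in messages, 
which addresses the "which region do I need" gap in the current error.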
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

