alamb opened a new pull request, #16300: URL: https://github.com/apache/datafusion/pull/16300
## Which issue does this PR close?

- Closes https://github.com/apache/datafusion/issues/16299
- Related to https://github.com/apache/datafusion/issues/13456

## Rationale for this change

I want to be able to access public s3 buckets without providing (valid) s3 credentials

## What changes are included in this PR?

1. Add `skip_signature` option to `datafusion-cli` `CREATE EXTERNAL TABLE`
2. Default to `skip_signature` when other credentials are not provided
3. Update documentation

Before this PR:

```sql
DataFusion CLI v47.0.0
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
Object Store error: Generic S3 error: the credential provider was not enabled
```

After this PR:

```sql
DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
0 row(s) fetched.
Elapsed 1.575 seconds.

> select count(*) from nyc_taxi_rides;
+------------+
| count(*)   |
+------------+
| 1310903963 |
+------------+
1 row(s) fetched.
Elapsed 3.011 seconds.
```

## Are these changes tested?

Yes, new unit tests are added and I tested it manually.

For example, if you provide credentials, they take precedence over the default of skipping the signature (the credentials here are invalid, so the request fails):

```shell
AWS_ACCESS_KEY_ID=foo AWS_SECRET_ACCESS_KEY=bar cargo run -p datafusion-cli
```

```sql
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
Object Store error: Generic S3 error: Error performing list request: Error performing GET https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data?list-type=2&prefix=nyc_taxi_rides%2Fdata%2Ftripdata_parquet%2F in 134.200375ms - Server returned non-2xx status code: 403 Forbidden: <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>foo</AWSAccessKeyId><RequestId>ZAEM63Q02FQXYMTA</RequestId><HostId>mYh2PUtKzDxjrPA4vQm4d+Qae9TiNpCUDDTS5BP4jTayKVE4BQbSpT/+HSIAdzt3lne6G0sxNmE=</HostId></Error>
```

But you can override this with `SKIP_SIGNATURE`:

```sql
> CREATE EXTERNAL TABLE nyc_taxi_rides
STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/'
OPTIONS(AWS.SKIP_SIGNATURE 'true');
0 row(s) fetched.
Elapsed 1.455 seconds.
```

## Are there any user-facing changes?

Yes, `datafusion-cli` is easier to use: public S3 buckets can be read without supplying credentials.
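
For reviewers, the precedence exercised in the manual tests above can be summarized in a small sketch. This is not the code in this PR: `S3Auth` and `resolve_s3_auth` are hypothetical names used only for illustration, and the handling of an explicit `'false'` value is an assumption that the description above does not cover.

```rust
/// How S3 requests would be authenticated (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum S3Auth {
    /// Send unsigned (anonymous) requests, as needed for public buckets.
    SkipSignature,
    /// Sign requests with whatever credentials were supplied.
    Signed,
}

/// Hypothetical helper mirroring the precedence described in this PR:
/// 1. an explicit `AWS.SKIP_SIGNATURE` option wins,
/// 2. otherwise supplied credentials are used,
/// 3. otherwise requests are sent unsigned.
fn resolve_s3_auth(skip_signature_option: Option<bool>, has_credentials: bool) -> S3Auth {
    match skip_signature_option {
        // Explicit opt-in overrides any credentials found in the environment.
        Some(true) => S3Auth::SkipSignature,
        // Assumption: an explicit 'false' forces signing.
        Some(false) => S3Auth::Signed,
        // No explicit option: credentials take precedence over the default.
        None if has_credentials => S3Auth::Signed,
        // No option and no credentials: default to skipping the signature.
        None => S3Auth::SkipSignature,
    }
}
```

With this sketch, `resolve_s3_auth(None, true)` yields `S3Auth::Signed` (the 403 case above, since the credentials are invalid), while `resolve_s3_auth(Some(true), true)` yields `S3Auth::SkipSignature` (the `OPTIONS(AWS.SKIP_SIGNATURE 'true')` case).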