mbutrovich commented on PR #4335:
URL:
https://github.com/apache/datafusion-comet/pull/4335#issuecomment-4478448502
Thanks for the careful citations @karuppayya, those helped a lot.
I'm trying to figure out whether the broadcast and refresh machinery in this
PR is something Comet should own, or whether a vendor implementation under
#4309's SPI could give you the same end result with less Comet-side surface
area. I'd love your read on that.
One scope thing I want to flag up front: we also need this for raw Parquet
on S3, and any other native S3 reader Comet grows. Iceberg-vended creds are one
important case, but vendors with custom signers or per-path STS for plain
Parquet hit the same wall (Hadoop S3A signers don't surface the secret,
`AWSCredentialsProvider.getCredentials()` is parameterless). The SPI shape in
this PR is fairly Iceberg-shaped (named `CometCredentialProvider` under
`org.apache.comet.iceberg`, `initialize(catalogProperties)` keyed off the
catalog property bag, `ResolveContext(tableLocation)`), and it's not obvious
how the Parquet path fits since there's no catalog or table location. #4309
covers both paths through one SPI under `org.apache.comet.cloud.s3` with a
per-call `(bucket, path, mode)` context that works for either, and I'd want
whatever we land to keep that breadth.
On the REST-vended credentials point, the reason #4309 today can't reach
`credentials.uri` and friends is that `CometScanRule.filterStorageProperties`
strips them down to `s3.`/`gcs.`/`adls.`/`client.` prefixes before they cross
JNI. If we relax that filter so the full FileIO property map crosses as
`catalog_properties`, and we only narrow to storage-prefix keys at the
iceberg-rust `FileIOBuilder.with_prop` call site, then a vendor under #4309
sees everything `LoadTableResponse` returned. That seems to obsolete the
parallel `allFileIOProperties` field this PR adds, with a much smaller change.
On multi-catalog with shared FQCN, you're right that #4309 has the gap
today. I think it's fixable inside the existing SPI shape by extending
`CometS3CredentialDispatcher` to cache by `(FQCN, catalog id)` and adding
`initialize(Map<String,String> catalogProperties)` to
`CometS3CredentialProvider`, called once per catalog on first instantiation.
That mirrors `S3FileIO.initialize`, which is the shape vendors already write
against for Iceberg.
What I'm less sure should live in Comet is refresh and distribution. The
Rust-side TTL cache, the per-plan single-catalog invariant that throws on
multi-catalog plans, and the driver-side broadcast cache feel like things
Iceberg's own `CachedSupplier` / `VendedCredentialsProvider` and Spark's
broadcast already cover, just inside the vendor's implementation rather than in
Comet. Concretely, an Iceberg REST vended-credential
`CometS3CredentialProvider` would look roughly like this:
```java
public class IcebergRESTVendedS3Provider implements
CometS3CredentialProvider {
private volatile VendedCredentialsProvider provider;
@Override
public void initialize(Map<String, String> catalogProperties) {
this.provider = VendedCredentialsProvider.create(catalogProperties);
}
@Override
public CometS3Credentials getCredentialsForPath(
String bucket, String path, CometS3AccessMode mode) {
AwsSessionCredentials c = (AwsSessionCredentials)
provider.resolveCredentials();
long expiresAt =
c.expirationTime().map(Instant::toEpochMilli).orElse(-1L);
return new CometS3Credentials(
c.accessKeyId(), c.secretAccessKey(), c.sessionToken(), expiresAt);
}
}
```
Iceberg's `CachedSupplier` inside `VendedCredentialsProvider` handles
refresh and prefetch, so neither the vendor nor Comet has to reinvent it. We
could ship something like this as a reference implementation in Comet alongside
the SPI so REST users aren't writing it from scratch, while keeping
caching/refresh/distribution policy out of Comet itself.
Does this shape cover what you need, or is there a scenario I'm missing
where the vendor model in #4309 (extended with the unfiltered properties +
`initialize(Map)`) can't reach? Want to make sure I'm not arguing against a
real requirement.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]