I can't speak to why Hadoop or fsspec are designed that way, but the
following come to mind:
- Systems typically draw a separation between system config, such as
credentials, and the user-supplied URI, which may be provided as part of
a SQL string, for example
- It avoids needing to define a URI encoding mechanism for the
credentials, which may require percent-encoding, etc.
- Related to the above, but the ergonomics of the more explicit approach
will normally be superior
- There is a good chance of URIs being incorporated into logs and so
having them contain credentials is problematic, especially if in a
non-standard encoding that likely won't be sanitised
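As a rough illustration of the logging point, the sketch below uses only the Python standard library. The URIs, the host name, and the `secret_key` query parameter are made up for the example and are not taken from any of the libraries discussed. A generic sanitiser that knows about the standard userinfo position catches one form but not the other:

```python
from urllib.parse import urlsplit, urlunsplit


def redact_userinfo(uri: str) -> str:
    """Mask the user:password part of a URI before it is logged."""
    parts = urlsplit(uri)
    if parts.username is None and parts.password is None:
        return uri  # nothing in the standard userinfo position
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=f"***@{host}"))


# Standard placement: credentials sit in the userinfo component.
standard = "s3://my-key:my-secret@example-host/bucket/data.parquet"
print(redact_userinfo(standard))

# Hypothetical non-standard placement: the secret is a query parameter,
# so the same sanitiser passes it through untouched.
nonstandard = "s3://example-host/bucket/data.parquet?secret_key=my-secret"
print(redact_userinfo(nonstandard))
```

When the credentials are passed separately from the URI, the URI can be logged as-is and no such sanitiser is needed.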
The ObjectStore design discussion can be found here [1]. The design was
intended to consolidate the approaches already incorporated into
DataFusion and polars, which share this separation.
[1]: https://github.com/apache/arrow-rs-object-store/issues/176
On 09/04/2025 16:06, Benjamin Kietzman wrote:
Thanks Raphael,
Do you have a reference which explains the rationale for that separation?
It's not obvious to me what the priorities are.
I can guess that a URI without secrets might be shared between multiple
users, and their individual tokens etc inserted to grant distinct access.
However for that case it seems to me that there wouldn't be a significant
difference between
FileSystem.from_uri(uri, **extra_options_and_secrets)
FileSystem.from_uri(uri_template.format(**extra_options_and_secrets))
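One place where the two calls could behave differently is when a secret contains URI delimiters. The sketch below is only an illustration using the Python standard library. The template, key, secret, and host are hypothetical, and it does not model the actual `FileSystem.from_uri` parsing:

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical values; the secret deliberately contains URI delimiters.
secret = "abc/def@ghi"
uri_template = "s3://{key}:{secret}@example-host/bucket"

# Naive templating: the "/" inside the secret ends the authority early,
# so the parser no longer sees the intended host.
naive = uri_template.format(key="my-key", secret=secret)
naive_host = urlsplit(naive).hostname

# Templating with percent-encoding round-trips, provided the caller and
# the parser agree on the encoding.
encoded = uri_template.format(key="my-key", secret=quote(secret, safe=""))
parts = urlsplit(encoded)
recovered = unquote(parts.password)

print(naive_host)
print(parts.hostname)
print(recovered == secret)
```

With the keyword-argument form the secret is passed verbatim, so no encoding convention has to be agreed on.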
On Wed, Apr 9, 2025 at 9:44 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:
I'm not all that familiar with the C++ filesystem abstraction, but for
ObjectStore, the closest equivalent abstraction in the Rust ecosystem,
we follow what fsspec [1] and Hadoop [2] do and allow providing a set of
key-value string pairs along with the URI [3]. This provides a great
deal of flexibility to end-users as to where and how to source this
configuration, including potentially fetching secrets from other sources
or the environment.
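A minimal sketch of that pattern, in Python for brevity: `options_from_env` and `open_store` are hypothetical helpers written for this example, not the object_store, fsspec, or Hadoop API, and the `STORE_` prefix and option names are assumptions.

```python
import os
from urllib.parse import urlsplit


def options_from_env(prefix: str, environ=os.environ) -> dict[str, str]:
    """Collect PREFIX_* variables as lower-cased string options."""
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


def open_store(uri: str, options: dict[str, str]) -> dict:
    """Hypothetical constructor: location from the URI, config from options."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "bucket": parts.netloc,
        "path": parts.path,
        "options": dict(options),
    }


# A stand-in environment; a real caller would rely on os.environ.
env = {
    "STORE_ACCESS_KEY_ID": "my-key",
    "STORE_SECRET_ACCESS_KEY": "my-secret",
    "HOME": "/home/me",
}
explicit = {"region": "us-east-1"}

# Explicit options take precedence over those found in the environment.
uri = "s3://bucket/data"
store = open_store(uri, {**options_from_env("STORE_", env), **explicit})
print(store["bucket"], sorted(store["options"]))
```

The URI stays free of secrets, and the caller decides where each key-value pair comes from.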
[1]: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.filesystem
[2]: https://hadoop.apache.org/docs/r3.0.0/api/org/apache/hadoop/fs/FileSystem.html#get-java.net.URI-org.apache.hadoop.conf.Configuration-
[3]: https://docs.rs/object_store/latest/object_store/fn.parse_url_opts.html
On 09/04/2025 15:35, Benjamin Kietzman wrote:
I have been working on modularizing the C++ library by extending FileSystem
construction from URIs. I recently merged a PR which prompted some
discussion [1] of how the library should handle secrets.
Some FileSystems cannot be constructed without one or more secrets. For
example, an S3FileSystem might require a proxy's username and password in
order to configure the client which the S3FileSystem wraps. Since the
usefulness of S3 and other filesystems which may only use default
credentials is very limited, I think it's safe to say that any interface
for construction of filesystems must accept secrets as parameters.
In the C++ library and its bindings, FileSystems can be constructed from a
URI. This modular interface means that libarrow can construct an
S3FileSystem even without being compiled with/linked to the AWS SDK. Since
URIs must be complete specifications of a filesystem, this necessitates
inclusion of the secrets required by S3 in the URI. Since anyone with a URI
has access to the filesystem to which it refers, these filesystem URIs are
transitively secret.
This can and should be better documented, but first we should discuss
whether URIs-which-are-secrets is an acceptable interface. As a minimal
example of an alternative design, we could extend the FileSystemFactory
interface, allowing URIs to reference secrets registered by name elsewhere:
"s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}". (New secrets
may be added like GetSecretRegistry()->AddSecret({.key =
"my-s3-secret-key", .secret = "sw0rdf1sh"});)
Is explicit out-of-URI secret management necessary, or is it sufficient to
document that since filesystem URIs represent access to their referent they
must be guarded accordingly?
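To make the named-secret alternative concrete, here is a rough sketch in Python rather than C++. The `SecretRegistry` class, its method names, the key value, the bucket name, and the `example-host` endpoint are all hypothetical and do not correspond to an existing Arrow API. It assumes that each `{name}` placeholder is replaced by the percent-encoded secret at construction time:

```python
import re
from urllib.parse import quote


class SecretRegistry:
    """Hypothetical in-process registry mapping secret names to values."""

    def __init__(self) -> None:
        self._secrets: dict[str, str] = {}

    def add_secret(self, key: str, secret: str) -> None:
        self._secrets[key] = secret

    def resolve(self, uri_template: str) -> str:
        """Replace each {name} placeholder with its percent-encoded secret."""

        def lookup(match: re.Match) -> str:
            name = match.group(1)
            if name not in self._secrets:
                raise KeyError(f"no secret registered under {name!r}")
            return quote(self._secrets[name], safe="")

        return re.sub(r"\{([^{}]+)\}", lookup, uri_template)


registry = SecretRegistry()
registry.add_secret("my-s3-key", "AKIDEXAMPLE")
registry.add_secret("my-s3-secret-key", "sw0rdf1sh")
registry.add_secret("my-secret-bucket", "bucket-1")

# The template holds only names, so it can be shared or logged.
template = "s3://{my-s3-key}:{my-s3-secret-key}@example-host/{my-secret-bucket}"
resolved = registry.resolve(template)
print(template)
```

Only the resolved string is sensitive, and it need not leave the process that constructs the filesystem.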
Ben Kietzman
[1] https://github.com/apache/arrow/pull/41559#discussion_r1768836077