I have been working on modularizing the C++ library by extending FileSystem
construction from URIs. I recently merged a PR which prompted some
discussion [1] of how the library should handle secrets.

Some FileSystems cannot be constructed without one or more secrets. For
example, an S3FileSystem might require a proxy's username and password in
order to configure the client which the S3FileSystem wraps. An S3FileSystem
(or any other filesystem) restricted to default credentials is of very
limited use, so I think it's safe to say that any interface for
constructing filesystems must accept secrets as parameters.

In the C++ library and its bindings, FileSystems can be constructed from a
URI. This modular interface means that libarrow can construct an
S3FileSystem even without being compiled with or linked to the AWS SDK. Since
URIs must be complete specifications of a filesystem, this necessitates
inclusion of the secrets required by S3 in the URI. Since anyone with a URI
has access to the filesystem to which it refers, these filesystem URIs are
transitively secret.
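
To make the "transitively secret" point concrete, here is a minimal sketch of
one consequence: a URI carrying credentials cannot be logged or echoed as-is.
RedactUserInfo is a hypothetical helper written for this email, not an
existing Arrow function, and the URI in the usage note is made up.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Arrow): replaces the userinfo component
// ("user:password" before the '@') of a URI with "***" so that the result
// can be logged without leaking credentials.
std::string RedactUserInfo(const std::string& uri) {
  const std::string scheme_sep = "://";
  const auto scheme_end = uri.find(scheme_sep);
  if (scheme_end == std::string::npos) return uri;
  const auto authority_begin = scheme_end + scheme_sep.size();

  // The authority ends at the first '/', '?' or '#', or at the end of the URI.
  auto authority_end = uri.find_first_of("/?#", authority_begin);
  if (authority_end == std::string::npos) authority_end = uri.size();

  // The userinfo ends at the last '@' inside the authority, if there is one.
  const auto at = uri.rfind('@', authority_end);
  if (at == std::string::npos || at < authority_begin) return uri;
  assert(at < authority_end);

  return uri.substr(0, authority_begin) + "***" + uri.substr(at);
}
```

With this sketch, RedactUserInfo("s3://my-key:my-secret@bucket/path") yields
"s3://***@bucket/path", while a URI without userinfo is returned unchanged.
Secrets passed as query parameters are not covered by this helper.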

This can and should be better documented, but first we should discuss
whether URIs-which-are-secrets is an acceptable interface. As a minimal
example of an alternative design, we could extend the FileSystemFactory
interface, allowing URIs to reference secrets registered by name elsewhere:
"s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}". (New secrets
could be registered with something like GetSecretRegistry()->AddSecret({.key =
"my-s3-secret-key", .secret = "sw0rdf1sh"});)
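
A rough sketch of how such a registry might behave, purely for discussion.
None of this exists in Arrow: Secret, SecretRegistry, Resolve and the
singleton accessor are all assumed names, and the sketch does no
percent-encoding or thread synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical types for illustration only.
struct Secret {
  std::string key;
  std::string secret;
};

class SecretRegistry {
 public:
  // Registers a secret under a name, replacing any previous value.
  void AddSecret(const Secret& s) {
    assert(!s.key.empty());
    secrets_[s.key] = s.secret;
  }

  // Replaces every "{name}" in the template with the registered secret.
  // Returns std::nullopt if a name is not registered or a '{' is unclosed.
  std::optional<std::string> Resolve(const std::string& uri_template) const {
    std::string out;
    std::size_t pos = 0;
    while (pos < uri_template.size()) {
      const auto open = uri_template.find('{', pos);
      if (open == std::string::npos) {
        out.append(uri_template, pos, std::string::npos);
        break;
      }
      const auto close = uri_template.find('}', open + 1);
      if (close == std::string::npos) return std::nullopt;
      out.append(uri_template, pos, open - pos);
      const auto it =
          secrets_.find(uri_template.substr(open + 1, close - open - 1));
      if (it == secrets_.end()) return std::nullopt;
      out += it->second;
      pos = close + 1;
    }
    return out;
  }

 private:
  std::map<std::string, std::string> secrets_;
};

// Process-wide registry, mirroring the GetSecretRegistry() call above.
SecretRegistry* GetSecretRegistry() {
  static SecretRegistry registry;
  return &registry;
}
```

Under these assumptions, a factory would call Resolve() on the URI template
just before constructing the filesystem, so the template itself could be
stored or logged without exposing the secrets it names.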

Is explicit out-of-URI secret management necessary, or is it sufficient to
document that, since a filesystem URI grants access to the filesystem it
refers to, such URIs must be guarded like any other secret?

Ben Kietzman
[1] https://github.com/apache/arrow/pull/41559#discussion_r1768836077
