Hi Ben and all,

Sorry for chiming in late. I do find the URI-and-kv-pairs interface attractive.

That said, some filesystem options can't reasonably be expressed as strings. For example, `S3Options` has a `std::shared_ptr<const KeyValueMetadata> default_metadata` and a `std::shared_ptr<S3RetryStrategy>`.

So, perhaps we want to allow for generic option values, and therefore have an interface looking like:
```
/// \param[in] uri the URI to give access to
/// \param[in] options a list of backend-specific filesystem options
///            Each option is a (name, value) pair.
///            The expected type is specific to the backend and
///            option name.
Result<std::shared_ptr<FileSystem>> FileSystemFromUri(
    std::string_view uri,
    const std::vector<std::pair<std::string_view, std::any>>& options);
```
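As a rough sketch of how a backend might consume such options, assuming stand-in types rather than Arrow's real ones: `RetryStrategy`, `BackendOptions`, `OptionList`, `ApplyOptions` and the option names `region` and `retry_strategy` are all hypothetical. Each backend would check the dynamic type of every value with `std::any_cast` and report a mismatch instead of throwing.

```cpp
#include <any>
#include <memory>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT Arrow's real types.
struct RetryStrategy {
  int max_attempts = 3;
};

struct BackendOptions {
  std::string region;
  std::shared_ptr<RetryStrategy> retry_strategy;
};

using OptionList = std::vector<std::pair<std::string_view, std::any>>;

// Copies recognized options into `out`.
// Returns an error message on failure, std::nullopt on success.
std::optional<std::string> ApplyOptions(const OptionList& options,
                                        BackendOptions* out) {
  for (const auto& [name, value] : options) {
    if (name == "region") {
      // The pointer form of any_cast yields nullptr on a type mismatch.
      const auto* region = std::any_cast<std::string>(&value);
      if (region == nullptr) {
        return std::string("option 'region' expects std::string");
      }
      out->region = *region;
    } else if (name == "retry_strategy") {
      const auto* strategy =
          std::any_cast<std::shared_ptr<RetryStrategy>>(&value);
      if (strategy == nullptr) {
        return std::string(
            "option 'retry_strategy' expects std::shared_ptr<RetryStrategy>");
      }
      out->retry_strategy = *strategy;
    } else {
      return "unknown option '" + std::string(name) + "'";
    }
  }
  return std::nullopt;
}
```

One caveat with this design: `std::any` matches on the exact stored type, so a caller passing the literal `"us-east-1"` stores a `const char*` and fails the `std::string` check above. The expected type of each option would need to be documented carefully.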


On 10/04/2025 at 19:38, Benjamin Kietzman wrote:
Hi Bryce,

I meant to say that since the C++ interface for constructing filesystems
is [1]

     Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string& uri);

it follows that the only argument (the uri) must contain all the
information required.

Certainly alternate interfaces such as the uri-and-kv-pairs which Raphael
described would be possible, sorry for the confusion; I definitely did not
mean MUST in an RFC 2119 sense.

[1]
https://arrow.apache.org/docs/cpp/api/filesystem.html#high-level-factory-functions

On Wed, Apr 9, 2025 at 1:09 PM Bryce Mecum <bryceme...@gmail.com> wrote:

Hi Ben, would you be able to elaborate on this part:

Since URIs must be complete specifications of a filesystem, this
necessitates inclusion of the secrets required by S3 in the URI. Since
anyone with a URI has access to the filesystem to which it refers, these
filesystem URIs are transitively secret.

Why must URIs be the complete specification of a filesystem? Why does
having a URI confer access to the resource, or does that point just
follow from your previous?

On Wed, Apr 9, 2025 at 7:37 AM Benjamin Kietzman <bengil...@gmail.com> wrote:

I have been working on modularizing the C++ library by extending
FileSystem construction from URIs. I recently merged a PR which prompted
some discussion [1] of how the library should handle secrets.

Some FileSystems cannot be constructed without one or more secrets. For
example, an S3FileSystem might require a proxy's username and password in
order to configure the client which the S3FileSystem wraps. Since the
usefulness of S3 and other filesystems which may only use default
credentials is very limited, I think it's safe to say that any interface
for construction of filesystems must accept secrets as parameters.

In the C++ library and its bindings, FileSystems can be constructed from a
URI. This modular interface means that libarrow can construct an
S3FileSystem even without being compiled with/linked to the AWS SDK. Since
URIs must be complete specifications of a filesystem, this necessitates
inclusion of the secrets required by S3 in the URI. Since anyone with a URI
has access to the filesystem to which it refers, these filesystem URIs are
transitively secret.

This can and should be better documented, but first we should discuss
whether URIs-which-are-secrets is an acceptable interface. As a minimal
example of an alternative design, we could extend the FileSystemFactory
interface, allowing URIs to reference secrets registered by name elsewhere:
"s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}". (New secrets
may be added like GetSecretRegistry()->AddSecret({.key =
"my-s3-secret-key", .secret = "sw0rdf1sh"});)
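
The placeholder idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Secret`, `SecretRegistry`, `Find` and `ResolveSecrets` are hypothetical names, not an existing Arrow API, and the sketch ignores percent-encoding of the substituted values.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical types for illustration; not part of Arrow.
struct Secret {
  std::string key;
  std::string secret;
};

class SecretRegistry {
 public:
  void AddSecret(Secret s) { secrets_[std::move(s.key)] = std::move(s.secret); }

  std::optional<std::string> Find(std::string_view key) const {
    auto it = secrets_.find(std::string(key));
    if (it == secrets_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::map<std::string, std::string> secrets_;
};

// Replaces each "{name}" in `uri` with the secret registered under `name`.
// Returns std::nullopt if a name is unknown or a brace is unterminated.
std::optional<std::string> ResolveSecrets(std::string_view uri,
                                          const SecretRegistry& registry) {
  std::string out;
  std::size_t pos = 0;
  while (pos < uri.size()) {
    const std::size_t open = uri.find('{', pos);
    if (open == std::string_view::npos) {
      out.append(uri.substr(pos));  // no more placeholders
      break;
    }
    out.append(uri.substr(pos, open - pos));
    const std::size_t close = uri.find('}', open);
    if (close == std::string_view::npos) return std::nullopt;
    auto secret = registry.Find(uri.substr(open + 1, close - open - 1));
    if (!secret) return std::nullopt;
    out.append(*secret);
    pos = close + 1;
  }
  return out;
}
```

With this shape, the URI that gets logged or passed around contains only secret names; the resolved string exists only at the point where the filesystem is actually constructed.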

Is explicit out-of-URI secret management necessary, or is it sufficient to
document that since filesystem URIs represent access to their referent they
must be guarded accordingly?

Ben Kietzman
[1] https://github.com/apache/arrow/pull/41559#discussion_r1768836077
