Re: [C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-09 Thread Benjamin Kietzman
Potentially useful further context:
the current C++ FileSystem documentation at
https://arrow.apache.org/docs/cpp/io.html#filesystems



[C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-09 Thread Benjamin Kietzman
I have been working on modularizing the C++ library by extending FileSystem
construction from URIs. I recently merged a PR which prompted some
discussion [1] of how the library should handle secrets.

Some FileSystems cannot be constructed without one or more secrets. For
example, an S3FileSystem might require a proxy's username and password in
order to configure the client which the S3FileSystem wraps. Since the
usefulness of S3 and other filesystems which may only use default
credentials is very limited, I think it's safe to say that any interface
for construction of filesystems must accept secrets as parameters.

In the C++ library and its bindings, FileSystems can be constructed from a
URI. This modular interface means that libarrow can construct an
S3FileSystem even without being compiled with/linked to the AWS SDK. Since
URIs must be complete specifications of a filesystem, this necessitates
inclusion of the secrets required by S3 in the URI. Since anyone with a URI
has access to the filesystem to which it refers, these filesystem URIs are
transitively secret.

This can and should be better documented, but first we should discuss
whether URIs-which-are-secrets is an acceptable interface. As a minimal
example of an alternative design, we could extend the FileSystemFactory
interface, allowing URIs to reference secrets registered by name elsewhere:
"s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}". (New secrets
may be added like GetSecretRegistry()->AddSecret({.key =
"my-s3-secret-key", .secret = "sw0rdf1sh"});)
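To make the alternative design above more concrete, here is a minimal Python sketch of name-based secret resolution. It is not Arrow's API: `SecretRegistry`, `add_secret`, and `resolve` are hypothetical names, the key id and bucket values are invented, and percent-encoding the substituted values is an assumption about how such a registry might keep the resulting URI well-formed.

```python
import re
from urllib.parse import quote


class SecretRegistry:
    """Hypothetical in-process registry mapping secret names to values."""

    def __init__(self):
        self._secrets = {}

    def add_secret(self, key, secret):
        self._secrets[key] = secret

    def resolve(self, uri_template):
        """Replace each {name} placeholder with its percent-encoded secret."""

        def substitute(match):
            name = match.group(1)
            if name not in self._secrets:
                raise KeyError(f"no secret registered under {name!r}")
            # Percent-encode so characters such as '/' or '@' in a secret
            # cannot change the structure of the resulting URI.
            return quote(self._secrets[name], safe="")

        return re.sub(r"\{([^{}]+)\}", substitute, uri_template)


registry = SecretRegistry()
registry.add_secret("my-s3-key", "example-key-id")      # invented value
registry.add_secret("my-s3-secret-key", "sw0rdf1sh")
registry.add_secret("my-secret-bucket", "bucket-1")     # invented value

# The template itself contains no secret material and could be logged.
template = "s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}"
resolved = registry.resolve(template)
```

In this sketch only `resolved` needs to be guarded; the template can be shared or stored in configuration without exposing credentials.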

Is explicit out-of-URI secret management necessary, or is it sufficient to
document that since filesystem URIs represent access to their referent they
must be guarded accordingly?

Ben Kietzman
[1] https://github.com/apache/arrow/pull/41559#discussion_r1768836077


Re: [C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-09 Thread Raphael Taylor-Davies
I'm not all that familiar with the C++ filesystem abstraction, but for 
ObjectStore, the closest equivalent abstraction in the Rust ecosystem, 
we follow what fsspec [1] and Hadoop [2] do and allow providing a set of 
key-value string pairs along with the URI [3]. This provides a great 
deal of flexibility to end-users as to where and how to source this 
configuration, including potentially fetching secrets from other sources 
or the environment.
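As a rough illustration of that pattern, the Python sketch below combines a secret-free URI with key-value options. It does not reproduce the fsspec, Hadoop, or object_store APIs: `filesystem_config`, the option names, and the `EXAMPLE_*` environment variable names are all hypothetical, and the function returns a plain dict where a real implementation would build a filesystem object.

```python
import os
from urllib.parse import urlsplit


def filesystem_config(uri, options=None, environ=None):
    """Combine a secret-free URI with separately supplied options."""
    environ = os.environ if environ is None else environ
    parts = urlsplit(uri)
    config = {"scheme": parts.scheme, "bucket": parts.netloc, "path": parts.path}

    # Lowest precedence: values taken from the environment.
    # These variable names are illustrative only.
    for env_name, option_name in (
        ("EXAMPLE_ACCESS_KEY", "access_key"),
        ("EXAMPLE_SECRET_KEY", "secret_key"),
    ):
        if env_name in environ:
            config[option_name] = environ[env_name]

    # Highest precedence: explicitly passed key-value pairs.
    config.update(options or {})
    return config


config = filesystem_config(
    "s3://my-bucket/data/file.parquet",
    options={"secret_key": "sw0rdf1sh"},
    environ={"EXAMPLE_ACCESS_KEY": "example-key-id", "EXAMPLE_SECRET_KEY": "from-env"},
)
```

Here the URI identifies only the location, while the caller decides where each credential comes from; the explicit option overrides the environment value.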


[1]: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.filesystem
[2]: https://hadoop.apache.org/docs/r3.0.0/api/org/apache/hadoop/fs/FileSystem.html#get-java.net.URI-org.apache.hadoop.conf.Configuration-
[3]: https://docs.rs/object_store/latest/object_store/fn.parse_url_opts.html




Re: [C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-09 Thread Bryce Mecum
Hi Ben, would you be able to elaborate on this part:

> Since URIs must be complete specifications of a filesystem, this necessitates 
> inclusion of the secrets required by S3 in the URI. Since anyone with a URI 
> has access to the filesystem to which it refers, these filesystem URIs are 
> transitively secret.

Why must URIs be the complete specification of a filesystem? And why does
having a URI confer access to the resource, or does that point just
follow from your previous one?



Re: [C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-09 Thread Raphael Taylor-Davies
I can't speak to why Hadoop or fsspec are designed that way, but the 
following come to mind:


- Systems typically draw a separation between system config, such as 
credentials, and the user-supplied URI, which may be provided as part of 
a SQL string, for example
- It avoids needing to define a URI encoding mechanism, which may 
require percent-encoding, etc.
- Relatedly, the ergonomics of the more explicit approach will normally 
be superior
- There is a good chance of URIs being incorporated into logs and so 
having them contain credentials is problematic, especially if in a 
non-standard encoding that likely won't be sanitised
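The logging concern in the last point can be sketched in Python using the standard `urllib.parse` module. This is an illustration only: `redact_userinfo` is a hypothetical helper standing in for a typical log sanitiser, and the `secret_key` query parameter is an invented example of a non-standard credential encoding.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_userinfo(uri):
    """Return the URI with any user:password component replaced by '***'."""
    parts = urlsplit(uri)
    if "@" not in parts.netloc:
        return uri
    # Everything before the last '@' in the authority is userinfo.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc="***@" + host))


# Credentials in the standard userinfo position are caught by the sanitiser.
standard = redact_userinfo("s3://example-key-id:sw0rdf1sh@my-bucket/data")

# Credentials carried in a scheme-specific query parameter pass through
# unchanged, so the secret would still end up in the log line.
non_standard = redact_userinfo("s3://my-bucket/data?secret_key=sw0rdf1sh")
```

A sanitiser that only knows the generic URI layout has no way to tell which query parameters are sensitive, which is why keeping credentials out of the URI sidesteps the problem.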


The ObjectStore design discussion can be found here [1]; the design was 
intended to consolidate the approaches already incorporated into 
DataFusion and polars, which share this separation.


[1]: https://github.com/apache/arrow-rs-object-store/issues/176




Re: [C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-09 Thread Benjamin Kietzman
Thanks Raphael,

Do you have a reference which explains the rationale for that separation?
It's not obvious to me what the priorities are.

I can guess that a URI without secrets might be shared between multiple
users, and their individual tokens etc. inserted to grant distinct access.
However, for that case it seems to me that there wouldn't be a significant
difference between

FileSystem.from_uri(uri, **extra_options_and_secrets)
FileSystem.from_uri(uri_template.format(**extra_options_and_secrets))
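For illustration, the templated form can be tried out with Python's standard `urllib.parse`. This sketch does not call any Arrow API; the template, key id, and password are invented, and it only shows what a generic URI parser sees after formatting. One detail it surfaces is that the templated form needs each value percent-encoded before substitution.

```python
from urllib.parse import quote, unquote, urlsplit

uri_template = "s3://{access_key}:{secret_key}@my-bucket/data"
secrets = {"access_key": "example-key-id", "secret_key": "pa/ss@word"}

# Naive formatting: the '/' inside the secret ends the authority early,
# so the parsed authority and path no longer match what was intended.
naive = urlsplit(uri_template.format(**secrets))

# Percent-encoding each value first keeps the URI structure intact.
encoded = urlsplit(
    uri_template.format(**{k: quote(v, safe="") for k, v in secrets.items()})
)

# The consumer of the URI must then decode the value to recover the secret.
recovered = unquote(encoded.password)
```

With the encoding step in place the two call forms carry the same information; without it, a secret containing reserved characters silently changes which host and path the URI refers to.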




Arrow community meeting April 9 at 17:00 UTC

2025-04-09 Thread Ian Cook
Our next biweekly Arrow community meeting is today, Wednesday 9 April 17:00
UTC / 12:00 EDT / 9:00 PDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the document to
add the topics that you would like to discuss.

Ian