Re: [C++][DISCUSS] FileSystem construction from URIs and secrets
Potentially useful further context: the current C++ FileSystem documentation at
https://arrow.apache.org/docs/cpp/io.html#filesystems

On Wed, Apr 9, 2025 at 9:35 AM Benjamin Kietzman wrote:

> I have been working on modularizing the C++ library by extending
> FileSystem construction from URIs. I recently merged a PR which prompted
> some discussion [1] of how the library should handle secrets.
>
> Some FileSystems cannot be constructed without one or more secrets. For
> example, an S3FileSystem might require a proxy's username and password in
> order to configure the client which the S3FileSystem wraps. Since the
> usefulness of S3 and other filesystems which may only use default
> credentials is very limited, I think it's safe to say that any interface
> for construction of filesystems must accept secrets as parameters.
>
> In the C++ library and its bindings, FileSystems can be constructed from a
> URI. This modular interface means that libarrow can construct an
> S3FileSystem even without being compiled with/linked to the AWS SDK. Since
> URIs must be complete specifications of a filesystem, this necessitates
> inclusion of the secrets required by S3 in the URI. Since anyone with a URI
> has access to the filesystem to which it refers, these filesystem URIs are
> transitively secret.
>
> This can and should be better documented, but first we should discuss
> whether URIs-which-are-secrets is an acceptable interface. As a minimal
> example of an alternative design, we could extend the FileSystemFactory
> interface, allowing URIs to reference secrets registered by name elsewhere:
> "s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}". (New
> secrets may be added like GetSecretRegistry()->AddSecret({.key =
> "my-s3-secret-key", .secret = "sw0rdf1sh"});)
>
> Is explicit out-of-URI secret management necessary, or is it sufficient to
> document that since filesystem URIs represent access to their referent they
> must be guarded accordingly?
>
> Ben Kietzman
> [1] https://github.com/apache/arrow/pull/41559#discussion_r1768836077
[C++][DISCUSS] FileSystem construction from URIs and secrets
I have been working on modularizing the C++ library by extending FileSystem construction from URIs. I recently merged a PR which prompted some discussion [1] of how the library should handle secrets.

Some FileSystems cannot be constructed without one or more secrets. For example, an S3FileSystem might require a proxy's username and password in order to configure the client which the S3FileSystem wraps. Since the usefulness of S3 and other filesystems which may only use default credentials is very limited, I think it's safe to say that any interface for construction of filesystems must accept secrets as parameters.

In the C++ library and its bindings, FileSystems can be constructed from a URI. This modular interface means that libarrow can construct an S3FileSystem even without being compiled with/linked to the AWS SDK. Since URIs must be complete specifications of a filesystem, this necessitates inclusion of the secrets required by S3 in the URI. Since anyone with a URI has access to the filesystem to which it refers, these filesystem URIs are transitively secret.

This can and should be better documented, but first we should discuss whether URIs-which-are-secrets is an acceptable interface. As a minimal example of an alternative design, we could extend the FileSystemFactory interface, allowing URIs to reference secrets registered by name elsewhere: "s3://{my-s3-key}:{my-s3-secret-key}@.../{my-secret-bucket}". (New secrets may be added like GetSecretRegistry()->AddSecret({.key = "my-s3-secret-key", .secret = "sw0rdf1sh"});)

Is explicit out-of-URI secret management necessary, or is it sufficient to document that since filesystem URIs represent access to their referent they must be guarded accordingly?

Ben Kietzman

[1] https://github.com/apache/arrow/pull/41559#discussion_r1768836077
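To make the named-secret alternative concrete, here is a minimal Python sketch of the idea. It is not Arrow code: the `SecretRegistry` class, its `add_secret` and `resolve` methods, the bucket name, and the example key values are all hypothetical, and the percent-encoding step is an assumption about how substitution would need to work rather than anything stated in the proposal.

```python
import re
from urllib.parse import quote


class SecretRegistry:
    """Hypothetical in-process store mapping secret names to values."""

    def __init__(self):
        self._secrets = {}

    def add_secret(self, key, secret):
        self._secrets[key] = secret

    def resolve(self, uri_template):
        """Replace each {name} placeholder with its percent-encoded secret."""

        def substitute(match):
            name = match.group(1)
            if name not in self._secrets:
                raise KeyError(f"no secret registered under {name!r}")
            # Encode reserved characters so a secret cannot alter URI structure.
            return quote(self._secrets[name], safe="")

        return re.sub(r"\{([^{}]+)\}", substitute, uri_template)


registry = SecretRegistry()
registry.add_secret("my-s3-key", "AKIAEXAMPLE")
registry.add_secret("my-s3-secret-key", "sw0rd/f1sh")

# The template names secrets without containing them; only the resolved
# URI carries credentials and has to be guarded.
template = "s3://{my-s3-key}:{my-s3-secret-key}@example-bucket/data"
resolved = registry.resolve(template)
```

Under this sketch the template can be stored or shared freely, while the resolved string exists only at the point where the filesystem is constructed.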
Re: [C++][DISCUSS] FileSystem construction from URIs and secrets
I'm not all that familiar with the C++ filesystem abstraction, but for ObjectStore, the closest equivalent abstraction in the Rust ecosystem, we follow what fsspec [1] and Hadoop [2] do and allow providing a set of key-value string pairs along with the URI [3]. This provides a great deal of flexibility to end-users as to where and how to source this configuration, including potentially fetching secrets from other sources or the environment.

[1]: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.filesystem
[2]: https://hadoop.apache.org/docs/r3.0.0/api/org/apache/hadoop/fs/FileSystem.html#get-java.net.URI-org.apache.hadoop.conf.Configuration-
[3]: https://docs.rs/object_store/latest/object_store/fn.parse_url_opts.html
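The "URI plus key-value pairs" shape described in this message can be sketched in a few lines of Python. This is an illustration only and does not reproduce the API of object_store, fsspec, or Hadoop: the `filesystem_config` helper, the `EXAMPLE_FS_` environment prefix, and the option names are assumptions made up for the example.

```python
import os
from urllib.parse import urlsplit


def filesystem_config(uri, options=None, env=None):
    """Combine a credential-free URI with separately supplied options.

    Hypothetical helper: explicit options take priority over environment values.
    """
    env = os.environ if env is None else env
    parts = urlsplit(uri)
    if parts.username or parts.password:
        raise ValueError("credentials belong in options, not in the URI")

    config = {
        "scheme": parts.scheme,
        "bucket": parts.netloc,
        "path": parts.path.lstrip("/"),
    }
    # Pick up EXAMPLE_FS_* variables as lowest-priority defaults.
    prefix = "EXAMPLE_FS_"
    for name, value in env.items():
        if name.startswith(prefix):
            config[name[len(prefix):].lower()] = value
    config.update(options or {})
    return config


config = filesystem_config(
    "s3://example-bucket/data/2025",
    options={"secret_access_key": "sw0rdf1sh"},
    env={
        "EXAMPLE_FS_ACCESS_KEY_ID": "AKIAEXAMPLE",
        "EXAMPLE_FS_SECRET_ACCESS_KEY": "from-env",
    },
)
```

The URI stays a plain locator, and the caller decides whether each option comes from code, a configuration file, or the environment.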
Re: [C++][DISCUSS] FileSystem construction from URIs and secrets
Hi Ben, would you be able to elaborate on this part:

> Since URIs must be complete specifications of a filesystem, this necessitates
> inclusion of the secrets required by S3 in the URI. Since anyone with a URI
> has access to the filesystem to which it refers, these filesystem URIs are
> transitively secret.

Why must URIs be the complete specification of a filesystem? Why does having a URI confer access to the resource, or does that point just follow from your previous one?
Re: [C++][DISCUSS] FileSystem construction from URIs and secrets
I can't speak to why Hadoop or fsspec are designed that way, but the following come to mind:

- Systems typically draw a separation between system config, such as credentials, and the user-supplied URI, which may be provided as part of a SQL string, for example
- It avoids needing to define a URI encoding mechanism, which may require percent-encoding, etc...
- Related to the above, but the ergonomics of the more explicit approach will normally be superior
- There is a good chance of URIs being incorporated into logs and so having them contain credentials is problematic, especially if in a non-standard encoding that likely won't be sanitised

The ObjectStore design discussion can be found here [1]; the design was intended to consolidate the approaches already incorporated into DataFusion and polars, which share this separation.

[1]: https://github.com/apache/arrow-rs-object-store/issues/176

On 09/04/2025 16:06, Benjamin Kietzman wrote:

> Thanks Raphael,
>
> Do you have a reference which explains the rationale for that separation?
> It's not obvious to me what the priorities are. I can guess that a URI
> without secrets might be shared between multiple users, and their
> individual tokens etc inserted to grant distinct access. However for that
> case it seems to me that there wouldn't be a significant difference between
>
>     FileSystem.from_uri(uri, **extra_options_and_secrets)
>     FileSystem.from_uri(uri_template.format(**extra_options_and_secrets))
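Two of the concerns raised in this message, percent-encoding and credentials leaking into logs, can be shown with Python's standard `urllib.parse`. This is a sketch under assumptions: the `embed_credentials` and `redact` helpers, the bucket name, and the sample password are hypothetical, and nothing here reflects how Arrow or object_store actually parse URIs.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def embed_credentials(scheme, user, password, host_and_path):
    """Build a URI whose userinfo component is percent-encoded."""
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{userinfo}@{host_and_path}"


def redact(uri):
    """Return the URI with any userinfo replaced, suitable for logging."""
    parts = urlsplit(uri)
    if "@" not in parts.netloc:
        return uri
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit(parts._replace(netloc=f"REDACTED@{host}"))


secret = "p@ss/w:rd"

# Naive formatting: reserved characters in the secret change how the URI
# parses, so the host is no longer the intended bucket.
naive = f"s3://key:{secret}@example-bucket/data"
naive_host = urlsplit(naive).hostname

# Encoded: the structure survives and the secret round-trips.
encoded = embed_credentials("s3", "key", secret, "example-bucket/data")
parsed = urlsplit(encoded)
recovered = unquote(parsed.password)

# A log line should only ever see the redacted form.
safe_for_logs = redact(encoded)
```

Plain string formatting of a secret into a URI template is only equivalent to passing it as a separate option when the secret contains no reserved characters, and any URI that embeds credentials needs a redaction step before it reaches a log.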
Re: [C++][DISCUSS] FileSystem construction from URIs and secrets
Thanks Raphael,

Do you have a reference which explains the rationale for that separation? It's not obvious to me what the priorities are. I can guess that a URI without secrets might be shared between multiple users, and their individual tokens etc inserted to grant distinct access. However for that case it seems to me that there wouldn't be a significant difference between

    FileSystem.from_uri(uri, **extra_options_and_secrets)
    FileSystem.from_uri(uri_template.format(**extra_options_and_secrets))

On Wed, Apr 9, 2025 at 9:44 AM Raphael Taylor-Davies wrote:

> I'm not all that familiar with the C++ filesystem abstraction, but for
> ObjectStore, the closest equivalent abstraction in the Rust ecosystem,
> we follow what fsspec [1] and Hadoop [2] do and allow providing a set of
> key-value string pairs along with the URI [3]. This provides a great
> deal of flexibility to end-users as to where and how to source this
> configuration, including potentially fetching secrets from other sources
> or the environment.
>
> [1]: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.filesystem
> [2]: https://hadoop.apache.org/docs/r3.0.0/api/org/apache/hadoop/fs/FileSystem.html#get-java.net.URI-org.apache.hadoop.conf.Configuration-
> [3]: https://docs.rs/object_store/latest/object_store/fn.parse_url_opts.html
Arrow community meeting April 9 at 17:00 UTC
Our next biweekly Arrow community meeting is today, Wednesday 9 April 17:00 UTC / 12:00 EDT / 9:00 PDT.

Zoom meeting URL: https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/

If you plan to attend this meeting, you are welcome to edit the document to add the topics that you would like to discuss.

Ian