Re: [C++][Python] Connection caching for File Systems

Weston Pace Mon, 14 Feb 2022 07:54:57 -0800

Glancing through the S3 implementation (sadly, Amazon's documentation
seems rather lacking on this issue), I think there is not much
per-filesystem state or initialization.

The HTTP connection pool appears to be statically instantiated [1]
when we initialize S3 [2].  It's possible there are some per-filesystem
configuration lookups (the AWS SDK does some filesystem access to look
up configuration values in files like ~/.aws/config), but if I'm
reading the source correctly, the config is cached, so it is only
read from disk the first time.

If you do not specify a region when you create the filesystem, then we
have to figure out the region ourselves, which does involve making an
HTTP call to the service.  So in that case there is some minor cost to
creating a filesystem.
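
If that lookup cost matters, one option is to pass the region
explicitly and reuse the filesystem object instead of recreating it.
Below is a rough sketch of a process-wide cache keyed by region.
FileSystemCache and the factory argument are hypothetical names made
up for illustration; they are not part of Arrow.

```python
import threading
from typing import Any, Callable, Dict


class FileSystemCache:
    """Hypothetical helper: keeps one filesystem object per region."""

    def __init__(self, factory: Callable[[str], Any]) -> None:
        # factory builds a filesystem for a given region (assumption:
        # the caller supplies it; nothing here talks to S3 directly).
        self._factory = factory
        self._cache: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, region: str) -> Any:
        # The lock ensures the factory runs at most once per region,
        # even when several threads ask at the same time.
        with self._lock:
            fs = self._cache.get(region)
            if fs is None:
                fs = self._factory(region)
                self._cache[region] = fs
            return fs
```

With pyarrow the factory could be something like
lambda region: pyarrow.fs.S3FileSystem(region=region), so the region
is always given up front and no lookup call should be needed.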

[1] 
https://github.com/aws/aws-sdk-cpp/blob/bb1fdce01cc7e8ae2fe7162f24c8836e9d3ab0a2/aws-cpp-sdk-core/source/http/HttpClientFactory.cpp#L39
[2] 
https://github.com/apache/arrow/blob/699449f2f5fe36938191d771f321ec15d3fd3331/cpp/src/arrow/filesystem/s3fs.cc#L156

On Sun, Feb 13, 2022 at 12:50 AM Antoine Pitrou <[email protected]> wrote:
>
>
> Hi Micah,
>
> Le 12/02/2022 à 19:32, Micah Kornfield a écrit :
> > Hi Arrow Dev,
> > For filesystems that contact remote services is there any sort of
> > connection caching to avoid the cost of reconnecting each time?  If so, at
> > what scope does the cache live (e.g. each FileSystem object, or process
> > global)?
>
> We merely defer to the vendor-specific library such as the AWS SDK for
> C++ or the GCS C++ library. That said, it sounds reasonable to expect
> some caching at least at the FileSystem object level.
>
> Regards
>
> Antoine.
