Glancing through the S3 implementation (sadly, Amazon's documentation seems rather lacking on this issue), I think there is not much per-filesystem state or initialization.
The HTTP connection pool appears to be statically instantiated [1] when we initialize S3 [2]. It's possible there are some per-filesystem configuration lookups (the AWS SDK does some filesystem access to look up configuration values in files like ~/.aws/config), but if I'm reading the source correctly the config is cached, so it is only read from disk the first time.

If you do not specify a region when you create the filesystem, then we have to figure out the region ourselves, which does involve making an HTTP call to the service. So in that case there is some minor cost to creating a filesystem.

[1] https://github.com/aws/aws-sdk-cpp/blob/bb1fdce01cc7e8ae2fe7162f24c8836e9d3ab0a2/aws-cpp-sdk-core/source/http/HttpClientFactory.cpp#L39
[2] https://github.com/apache/arrow/blob/699449f2f5fe36938191d771f321ec15d3fd3331/cpp/src/arrow/filesystem/s3fs.cc#L156

On Sun, Feb 13, 2022 at 12:50 AM Antoine Pitrou <[email protected]> wrote:
>
>
> Hi Micah,
>
> On 12/02/2022 at 19:32, Micah Kornfield wrote:
> > Hi Arrow Dev,
> > For filesystems that contact remote services is there any sort of
> > connection caching to avoid the cost of reconnecting each time? If so, at
> > what scope does the cache live (e.g. each FileSystem object, or process
> > global)?
>
> We merely defer to the vendor-specific library such as the AWS SDK for
> C++ or the GCS C++ library. That said, it sounds reasonable to expect
> some caching at least at the FileSystem object level.
>
> Regards
>
> Antoine.
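
To make the cost model described in my reply concrete, here is a toy Python sketch of the three pieces: a process-wide connection pool, a config lookup that is cached after the first read, and a region lookup that only costs an HTTP call when no region is given. This is not Arrow or AWS SDK code. Every name in it (ToyS3FileSystem, load_config, resolve_region, the bucket argument, the "us-east-1" answer) is hypothetical, and the disk read and HTTP call are simulated by appending to lists.

```python
import functools

# Toy model only: none of these names come from Arrow or the AWS SDK.
disk_reads = []   # records simulated config-file reads
http_calls = []   # records simulated network round trips


class ConnectionPool:
    """Stand-in for a process-wide HTTP connection pool."""


# Created once at import time and shared by every filesystem object.
_POOL = ConnectionPool()


@functools.lru_cache(maxsize=None)
def load_config(path="~/.aws/config"):
    # Simulated disk read; lru_cache means it only happens on the first call.
    disk_reads.append(path)
    return {"profile": "default"}


def resolve_region(bucket):
    # Simulated HTTP round trip asking the service for the bucket's region.
    http_calls.append(bucket)
    return "us-east-1"  # hypothetical answer


class ToyS3FileSystem:
    def __init__(self, bucket, region=None):
        self.pool = _POOL              # shared, no per-filesystem setup
        self.config = load_config()    # cached after the first filesystem
        # Only pay for the lookup when the caller did not supply a region.
        self.region = region if region is not None else resolve_region(bucket)


fs_a = ToyS3FileSystem("bucket-a", region="eu-west-1")  # no simulated HTTP call
fs_b = ToyS3FileSystem("bucket-b")                       # one simulated HTTP call
```

In this model the only per-filesystem cost is the region lookup for fs_b. With the real library, the equivalent way to avoid it would be to pass the region explicitly when constructing the filesystem, for example the region argument of pyarrow.fs.S3FileSystem.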
