I created a RelativeFileSystem that extended FileSystem and proxied
calls to a LocalFileSystem instance.  This filesystem allowed me to
specify a base directory and then all paths were resolved relative to
that base directory (so fs.open("foo.parquet") became
self.target.open("C:\Datadir\foo.parquet")).
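
A minimal sketch of the proxying idea, using only the standard
library. LocalTarget is a hypothetical stand-in for the wrapped
LocalFileSystem instance, and this RelativeFileSystem does not extend
pyarrow's FileSystem, so it only illustrates the path resolution:

```python
import os
import tempfile


class LocalTarget:
    """Hypothetical stand-in for the wrapped local filesystem."""

    def open(self, path, mode="rb"):
        return open(path, mode)


class RelativeFileSystem:
    """Resolve every path against base_dir, then delegate to target."""

    def __init__(self, base_dir, target=None):
        self.base_dir = base_dir
        self.target = target if target is not None else LocalTarget()

    def _resolve(self, path):
        # os.path.join discards base_dir when path is absolute, so a
        # real implementation may want to reject absolute paths here.
        return os.path.join(self.base_dir, path)

    def open(self, path, mode="rb"):
        return self.target.open(self._resolve(path), mode)


def demo():
    # Write a small file into a temporary base directory and read it
    # back through the relative filesystem.
    with tempfile.TemporaryDirectory() as base_dir:
        with open(os.path.join(base_dir, "foo.parquet"), "wb") as f:
            f.write(b"PAR1")
        fs = RelativeFileSystem(base_dir)
        with fs.open("foo.parquet") as f:
            return f.read()
```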

However, because it was not a LocalFileSystem instance, it was treated
differently by Arrow at:

https://github.com/apache/arrow/blob/de8bfddae8704a998d910f2a84bd1e2f7bd934d1/python/pyarrow/parquet.py#L1043

Instead of using a native file reader, the open method was called and
the data was read from a Python file object.  Besides the performance
impact, I also received a "ResourceWarning: unclosed file" when running
`read` on a dataset piece.

To avoid these warnings I changed RelativeFileSystem to subclass
LocalFileSystem instead of proxying to it.
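
The difference between the two designs can be modelled with a short
sketch.  Everything here is a stand-in: the FileSystem and
LocalFileSystem classes are local dummies rather than pyarrow's, and
reader_kind is a hypothetical, simplified model that assumes the
dispatch is an isinstance-style check.  It is not the code at the link
above.

```python
import os


class FileSystem:
    def open(self, path, mode="rb"):
        raise NotImplementedError


class LocalFileSystem(FileSystem):
    def open(self, path, mode="rb"):
        return open(path, mode)


class ProxyRelativeFileSystem(FileSystem):
    """Wraps a LocalFileSystem but is not one itself."""

    def __init__(self, base_dir):
        self.base_dir = base_dir
        self.target = LocalFileSystem()

    def open(self, path, mode="rb"):
        return self.target.open(os.path.join(self.base_dir, path), mode)


class SubclassRelativeFileSystem(LocalFileSystem):
    """Is a LocalFileSystem, so isinstance checks succeed."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def open(self, path, mode="rb"):
        return super().open(os.path.join(self.base_dir, path), mode)


def reader_kind(fs):
    # Assumed dispatch: local filesystems hand the path to the native
    # reader without calling fs.open(); anything else goes through
    # fs.open() and a Python file object.
    if fs is None or isinstance(fs, LocalFileSystem):
        return "native reader"
    return "python file object"
```

In this model the proxy lands on the slower branch and the subclass on
the native one, which matches the behavior described above.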

Is this the recommended approach for reading local files?  If so, I
can probably add something to the filesystems docs.  Part of the
problem is that the undesired behavior can be difficult to detect.
Had I not been running with warnings on, I would not have noticed the
ResourceWarning.  And if that ResourceWarning is ever patched away, I
probably would not have noticed anything until I realized my
performance had dropped for some reason.
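
For anyone who wants to check for the same symptom, here is a small
standard-library sketch of surfacing ResourceWarning, which Python
ignores by default.  The leak_file helper is hypothetical and simply
drops an unclosed file object; it does not involve pyarrow.

```python
import gc
import os
import tempfile
import warnings


def leak_file(path):
    # Open a file and drop the last reference without closing it.
    f = open(path, "rb")
    del f
    gc.collect()


def count_resource_warnings(path):
    with warnings.catch_warnings(record=True) as caught:
        # ResourceWarning is ignored by default; "always" makes it
        # visible so it can be recorded.
        warnings.simplefilter("always", ResourceWarning)
        leak_file(path)
    return sum(issubclass(w.category, ResourceWarning) for w in caught)


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        return count_resource_warnings(path)
    finally:
        os.remove(path)
```

Running the interpreter with `-W always::ResourceWarning` should make
the same warnings visible without changing any code.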
