Hi Weston,

Currently there are two filesystem interfaces in pyarrow: a legacy one in
`pyarrow.filesystem` and a new one in `pyarrow.fs` (see
https://issues.apache.org/jira/browse/ARROW-9645 and
https://arrow.apache.org/docs/python/filesystems_deprecated.html; the docs
are still a bit scarce).

Based on your description, I assume you are using the "legacy"
LocalFileSystem.
In the new filesystems, however, I think the feature you are looking for
already exists: it is called "SubTreeFileSystem", and it is created from a
base directory and another filesystem instance.

Best,
Joris


On Tue, 25 Aug 2020 at 23:38, Weston Pace <weston.p...@gmail.com> wrote:

> I created a RelativeFileSystem that extended FileSystem and proxied
> calls to a LocalFileSystem instance.  This filesystem allowed me to
> specify a base directory and then all paths were resolved relative to
> that base directory (so fs.open("foo.parquet") became
> self.target.open("C:\Datadir\foo.parquet")).
>
> However, because it was not a LocalFileSystem instance it was treated
> differently by arrow at:
>
>
> https://github.com/apache/arrow/blob/de8bfddae8704a998d910f2a84bd1e2f7bd934d1/python/pyarrow/parquet.py#L1043
>
> Instead of using a native file reader the open method was called and
> it read from a python file object.  Besides the performance impact I
> also received a "ResourceWarning: unclosed file" when running `read`
> on a dataset piece.
>
> To avoid these warnings I changed RelativeFileSystem to subclass
> LocalFileSystem instead of proxy to it.
>
> Is this the recommended approach for reading local files?  If so I can
> probably add something to the filesystems docs.  Part of the problem
> is that the undesired behavior can be difficult to detect.  Had I not
> been running with warnings on I would not have noticed the
> ResourceWarning or, if that ResourceWarning is patched away, I
> probably would never have noticed it until I realized my performance
> dropped for some reason.
>
