Thanks Joris / Antoine,

It appears I will have to learn the new datasets API. I can confirm that SubTreeFileSystem is working for me. In case there is still interest, here is the code I had from before that reproduces the issue: https://gist.github.com/westonpace/4107c1c492cdd78d611595d43e72964d

It looks like the new ParquetDataset (_ParquetDatasetV2) is protected and also that `pieces` is deprecated. I was previously using `pieces` for two things: filtering pieces based on metadata statistics (it looks like the new "filters" feature takes care of this for me), and accessing piece metadata to count the number of rows in the dataset without loading anything other than the metadata. Do you know off the top of your head what would be a good approach to counting the rows that way?

On Wed, Aug 26, 2020 at 4:51 AM Joris Van den Bossche <[email protected]> wrote:
>
> Hi Weston,
>
> Currently there are two filesystem interfaces in pyarrow, a legacy one in
> `pyarrow.filesystem` and a new one in `pyarrow.fs` (see
> https://issues.apache.org/jira/browse/ARROW-9645 and
> https://arrow.apache.org/docs/python/filesystems_deprecated.html, docs are
> still a bit scarce).
>
> Based on your description, I assume you are using the "legacy"
> LocalFileSystem.
> In the new filesystems, however, I think there is already the feature you
> are looking for, called "SubTreeFileSystem", created from a base directory
> and other filesystem instance.
>
> Best,
> Joris
>
>
> On Tue, 25 Aug 2020 at 23:38, Weston Pace <[email protected]> wrote:
>
> > I created a RelativeFileSystem that extended FileSystem and proxied
> > calls to a LocalFileSystem instance. This filesystem allowed me to
> > specify a base directory and then all paths were resolved relative to
> > that base directory (so fs.open("foo.parquet") became
> > self.target.open("C:\Datadir\foo.parquet")).
> >
> > However, because it was not a LocalFileSystem instance it was treated
> > differently by arrow at:
> >
> >
> > https://github.com/apache/arrow/blob/de8bfddae8704a998d910f2a84bd1e2f7bd934d1/python/pyarrow/parquet.py#L1043
> >
> > Instead of using a native file reader the open method was called and
> > it read from a python file object. Besides the performance impact I
> > also received a "ResourceWarning: unclosed file" when running `read`
> > on a dataset piece.
> >
> > To avoid these warnings I changed RelativeFileSystem to subclass
> > LocalFileSystem instead of proxy to it.
> >
> > Is this the recommended approach for reading local files? If so I can
> > probably add something to the filesystems docs. Part of the problem
> > is that the undesired behavior can be difficult to detect. Had I not
> > been running with warnings on I would not have noticed the
> > ResourceWarning or, if that ResourceWarning is patched away, I
> > probably would never have noticed it until I realized my performance
> > dropped for some reason.
> >
