Ok.  I think I have it figured out as:

    num_rows = 0
    dataset = pa.dataset.dataset(short_files, filesystem=subtree_filesystem)
    for fragment in dataset.get_fragments():
        fragment.ensure_complete_metadata()
        if fragment.row_groups:
            for row_group in fragment.row_groups:
                num_rows += row_group.num_rows
On Wed, Aug 26, 2020 at 10:06 AM Weston Pace <weston.p...@gmail.com> wrote:
>
> Thanks Joris / Antoine,
>
> It appears I will have to learn the new datasets API.  I can confirm
> that SubTreeFileSystem is working for me.  In case there is still
> interest, here is the code I had from before reproducing the issue:
> https://gist.github.com/westonpace/4107c1c492cdd78d611595d43e72964d
>
> It looks like the new ParquetDataset (_ParquetDatasetV2) is protected
> and also that `pieces` is deprecated.  I was previously using that for
> filtering pieces based on metadata statistics (it looks like the new
> "filters" feature takes care of this for me) as well as for accessing
> piece metadata to count the number of rows in the dataset without
> loading anything other than the metadata.  Do you know off the top of
> your head what would be a good approach to count the rows in that way?
>
> On Wed, Aug 26, 2020 at 4:51 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > Hi Weston,
> >
> > Currently there are two filesystem interfaces in pyarrow: a legacy one in
> > `pyarrow.filesystem` and a new one in `pyarrow.fs` (see
> > https://issues.apache.org/jira/browse/ARROW-9645 and
> > https://arrow.apache.org/docs/python/filesystems_deprecated.html; docs are
> > still a bit scarce).
> >
> > Based on your description, I assume you are using the "legacy"
> > LocalFileSystem.
> > In the new filesystems, however, I think there is already the feature you
> > are looking for, called "SubTreeFileSystem", created from a base directory
> > and another filesystem instance.
> >
> > Best,
> > Joris
> >
> >
> > On Tue, 25 Aug 2020 at 23:38, Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > I created a RelativeFileSystem that extended FileSystem and proxied
> > > calls to a LocalFileSystem instance.
> > > This filesystem allowed me to
> > > specify a base directory, and then all paths were resolved relative to
> > > that base directory (so fs.open("foo.parquet") became
> > > self.target.open("C:\Datadir\foo.parquet")).
> > >
> > > However, because it was not a LocalFileSystem instance, it was treated
> > > differently by arrow at:
> > >
> > > https://github.com/apache/arrow/blob/de8bfddae8704a998d910f2a84bd1e2f7bd934d1/python/pyarrow/parquet.py#L1043
> > >
> > > Instead of using a native file reader, the open method was called and
> > > it read from a Python file object.  Besides the performance impact, I
> > > also received a "ResourceWarning: unclosed file" when running `read`
> > > on a dataset piece.
> > >
> > > To avoid these warnings I changed RelativeFileSystem to subclass
> > > LocalFileSystem instead of proxying to it.
> > >
> > > Is this the recommended approach for reading local files?  If so, I can
> > > probably add something to the filesystems docs.  Part of the problem
> > > is that the undesired behavior can be difficult to detect.  Had I not
> > > been running with warnings on, I would not have noticed the
> > > ResourceWarning or, if that ResourceWarning were patched away, I
> > > probably would never have noticed it until I realized my performance
> > > had dropped for some reason.