Ok.  I think I have it figured out as:

import pyarrow as pa
import pyarrow.dataset  # makes the pa.dataset submodule available

num_rows = 0
dataset = pa.dataset.dataset(short_files, filesystem=subtree_filesystem)
for fragment in dataset.get_fragments():
    fragment.ensure_complete_metadata()
    if fragment.row_groups:
        for row_group in fragment.row_groups:
            num_rows += row_group.num_rows

On Wed, Aug 26, 2020 at 10:06 AM Weston Pace <weston.p...@gmail.com> wrote:
>
> Thanks Joris / Antoine,
>
> It appears I will have to learn the new datasets API.  I can confirm
> that SubTreeFileSystem is working for me.  In case there is still
> interest, here is the code I had before that reproduces the issue:
> https://gist.github.com/westonpace/4107c1c492cdd78d611595d43e72964d
>
> It looks like the new ParquetDataset (_ParquetDatasetV2) is protected
> and also that `pieces` is deprecated.  I was previously using that for
> filtering pieces based on metadata statistics (it looks like the new
> "filters" feature takes care of this for me) as well as accessing
> piece metadata to count the number of rows in the dataset without
> loading anything other than the metadata.  Do you know off the top of
> your head what would be a good approach to count the rows in that way?
>
> On Wed, Aug 26, 2020 at 4:51 AM Joris Van den Bossche
> <jorisvandenboss...@gmail.com> wrote:
> >
> > Hi Weston,
> >
> > Currently there are two filesystem interfaces in pyarrow: a legacy one in
> > `pyarrow.filesystem` and a new one in `pyarrow.fs` (see
> > https://issues.apache.org/jira/browse/ARROW-9645 and
> > https://arrow.apache.org/docs/python/filesystems_deprecated.html, docs are
> > still a bit scarce).
> >
> > Based on your description, I assume you are using the "legacy"
> > LocalFileSystem.
> > In the new filesystems, however, I think there is already the feature you
> > are looking for, called "SubTreeFileSystem", created from a base directory
> > and another filesystem instance.
> >
> > Best,
> > Joris
> >
> >
> > On Tue, 25 Aug 2020 at 23:38, Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > I created a RelativeFileSystem that extended FileSystem and proxied
> > > calls to a LocalFileSystem instance.  This filesystem allowed me to
> > > specify a base directory and then all paths were resolved relative to
> > > that base directory (so fs.open("foo.parquet") became
> > > self.target.open("C:\Datadir\foo.parquet")).
> > >
> > > However, because it was not a LocalFileSystem instance it was treated
> > > differently by arrow at:
> > >
> > >
> > > https://github.com/apache/arrow/blob/de8bfddae8704a998d910f2a84bd1e2f7bd934d1/python/pyarrow/parquet.py#L1043
> > >
> > > Instead of using a native file reader, the open method was called and
> > > the data was read through a Python file object.  Besides the performance
> > > impact, I also received a "ResourceWarning: unclosed file" when running
> > > `read` on a dataset piece.
> > >
> > > To avoid these warnings I changed RelativeFileSystem to subclass
> > > LocalFileSystem instead of proxy to it.
> > >
> > > Is this the recommended approach for reading local files?  If so I can
> > > probably add something to the filesystems docs.  Part of the problem
> > > is that the undesired behavior can be difficult to detect.  Had I not
> > > been running with warnings enabled, I would not have noticed the
> > > ResourceWarning, or, if that ResourceWarning were patched away, I
> > > probably would never have noticed it until I realized my performance
> > > had dropped for some reason.
> > >
