Ok. I think I have it figured out as:

import pyarrow as pa
import pyarrow.dataset

num_rows = 0
dataset = pa.dataset.dataset(short_files, filesystem=subtree_filesystem)
for fragment in dataset.get_fragments():
    fragment.ensure_complete_metadata()
    if fragment.row_groups:
        for row_group in fragment.row_groups:
            num_rows += row_group.num_rows
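
For context, `short_files` and `subtree_filesystem` are not defined in the
snippet above. A minimal sketch of how they might be set up with the new
`pyarrow.fs` API; the base directory and file names here are assumptions:

import pyarrow.fs

# Assumed setup: wrap the local filesystem so that relative paths
# resolve against a chosen base directory.
local_fs = pyarrow.fs.LocalFileSystem()
subtree_filesystem = pyarrow.fs.SubTreeFileSystem("C:/Datadir", local_fs)
short_files = ["foo.parquet", "bar.parquet"]  # hypothetical file names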
Thanks Joris / Antoine,
It appears I will have to learn the new datasets API. I can confirm
that SubTreeFileSystem is working for me. In case there is still
interest, here is the code I had from before that reproduces the issue:
https://gist.github.com/westonpace/4107c1c492cdd78d611595d43e72964d
Hi Weston,
Currently there are two filesystem interfaces in pyarrow: a legacy one in
`pyarrow.filesystem` and a new one in `pyarrow.fs` (see
https://issues.apache.org/jira/browse/ARROW-9645 and
https://arrow.apache.org/docs/python/filesystems_deprecated.html, docs are
still a bit scarce).
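A minimal sketch contrasting the two interfaces, assuming a pyarrow
version where both are still importable (the legacy module has since been
deprecated):

from pyarrow import filesystem as legacy_fs  # legacy interface
from pyarrow import fs as new_fs             # new interface

# Legacy: pyarrow.filesystem.LocalFileSystem
old_fs = legacy_fs.LocalFileSystem()
print(old_fs.exists("/tmp"))

# New: pyarrow.fs.LocalFileSystem
local = new_fs.LocalFileSystem()
print(local.get_file_info(["/tmp"])[0].type)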
Hi Weston,
Can you show the code for your experiment?
(or post equivalent code)
Regards
Antoine.
On 25/08/2020 at 23:38, Weston Pace wrote:
> I created a RelativeFileSystem that extended FileSystem and proxied
> calls to a LocalFileSystem instance. This filesystem allowed me to
> specify
Actually my workaround (extending LocalFileSystem) does not work since
`open` is never called in this case and the path is not normalized to
the base directory.
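
For illustration, that workaround might have looked like the sketch
below; the class name and body are assumptions based on the description
above:

from pyarrow.filesystem import LocalFileSystem

class BaseDirFileSystem(LocalFileSystem):  # hypothetical name
    def __init__(self, base_dir):
        super().__init__()
        self.base_dir = base_dir

    def open(self, path, mode='rb'):
        # As observed above, this override is never reached when the
        # parquet reader opens files, so the base directory is never
        # applied to the path.
        return super().open(self.base_dir + '/' + path, mode)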
On Tue, Aug 25, 2020 at 11:38 AM Weston Pace wrote:
>
> I created a RelativeFileSystem that extended FileSystem and proxied
> calls to a
I created a RelativeFileSystem that extended FileSystem and proxied
calls to a LocalFileSystem instance. This filesystem allowed me to
specify a base directory and then all paths were resolved relative to
that base directory (so fs.open("foo.parquet") became
self.target.open("C:\Datadir\foo.parquet")).
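
A minimal sketch of the proxy described above, against the legacy
pyarrow.filesystem API; the method bodies are assumptions, and only a
few of the proxied calls are shown:

import os
from pyarrow.filesystem import FileSystem, LocalFileSystem

class RelativeFileSystem(FileSystem):
    def __init__(self, base_dir):
        self.base_dir = base_dir
        self.target = LocalFileSystem()

    def _resolve(self, path):
        # Resolve every incoming path against the base directory.
        return os.path.join(self.base_dir, path)

    def open(self, path, mode='rb'):
        return self.target.open(self._resolve(path), mode)

    def isdir(self, path):
        return self.target.isdir(self._resolve(path))

    def isfile(self, path):
        return self.target.isfile(self._resolve(path))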