Thanks for sharing! It's cool to see the new PyFileSystem directly being used ;)
Note that there is also an fsspec-compatible Azure filesystem implementation
that should support Data Lake Gen2 (https://github.com/dask/adlfs), as
another Python-based implementation, and which can be used with pyarrow
(similarly using PyFileSystem, with the built-in FSSpecHandler).
(Now, I am not familiar with that package / Azure, so I can't judge any
potential differences.)

Some other answers below inline:

On Thu, 3 Sep 2020 at 10:45, Robin Kåveland Hansen <kaavel...@gmail.com>
wrote:

> Hi,
>
> We use Azure Data Lake gen2 heavily at work, and with 1.0 including
> pyarrow.fs.PyFileSystem it wasn't that hard to add filesystem support
> for it. My employer was happy to let me release it, so I'm getting it
> out there.
>
> For now, I published to a pypi package:
> https://pypi.org/project/pyarrowfs-adlgen2/
>
> If you're not familiar with Azure Data Lake gen2, it's essentially the
> same thing as S3 or Azure Blob Storage, but with real file system
> support, meaning operations such as renaming directories or listing
> directory contents are practically instant. Directory renames are
> atomic, unlike with blob storage, where if some blob rename operations
> fail, you may be left with only some files being "moved".
>
> If there's any interest in including this into pyarrow, I'd be happy to
> take on some work to do that to make it fit there, but I'm also OK
> maintaining this myself.
>

We are certainly interested in Azure support (there are open issues for Blob
Storage (https://issues.apache.org/jira/browse/ARROW-2034) and Data Lake
(https://issues.apache.org/jira/browse/ARROW-9611)). But I *think* that if we
add it to pyarrow, we would prefer a C++ implementation, so it can also be
used in the other bindings (e.g. the R bindings).

> I couldn't get this working well with writing datasets, but I think that
> there's work in progress on pyarrow.fs being supported everywhere
> in the parquet codebase?
>

Indeed, in the released version, only reading works.
In the meantime, writing is also starting to be supported in the development
version (the master branch): general support for writing datasets was merged
in https://github.com/apache/arrow/pull/7921/, and there are open PRs to
further integrate it in the pyarrow.parquet module.

Best,
Joris

>
> --
> Kind regards,
> Robin Kåveland
>
>