Hi Martin,

On Wed, 9 May 2018 11:28:15 -0400
Martin Durant <martin.dur...@utoronto.ca> wrote:
> I have sketched out a possible start of a python-wide file-system
> specification
> https://github.com/martindurant/filesystem_spec
>
> This came about from my work in some other (remote) file-systems
> implementations for python, particularly in the context of Dask. Since arrow
> also cares about both local files and, for example, hdfs, I thought that
> people on this list may have comments and opinions about a possible standard
> that we ought to converge on. I do not think that my suggestions so far are
> necessarily right or even good in many cases, but I want to get the
> conversation going.
Here are some comments:

- API naming: you seem to favour re-using Unix command-line monikers in
  some places, while using more regular verbs or names in other places.
  I think it should be consistent. Since the Unix command-line doesn't
  exactly cover the exposed functionality, and since Unix tends to favour
  short cryptic names, I think it's better to use Python-like naming
  (which is also more familiar to non-Unix users). For example "move" or
  "rename" or "replace" instead of "mv", etc.
- **kwargs parameters: a couple of APIs (`mkdir`, `put`...) allow passing
  arbitrary parameters, which I assume are intended to be
  backend-specific. This makes it difficult to add other optional
  parameters to those APIs in the future. So I'd make the
  backend-specific directives a single (optional) dict parameter rather
  than a **kwargs.
- `invalidate_cache` doesn't state whether it invalidates recursively or
  not (recursively sounds better intuitively?). Also, I think it would be
  more flexible to take a list of paths rather than a single path.
- `du`: the effect of the `deep` parameter isn't obvious to me. I don't
  know what it would mean *not* to recurse here: what is the size of a
  directory if you don't recurse into it?
- `glob` may need a formal definition (are trailing slashes significant
  for directory or symlink resolution? this kind of thing), though you
  may want to keep edge cases backend-specific.
- are `head` and `tail` at all useful? They can be easily recreated using
  a generic `open` facility.
- `read_block` tries to do too much in a single API IMHO, and using
  `open` directly is more flexible anyway.
- if `touch` is intended to emulate the Unix API of the same name, the
  docstring should state "Create empty file or update last modification
  timestamp".
- the information dicts returned by several APIs (`ls`, `info`...) need
  standardizing, at least for non backend-specific fields.
- if the backend is a networked filesystem with non-trivial latency,
  perhaps the operations deserve to be batched (operate on several paths
  at once), though I will happily defer to your expertise on the topic.

Regards

Antoine.
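P.S. To make the **kwargs point above concrete, here is a minimal sketch of the signature style being suggested. Everything in it is hypothetical and for illustration only: the class `SketchFileSystem`, the parameter names `create_parents` and `backend_options`, and the `acl` option are assumptions, not names taken from the spec.

```python
from typing import Any, Mapping, Optional


class SketchFileSystem:
    """Hypothetical backend, used only to illustrate the signature style."""

    def __init__(self) -> None:
        # Records what each mkdir call received, so the effect is visible.
        self.calls = []

    def mkdir(
        self,
        path: str,
        create_parents: bool = True,
        backend_options: Optional[Mapping[str, Any]] = None,
    ) -> None:
        # Backend-specific directives arrive in a single dict, so generic
        # keyword parameters can be added later without clashing with them.
        options = dict(backend_options or {})
        self.calls.append((path, create_parents, options))


fs = SketchFileSystem()
# "acl" is a made-up backend-specific option.
fs.mkdir("/data/new", backend_options={"acl": "private"})
```

With this shape, a misspelled or unsupported generic keyword raises a `TypeError` instead of being silently swallowed by **kwargs, and the set of generic parameters can grow without colliding with backend option names.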