Steve,

You made a great point, and I'm sorry this API was implemented without consideration of other FS implementations. Thank you for your direct feedback.
async -- yes
builder -- yes
cancellable -- totally agree

There are good use cases for this API, though -- Impala and Presto both
perform lots of file system metadata operations, and this API would make
them much more efficient. On top of that, I would also like to have a
batched delete API; HBase could benefit a lot from that.

On Fri, Feb 28, 2020 at 5:48 AM Steve Loughran <ste...@cloudera.com.invalid>
wrote:

> https://issues.apache.org/jira/browse/HDFS-13616
>
> I don't want to be territorial here -- but as I keep reminding this list
> whenever it happens, I do not want any changes to go into the core
> FileSystem class without:
>
> * raising a HADOOP- JIRA
> * involving those of us who work on object stores. We have different
> problems (latencies, failure modes) and want to move to async/completable
> APIs, ideally with builder APIs for future flexibility and per-FS options.
> * semantics specified formally enough that people implementing and using
> them know what they get
> * a specification in filesystem.md
> * contract tests that match the spec and which object stores, as well as
> HDFS, can implement
>
> The change has ~no javadocs and doesn't even state:
> * whether it's recursive or not
> * whether it includes directories or not
>
> batchedListStatusIterator is exactly the kind of feature this should
> apply to -- it is where we get a chance to fix those limitations of the
> previous calls (blocking sync, no expectation of a right to cancel
> listings), ...
>
> I'd like to be able to:
> * provide a hint on batch sizes
> * get an async response, so the fact that the LIST can take time is more
> visible
> * and let us cancel that query if it is taking too long
>
> I'd also like to be able to close an iterator; that is something we
> can/should retrofit, or require all implementations to add.
>
> Completable<RemoteIterator<PartialListing<S extends FileStatus>>> listing =
>     batchList(path)
>         .recursive(true)
>         .opt("fs.option.batchlist.size", 100)
>         .build();
>
> RemoteIterator<PartialListing<FileStatus>> it = listing.get();
>
> FileStatus largeFile = null;
>
> try {
>   while (it.hasNext()) {
>     FileStatus st = it.next();
>     if (st.length() > 1_000_000) {
>       largeFile = st;
>       break;
>     }
>   }
> } finally {
>   if (it instanceof Closeable) {
>     IOUtils.closeQuietly((Closeable) it);
>   }
> }
>
> if (largeFile != null) {
>   processLargeFile(largeFile);
> }
>
> See: something for slower IO, controllable batch sizes, and a way to
> cancel the scan -- so we can recycle the HTTP connection even when
> breaking out early.
>
> This is a recurrent problem, and I am getting as bored of sending these
> emails out as people probably are of receiving them.
>
> Please, please, at least talk to me. Yes, I'm going to add more homework,
> but the goal is to make it something well documented, well testable, and
> straightforward to implement by other implementations, without us having
> to reverse engineer HDFS's behaviour and consider that normative.
>
> What do I do here?
> 1. Do I overreact and revert the change until my needs are met? Because I
> know that if I volunteered to do this work myself it's going to get
> neglected.
> 2. Is someone going to put their hand up to help with this?
>
> At the very least, I'm going to tag the APIs as unstable and likely to
> break, so that anyone who uses them in hadoop-3.3.0 isn't going to be
> upset when they move to a builder API. And they will have to, for the
> object stores.
>
> sorry
>
> steve
>
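To make the shape of that proposal concrete, here is a toy, self-contained
mock-up of the async/builder/closeable listing pattern sketched above. None
of these names (batchList, FileStatus, ClosingIterator, firstLargePath) are
real Hadoop APIs -- they are stand-ins that just demonstrate the
future-returning call plus the close-on-early-exit caller pattern:

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy mock of the builder-style, async, closeable listing sketched above.
// batchList, FileStatus, ClosingIterator are hypothetical stand-ins,
// not real org.apache.hadoop.fs classes.
public class BatchListSketch {

    // Simplified stand-in for FileStatus: just a path and a length.
    record FileStatus(String path, long length) {}

    // An iterator callers can close early, so the FS could recycle
    // its HTTP connection when the caller breaks out of the scan.
    interface ClosingIterator<T> extends Iterator<T>, AutoCloseable {
        @Override void close();
    }

    // Hypothetical async list call: returns a future, so slow LIST
    // operations are visible; batchSize mimics a per-FS builder option.
    static CompletableFuture<ClosingIterator<FileStatus>> batchList(
            List<FileStatus> entries, boolean recursive, int batchSize) {
        return CompletableFuture.supplyAsync(() -> {
            Iterator<FileStatus> inner = entries.iterator();
            return new ClosingIterator<FileStatus>() {
                public boolean hasNext() { return inner.hasNext(); }
                public FileStatus next() { return inner.next(); }
                public void close() { /* recycle connection here */ }
            };
        });
    }

    // The caller pattern from the sketch: scan, break early on the first
    // large file, and always close the iterator (try-with-resources).
    static String firstLargePath(List<FileStatus> entries, long threshold)
            throws Exception {
        try (ClosingIterator<FileStatus> it =
                     batchList(entries, true, 100).get()) {
            while (it.hasNext()) {
                FileStatus st = it.next();
                if (st.length() > threshold) {
                    return st.path();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        List<FileStatus> fake = List.of(
                new FileStatus("/a", 10),
                new FileStatus("/b", 2_000_000),
                new FileStatus("/c", 7));
        System.out.println(firstLargePath(fake, 1_000_000)); // prints "/b"
    }
}
```

Using try-with-resources instead of the instanceof-Closeable check only
works here because the mock iterator is declared AutoCloseable up front --
which is one argument for requiring implementations to support close() from
the start rather than retrofitting it.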