RE: Listing large directories via WebHDFS

2016-10-19 Thread Brahma Reddy Battula
If the issue is just "hadoop fs -ls -R /", one thing we can look into is making the Globber use the listStatus API that returns a RemoteIterator rather than a FileStatus[]. That'll use the client-side pagination Xiao mentioned for WebHDFS/HttpFS

Re: Listing large directories via WebHDFS

2016-10-19 Thread Andrew Wang
If the issue is just "hadoop fs -ls -R /", one thing we can look into is making the Globber use the listStatus API that returns a RemoteIterator rather than a FileStatus[]. That'll use the client-side pagination Xiao mentioned for WebHDFS/HttpFS (though this is currently not in a 2.x release).
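
For reference, the pattern Andrew is describing can be sketched with the public FileSystem#listStatusIterator API. This is an illustrative example, not code from the thread; the recursive helper and the choice of the root path are assumptions:

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IterativeLs {
  // Walk a tree one entry at a time via RemoteIterator instead of
  // materializing a full FileStatus[] for each directory.
  static void lsR(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    while (it.hasNext()) {
      FileStatus st = it.next();
      System.out.println(st.getPath());
      if (st.isDirectory()) {
        lsR(fs, st.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    lsR(fs, new Path("/")); // roughly what "hadoop fs -ls -R /" does per directory
  }
}
{code}

Whether the iterator actually pages depends on the underlying FileSystem implementation; for WebHDFS it only helps once the batched listing support Andrew mentions is available.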

Re: Listing large directories via WebHDFS

2016-10-19 Thread Zhe Zhang
Thanks Xiao! It seems server-side throttling is still vulnerable to abusive users issuing large listing requests. Once such a request is scheduled, it will keep listing potentially millions of files without having to go through the IPC/RPC queue again. It does have to compete for the fsn lock, though.
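
To make the concern concrete, here is a simplified, hypothetical sketch of a one-shot server-side listing loop (not the actual NamenodeWebHdfsMethods code; namenode and writeJson are placeholders). A single WebHDFS LISTSTATUS request walks the whole directory before the response completes, so it never re-enters the RPC queue and only re-contends for the namesystem lock per internal batch:

{code}
// Hypothetical handler body, for illustration only.
DirectoryListing batch = namenode.getListing(path, HdfsFileStatus.EMPTY_NAME, false);
writeJson(batch.getPartialListing());
while (batch.hasMore()) {
  // each internal call returns at most dfs.ls.limit entries,
  // but the loop runs to completion inside one client request
  batch = namenode.getListing(path, batch.getLastName(), false);
  writeJson(batch.getPartialListing());
}
{code}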

Re: Listing large directories via WebHDFS

2016-10-19 Thread Xiao Chen
Hi Zhe, per my understanding, the runner in WebHDFS goes to NamenodeWebHdfsMethods,

Listing large directories via WebHDFS

2016-10-19 Thread Zhe Zhang
Hi, the regular HDFS client (DistributedFileSystem) throttles the workload of listing large directories by dividing the work into batches, something like below:
{code}
// fetch the first batch of entries in the directory
DirectoryListing thisListing = dfs.listPaths(
    src, HdfsFileStatus.EMPTY_NAME);
{code}
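
For completeness, the batching then continues with roughly the following shape (a simplified sketch of the DistributedFileSystem pattern, not a verbatim copy of the Hadoop source; the listing accumulator is a placeholder). Each batch is a separate getListing RPC capped at dfs.ls.limit entries, which is the throttling the thread is discussing for WebHDFS:

{code}
List<HdfsFileStatus> listing = new ArrayList<>();
listing.addAll(Arrays.asList(thisListing.getPartialListing()));

// keep fetching batches until the NameNode reports no more entries;
// every iteration goes back through the RPC queue as a new getListing call
while (thisListing.hasMore()) {
  thisListing = dfs.listPaths(src, thisListing.getLastName());
  listing.addAll(Arrays.asList(thisListing.getPartialListing()));
}
{code}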