Hi Steve,

Thanks for the response. Since writing the original email, I've received additional information.
WebHDFS does redirect you to the datanode containing the first block you are requesting. This can be abused to query data-locality information, but it is inefficient.

Since getFileBlockLocations is part of the FileSystem API and is implemented by all FileSystem clients used by Hadoop, I think it should be considered part of the public API. I've also learned that getFileBlockLocations is available in the WebHDFS code -- it's just not part of the documentation. It is used, however, by the WebHDFS client in Hadoop. Thus, the issues seem to be:

1) Why is the functionality private if it is already there?
2) Can we make it public and add it to the docs?
3) If not, I think there is a larger question to be asked about the abstractions used by the FileSystem API, which WebHDFS has mimicked in its own API.

I opened a JIRA (HDFS-6116) describing what I've learned and asking for feedback from the community.

Thank you!

On Wed, Mar 19, 2014 at 5:41 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> 1. All the specifics of Hadoop's operations are hidden in the source.
> That's a get-out clause of OSS, I know, but sometimes it's the clearest.
> 2. For WebHDFS I suspect it picks a local node with the data -- you'd have
> to experiment to make sure.
> 3. If WebHDFS is missing features, I'm sure they'd be welcome.
> 4. Hadoop 2.2+ uses protobuf for stable and cross-platform IPC. The
> listLocatedStatus() call on a filesystem will give you all the locations
> of blocks. Using that is another option -- and probably higher
> performance -- but it is going to require more upfront engineering than
> GET calls. Sticking to WebHDFS -- and extending it -- is probably simpler.
>
> On 17 March 2014 17:29, RJ Nowling <rnowl...@gmail.com> wrote:
>
>> Hi all,
>>
>> I sent an email to user@ but no one there was able to answer my question.
>> I hope you don't mind me emailing hdfs-dev@ about it.
>>
>> I'm submitting a proposal to Google Summer of Code to add support for
>> HDFS to Disco, an Erlang MapReduce system. We're looking at using
>> WebHDFS. As with Hadoop, we need information about the locality of the
>> file blocks so that we can schedule tasks accordingly.
>>
>> WebHDFS does seem to provide some information about data locality. When
>> you make a request for a file to the namenode, you are redirected to the
>> datanode containing the first block of that file.
>>
>> 1) But what happens if you specify an offset in the third block? Are you
>> redirected to the datanode containing that block, or are you still
>> redirected to the datanode containing the file's first block?
>>
>> 2) Is there any reason that WebHDFS does not support requesting the
>> block locations?
>>
>> 3) Would the HDFS community be interested in a patch that a) adds
>> support for reporting block locations and b) enables requesting blocks
>> from the appropriate datanodes (if it does not do so already)? I believe
>> this would be of interest to other projects that are using WebHDFS.
>>
>> Thank you!
>>
>> RJ
>>
>> --
>> em rnowl...@gmail.com
>> c 954.496.2314
>>

--
em rnowl...@gmail.com
c 954.496.2314
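P.S. For anyone who wants to try the redirect trick mentioned above, here is a rough Python sketch: issue the OPEN request to the namenode yourself, don't follow the 307, and read the Location header to see which datanode the namenode picked for the block covering that offset. The host names, path, and port numbers below are illustrative placeholders, not tested against a real cluster.

```python
# Sketch of the redirect trick: ask the namenode to OPEN a file at a given
# offset, but read the 307 redirect's Location header instead of following
# it. The redirect target is the datanode serving the block that covers
# that offset. Host names and paths here are examples only.
import http.client
from urllib.parse import urlsplit

def datanode_for_offset(namenode, path, offset=0):
    """Return the host:port of the datanode the namenode redirects to
    for the block of `path` that contains byte `offset`."""
    conn = http.client.HTTPConnection(namenode)
    try:
        conn.request("GET", "/webhdfs/v1%s?op=OPEN&offset=%d" % (path, offset))
        resp = conn.getresponse()
        if resp.status not in (301, 302, 307):
            raise RuntimeError("expected a redirect, got HTTP %d" % resp.status)
        # The Location header names the datanode chosen by the namenode.
        return urlsplit(resp.getheader("Location")).netloc
    finally:
        conn.close()

# Example (placeholder addresses; 50070/50075 are the Hadoop 2.x defaults):
# datanode_for_offset("namenode.example.com:50070", "/user/rj/data.txt",
#                     offset=3 * 128 * 1024 * 1024)
```

One request per block is exactly why this is inefficient compared to a real getFileBlockLocations call: probing an N-block file costs N round trips to the namenode.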