Hi Jeff,

Do you need to subclass, or could you simply wrap? In general, composition rather than inheritance is a much safer way of integrating software written by different parties, since inheritance exposes implementation details that are subject to change.
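For example, a wrapper along these lines would give you the same per-read hook without depending on DFSClient internals. This is only a rough sketch; FallbackInputStream and readFromRemoteSite are placeholder names, not existing Hadoop APIs or anything from your patch:

  import java.io.IOException;
  import java.io.InputStream;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.PositionedReadable;
  import org.apache.hadoop.fs.Seekable;

  /**
   * Decorates the stream returned by DistributedFileSystem.open() and falls
   * back to an external fetch when a read fails, instead of subclassing
   * DFSInputStream.
   */
  public class FallbackInputStream extends InputStream
      implements Seekable, PositionedReadable {

    private final FSDataInputStream in;  // stream from the wrapped filesystem
    private final Path path;             // file being read, for the fallback

    public FallbackInputStream(FSDataInputStream in, Path path) {
      this.in = in;
      this.path = path;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      try {
        return in.read(buf, off, len);
      } catch (IOException e) {
        // Placeholder hook: retrieve the missing range from another site.
        return readFromRemoteSite(path, in.getPos(), buf, off, len);
      }
    }

    @Override
    public int read() throws IOException {
      byte[] one = new byte[1];
      return read(one, 0, 1) == -1 ? -1 : (one[0] & 0xff);
    }

    // Plain delegation for the rest of the stream contract.
    @Override public void seek(long pos) throws IOException { in.seek(pos); }
    @Override public long getPos() throws IOException { return in.getPos(); }
    @Override public boolean seekToNewSource(long target) throws IOException {
      return in.seekToNewSource(target);
    }
    @Override public int read(long pos, byte[] buf, int off, int len) throws IOException {
      return in.read(pos, buf, off, len);
    }
    @Override public void readFully(long pos, byte[] buf, int off, int len) throws IOException {
      in.readFully(pos, buf, off, len);
    }
    @Override public void readFully(long pos, byte[] buf) throws IOException {
      in.readFully(pos, buf);
    }
    @Override public void close() throws IOException { in.close(); }

    // Stand-in for your external interface; not a real API.
    private int readFromRemoteSite(Path p, long pos, byte[] buf, int off, int len)
        throws IOException {
      throw new IOException("external fetch not implemented in this sketch");
    }
  }

You could then register a FilterFileSystem subclass under your foofs scheme whose open() returns new FSDataInputStream(new FallbackInputStream(fs.open(f, bufferSize), f)), and no HDFS access modifiers need to change.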
-Todd

On Wed, Aug 7, 2013 at 10:59 AM, Jeff Dost <jd...@ucsd.edu> wrote:
> Hello,
>
> We work in a software development team at the UCSD CMS Tier2 Center. We
> would like to propose a mechanism to allow one to subclass the
> DFSInputStream in a clean way from an external package. First I'd like to
> give some motivation on why, and then I will proceed with the details.
>
> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at
> CERN. There are other T2 centers worldwide that contain mirrors of the
> same data we host. We are working on an extension to Hadoop so that, on
> reading a file, if it is found that there are no available replicas of a
> block, we use an external interface to retrieve this block of the file
> from another data center. The external interface is necessary because not
> all T2 centers involved in CMS are running a Hadoop cluster as their
> storage backend.
>
> In order to implement this functionality, we need to subclass the
> DFSInputStream and override the read method, so we can catch IOExceptions
> that occur on client reads at the block level.
>
> The basic steps required:
> 1. Invent a new URI scheme for the customized "FileSystem" in
> core-site.xml:
>   <property>
>     <name>fs.foofs.impl</name>
>     <value>my.package.FooFileSystem</value>
>     <description>My Extended FileSystem for foofs: uris.</description>
>   </property>
>
> 2. Write new classes included in the external package that subclass the
> following:
>   FooFileSystem subclasses DistributedFileSystem
>   FooFSClient subclasses DFSClient
>   FooFSInputStream subclasses DFSInputStream
>
> Now any client commands that explicitly use the foofs:// scheme in paths
> to access the Hadoop cluster can open files with a customized InputStream
> that extends the functionality of the default Hadoop client
> DFSInputStream. In order to make this happen for our use case, we had to
> change some access modifiers in the DistributedFileSystem, DFSClient, and
> DFSInputStream classes provided by Hadoop. In addition, we had to comment
> out the check in the namenode code that only allows for URI schemes of the
> form "hdfs://".
>
> Attached is a patch file we apply to Hadoop. Note that we derived this
> patch by modifying the Cloudera release hadoop-2.0.0-cdh4.1.1, which can
> be found at:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz
>
> We would greatly appreciate any advice on whether or not this approach
> sounds reasonable, and if you would consider accepting these modifications
> into the official Hadoop code base.
>
> Thank you,
> Jeff, Alja & Matevz
> UCSD Physics

--
Todd Lipcon
Software Engineer, Cloudera
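A minimal sketch of how a client picks up the foofs:// scheme from step 1, assuming the fs.foofs.impl property above is in core-site.xml and my.package.FooFileSystem is on the classpath (the host, port, and path are placeholders):

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FooFsReadExample {
    public static void main(String[] args) throws Exception {
      // Loads core-site.xml, including the fs.foofs.impl mapping.
      Configuration conf = new Configuration();
      // The foofs scheme resolves to my.package.FooFileSystem.
      FileSystem fs = FileSystem.get(URI.create("foofs://namenode:8020/"), conf);
      // open() returns a stream backed by the customized FooFSInputStream,
      // so a failed block read can fall back to the external interface.
      FSDataInputStream in = fs.open(new Path("/user/cms/some-dataset-file"));
      byte[] buf = new byte[4096];
      int n = in.read(buf, 0, buf.length);
      System.out.println("read " + n + " bytes");
      in.close();
      fs.close();
    }
  }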