Hi Jeff,

Do you need to subclass, or could you simply wrap? Generally, composition as
opposed to inheritance is a much safer way of integrating software written
by different parties, since inheritance exposes all the implementation
details, which are subject to change.
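
For instance, a wrapper over the public FileSystem stream API could look
roughly like this (just a sketch: FallbackInputStream and fetchFromRemote
are made-up names, and the actual remote-fetch logic is elided):

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSInputStream;

  // Wraps an already-open HDFS stream using only public, stable APIs, so
  // nothing in the external package depends on DFSClient internals.
  class FallbackInputStream extends FSInputStream {
    private final FSDataInputStream in;

    FallbackInputStream(FSDataInputStream in) {
      this.in = in;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      try {
        return in.read(buf, off, len);
      } catch (IOException e) {
        // The local read failed (e.g. no live replica); fall back to
        // fetching the same byte range from another site.
        return fetchFromRemote(in.getPos(), buf, off, len);
      }
    }

    @Override
    public int read() throws IOException {
      byte[] b = new byte[1];
      int n = read(b, 0, 1);
      return (n <= 0) ? -1 : (b[0] & 0xff);
    }

    @Override
    public void seek(long pos) throws IOException { in.seek(pos); }

    @Override
    public long getPos() throws IOException { return in.getPos(); }

    @Override
    public boolean seekToNewSource(long targetPos) throws IOException {
      return in.seekToNewSource(targetPos);
    }

    // Hypothetical hook for the site-specific transfer mechanism.
    private int fetchFromRemote(long pos, byte[] buf, int off, int len)
        throws IOException {
      throw new IOException("remote fetch not implemented in this sketch");
    }
  }

Since the wrapper only touches the public read/seek contract, it should
keep compiling across Hadoop upgrades.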

-Todd

On Wed, Aug 7, 2013 at 10:59 AM, Jeff Dost <jd...@ucsd.edu> wrote:

> Hello,
>
> We work in a software development team at the UCSD CMS Tier2 Center.  We
> would like to propose a mechanism to allow one to subclass the
> DFSInputStream in a clean way from an external package.  First I'd like to
> give some motivation on why and then will proceed with the details.
>
> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at
> CERN.  There are other T2 centers worldwide that contain mirrors of the
> same data we host.  We are working on an extension to Hadoop so that, when
> a client reads a file and finds no available replicas of a block, it uses
> an external interface to retrieve that block of the file from another data
> center.  The external interface is necessary because not all
> T2 centers involved in CMS are running a Hadoop cluster as their storage
> backend.
>
> In order to implement this functionality, we need to subclass the
> DFSInputStream and override the read method, so we can catch IOExceptions
> that occur on client reads at the block level.
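>
> A minimal sketch of such an override (the names FooFSInputStream and
> readViaExternalInterface are ours, anticipating step 2 below; the
> constructor and read signatures follow the hadoop-2.0.0-cdh4.1.1 code and
> assume the relaxed access modifiers from the attached patch):
>
>   import java.io.IOException;
>
>   import org.apache.hadoop.hdfs.DFSClient;
>   import org.apache.hadoop.hdfs.DFSInputStream;
>
>   public class FooFSInputStream extends DFSInputStream {
>     public FooFSInputStream(DFSClient client, String src, int buffersize,
>         boolean verifyChecksum) throws IOException {
>       super(client, src, buffersize, verifyChecksum);
>     }
>
>     @Override
>     public synchronized int read(byte[] buf, int off, int len)
>         throws IOException {
>       try {
>         return super.read(buf, off, len);
>       } catch (IOException e) {
>         // No replica of the current block could be read; retrieve the
>         // byte range through the external interface instead.
>         return readViaExternalInterface(getPos(), buf, off, len);
>       }
>     }
>
>     // Hypothetical hook for the inter-site transfer.
>     private int readViaExternalInterface(long pos, byte[] buf, int off,
>         int len) throws IOException {
>       throw new IOException("external fetch elided in this sketch");
>     }
>   }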
>
> The basic steps required:
> 1. Invent a new URI scheme for the customized "FileSystem" in
> core-site.xml:
>   <property>
>     <name>fs.foofs.impl</name>
>     <value>my.package.FooFileSystem</value>
>     <description>My Extended FileSystem for foofs: uris.</description>
>   </property>
>
> 2. Write new classes included in the external package that subclass the
> following:
> FooFileSystem subclasses DistributedFileSystem
> FooFSClient subclasses DFSClient
> FooFSInputStream subclasses DFSInputStream
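>
> The glue between the three classes can look roughly like this (a sketch
> against hadoop-2.0.0-cdh4.1.1; it assumes the attached patch makes the
> DistributedFileSystem.dfs field and the DFSClient.open method available
> to subclasses):
>
>   import java.io.IOException;
>   import java.net.URI;
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.hdfs.DFSClient;
>   import org.apache.hadoop.hdfs.DFSInputStream;
>   import org.apache.hadoop.hdfs.DistributedFileSystem;
>
>   public class FooFileSystem extends DistributedFileSystem {
>     @Override
>     public void initialize(URI uri, Configuration conf) throws IOException {
>       super.initialize(uri, conf);
>       // Swap in our client; closing the DFSClient that
>       // super.initialize() created is elided here.
>       this.dfs = new FooFSClient(uri, conf, statistics);
>     }
>   }
>
>   // (In a separate source file.)
>   public class FooFSClient extends DFSClient {
>     public FooFSClient(URI nameNodeUri, Configuration conf,
>         FileSystem.Statistics stats) throws IOException {
>       super(nameNodeUri, conf, stats);
>     }
>
>     @Override
>     public DFSInputStream open(String src, int buffersize,
>         boolean verifyChecksum) throws IOException {
>       // Every stream handed out by this client is our subclass.
>       return new FooFSInputStream(this, src, buffersize, verifyChecksum);
>     }
>   }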
>
> Now any client commands that explicitly use the foofs:// scheme in paths
> to access the hadoop cluster can open files with a customized InputStream
> that extends the functionality of the default hadoop client DFSInputStream.  In
> order to make this happen for our use case, we had to change some access
> modifiers in the DistributedFileSystem, DFSClient, and DFSInputStream
> classes provided by Hadoop.  In addition, we had to comment out the check
> in the namenode code that only allows for URI schemes of the form "hdfs://".
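>
> With the pieces above in place, a read such as (hostname and path are
> placeholders)
>
>   hadoop fs -cat foofs://namenode.example.org:8020/some/path
>
> is served by FooFSInputStream and picks up the external-fetch fallback
> transparently.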
>
> Attached is a patch file we apply to hadoop.  Note that we derived this
> patch by modding the Cloudera release hadoop-2.0.0-cdh4.1.1 which can be
> found at:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz
>
> We would greatly appreciate any advice on whether or not this approach
> sounds reasonable, and if you would consider accepting these modifications
> into the official Hadoop code base.
>
> Thank you,
> Jeff, Alja & Matevz
> UCSD Physics
>



-- 
Todd Lipcon
Software Engineer, Cloudera
