Thanks Colin, I subscribed to HDFS-2832 so that I can follow the development there. I assume this is targeting release 2.1.

Best,
Matevz


On 08/08/13 12:10, Colin McCabe wrote:
There is work underway to decouple the block layer and the namespace
layer of HDFS from each other.  Once this is done, block behaviors
like the one you describe will be easy to implement.  It's a use case
very similar to the hierarchical storage management (HSM) use case
that we've discussed before.  Check out HDFS-2832.  Hopefully there
will be a design document posted soon.

cheers,
Colin


On Thu, Aug 8, 2013 at 11:52 AM, Matevz Tadel <mta...@ucsd.edu> wrote:
Hi everybody,

I'm jumping in as Jeff is away due to an unexpected annoyance involving
Californian wildlife.


On 8/7/13 7:47 PM, Andrew Wang wrote:

Blocks are supposed to be an internal abstraction within HDFS, and aren't an
inherent part of FileSystem (the user-visible class used to access all Hadoop
filesystems).


Yes, but it's a really useful abstraction :) Do you really believe the
blocks could be abandoned in the next couple of years? I mean, it's such a
simple and effective solution ...


Is it possible to instead deal with files and offsets? On a read failure, you
could open a stream to the same file on the backup filesystem, seek to the old
file position, and retry the read. This feels like it's possible via wrapping.
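
(For concreteness, the kind of wrapping suggested above might look roughly like
the sketch below. The class name and the idea of treating the backup location
as just another Hadoop FileSystem are assumptions for illustration, not
existing API.)

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Hypothetical wrapper: on a read error, reopen the same file on a
     *  backup FileSystem, seek to the old position and retry the read. */
    public class FailoverInputStream extends InputStream {
      private FSDataInputStream in;        // current (primary) stream
      private final FileSystem backupFs;   // secondary copy of the data
      private final Path path;

      public FailoverInputStream(FSDataInputStream primary,
                                 FileSystem backupFs, Path path) {
        this.in = primary;
        this.backupFs = backupFs;
        this.path = path;
      }

      @Override
      public int read(byte[] buf, int off, int len) throws IOException {
        try {
          return in.read(buf, off, len);
        } catch (IOException e) {
          long pos = in.getPos();          // where the failure happened
          FSDataInputStream backup = backupFs.open(path);
          backup.seek(pos);                // resume at the same file offset
          in = backup;                     // further reads use the backup
          return in.read(buf, off, len);
        }
      }

      @Override
      public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return (n == -1) ? -1 : (one[0] & 0xff);
      }
    }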


As Jeff briefly mentioned, all USCMS sites export their data via XRootd (not
all of them use HDFS!) and we developed a specialization of the XRootd caching
proxy that can fetch only the requested blocks (the block size is passed from
our input stream class to the XRootd client (via JNI) and on to the proxy
server) and keep them in a local cache. This allows us to do three things:

1. the first time we notice a block is missing, a whole block is fetched
from elsewhere and further access requests from the same process get
fulfilled with zero latency;

2. later requests from other processes asking for this block are fulfilled
immediately (well, after the initial 3 retries);

3. we have a list of blocks that were fetched and we could (this is what we
want to look into in the near future) re-inject them into HDFS if the data
loss turns out to be permanent (bad disk vs. node that was
offline/overloaded for a while).

Handling exceptions at the block level thus gives us just what we need. As the
input stream is the place where these errors become known, it is, I think,
also the easiest place to handle them.
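
Roughly, the caching fallback behaves like the sketch below (much simplified:
ExternalFetcher stands in for the XRootd client reached via JNI, and all names
are made up for illustration):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.file.Files;
    import java.nio.file.Path;

    /** Sketch of the block-level fallback: fetch a whole block from outside
     *  the cluster on the first miss, cache it locally and serve later
     *  requests from the local copy. */
    public class ExternalBlockCache {

      /** Stands in for the XRootd client reached via JNI. */
      public interface ExternalFetcher {
        byte[] fetchBlock(String file, long blockIndex, long blockSize)
            throws IOException;
      }

      private final Path cacheDir;
      private final ExternalFetcher fetcher;

      public ExternalBlockCache(Path cacheDir, ExternalFetcher fetcher) {
        this.cacheDir = cacheDir;
        this.fetcher = fetcher;
      }

      /** Read 'len' bytes at 'offsetInBlock' of the given block, fetching
       *  and caching the whole block first if we don't have it yet. */
      public int read(String file, long blockIndex, long blockSize,
                      long offsetInBlock, byte[] buf, int off, int len)
          throws IOException {
        Path cached =
            cacheDir.resolve(file.replace('/', '_') + "." + blockIndex);
        if (!Files.exists(cached)) {
          // 1. first miss: fetch the whole block from elsewhere
          byte[] block = fetcher.fetchBlock(file, blockIndex, blockSize);
          Files.write(cached, block);
          // (a record of fetched blocks can be kept here for re-injection)
        }
        // 2. this and any later request is served from the local copy
        try (RandomAccessFile raf =
                 new RandomAccessFile(cached.toFile(), "r")) {
          raf.seek(offsetInBlock);
          return raf.read(buf, off, len);
        }
      }
    }

Point 3 above would then amount to walking the cache directory and copying the
saved blocks back into HDFS once the loss is known to be permanent.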

I'll understand if you find opening up the interfaces in the central
repository unacceptable. We can always apply the patch at the OSG level, where
the rpms for all our deployments get built.

Thanks & Best regards,
Matevz


On Wed, Aug 7, 2013 at 3:29 PM, Jeff Dost <jd...@ucsd.edu> wrote:

     Thank you for the suggestion, but we don't see how simply wrapping a
     FileSystem object would be sufficient in our use case.  The reason why is
     we need to catch and handle read exceptions at the block level.  There
     aren't any public methods available in the high level FileSystem
     abstraction layer that would give us the fine grained control we need
     over block level read failures.

     Perhaps if I outline the steps more clearly it will help explain what we
     are trying to do.  Without our enhancements, suppose a user opens a file
     stream and starts reading the file from Hadoop.  After some time, at some
     position into the file, if there happen to be no replicas available for a
     particular block for whatever reason (datanodes have gone down due to
     disk issues, etc.), the stream will throw an IOException
     (BlockMissingException or similar) and the read will fail.

     What we are doing is, rather than letting the stream fail, we have
     another stream queued up that knows how to fetch, from outside our Hadoop
     cluster, the blocks that couldn't be retrieved.  So we need to be able to
     catch the exception at this point; the externally fetched bytes then get
     read into the user supplied read buffer, and Hadoop can proceed to read
     the next blocks in the file from the stream.

     So as you can see, this method of fail-over on demand allows an input
     stream to keep reading data without having to start all over again if a
     failure occurs (assuming the remote bytes were successfully fetched).
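
     Conceptually, the overridden read does something like the following
     (illustration only: the real code lives in a DFSInputStream subclass, and
     the external source below is a hypothetical stand-in for whatever fetches
     the missing bytes from outside HDFS):

         import java.io.IOException;

         import org.apache.hadoop.fs.FSDataInputStream;
         import org.apache.hadoop.hdfs.BlockMissingException;

         public class OnDemandFailover {

           /** Hypothetical source for bytes HDFS cannot serve itself. */
           public interface ExternalBlockSource {
             int read(String file, long blockIndex, long blockSize,
                      long offsetInBlock, byte[] buf, int off, int len)
                 throws IOException;
           }

           public static int read(FSDataInputStream in, String file,
                                  long blockSize,
                                  ExternalBlockSource external,
                                  byte[] buf, int off, int len)
               throws IOException {
             long pos = in.getPos();
             try {
               return in.read(buf, off, len);        // normal HDFS read
             } catch (BlockMissingException e) {
               long blockIndex = pos / blockSize;    // which block failed
               long offsetInBlock = pos % blockSize;
               // don't ask for more than what is left in the missing block
               int want = (int) Math.min(len, blockSize - offsetInBlock);
               int n = external.read(file, blockIndex, blockSize,
                                     offsetInBlock, buf, off, want);
               if (n > 0) {
                 in.seek(pos + n);  // continue with the following blocks
               }
               return n;
             }
           }
         }

     The seek past the externally served bytes is what lets the stream pick up
     again at the next block that HDFS can serve itself.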

     As a final note I would like to mention that we will be providing our
     failover module to the Open Science Grid.  Since we hope to provide this
     as a benefit to all OSG users running at participating T2 computing
     clusters, we will be committed to maintaining this software and any
     changes to Hadoop needed to make it work.  In other words we will be
     willing to maintain any implementation changes that may become necessary
     as Hadoop internals change in future releases.

     Thanks,
     Jeff


     On 8/7/13 11:30 AM, Andrew Wang wrote:

          I don't think exposing DFSClient and DistributedFileSystem members
          is necessary to achieve what you're trying to do. We've got wrapper
          FileSystems like FilterFileSystem and ViewFileSystem which you might
          be able to use for inspiration, and the HCFS wiki lists some
          third-party FileSystems that might also be helpful.


          On Wed, Aug 7, 2013 at 11:11 AM, Joe Bounour <jboun...@ddn.com> wrote:

             Hello Jeff

             Is it something that could go under HCFS project?
              http://wiki.apache.org/hadoop/HCFS
             (I might be wrong?)

             Joe


              On 8/7/13 10:59 AM, "Jeff Dost" <jd...@ucsd.edu> wrote:

                 Hello,

                  We work in a software development team at the UCSD CMS
                  Tier2 Center.  We would like to propose a mechanism to allow
                  one to subclass the DFSInputStream in a clean way from an
                  external package.  First I'd like to give some motivation on
                  why and then will proceed with the details.

                  We have a 3 Petabyte Hadoop cluster we maintain for the LHC
                  experiment at CERN.  There are other T2 centers worldwide
                  that contain mirrors of the same data we host.  We are
                  working on an extension to Hadoop so that, when reading a
                  file, if it is found that there are no available replicas of
                  a block, an external interface is used to retrieve this
                  block of the file from another data center.  The external
                  interface is necessary because not all T2 centers involved
                  in CMS are running a Hadoop cluster as their storage
                  backend.

                  In order to implement this functionality, we need to
                  subclass the DFSInputStream and override the read method, so
                  we can catch IOExceptions that occur on client reads at the
                  block level.

                  The basic steps required:
                  1. Invent a new URI scheme for the customized "FileSystem"
                     in core-site.xml:
                      <property>
                        <name>fs.foofs.impl</name>
                        <value>my.package.FooFileSystem</value>
                        <description>My Extended FileSystem for foofs: uris.</description>
                      </property>

                  2. Write new classes included in the external package that
                     subclass the following:
                     FooFileSystem subclasses DistributedFileSystem
                     FooFSClient subclasses DFSClient
                     FooFSInputStream subclasses DFSInputStream

                  Now any client commands that explicitly use the foofs://
                  scheme in paths to access the hadoop cluster can open files
                  with a customized InputStream that extends functionality of
                  the default hadoop client DFSInputStream.  In order to make
                  this happen for our use case, we had to change some access
                  modifiers in the DistributedFileSystem, DFSClient, and
                  DFSInputStream classes provided by Hadoop.  In addition, we
                  had to comment out the check in the namenode code that only
                  allows for URI schemes of the form "hdfs://".
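
                  To make the wiring concrete, the skeleton of the external
                  package looks roughly like this (a sketch only, with
                  hypothetical details; it compiles only with the relaxed
                  access modifiers just described):

                      import java.io.IOException;
                      import java.net.URI;

                      import org.apache.hadoop.conf.Configuration;
                      import org.apache.hadoop.hdfs.DistributedFileSystem;

                      public class FooFileSystem extends DistributedFileSystem {
                        @Override
                        public void initialize(URI uri, Configuration conf)
                            throws IOException {
                          super.initialize(uri, conf);
                          // With the modifiers relaxed, the client created by
                          // the superclass can be replaced by ours, e.g.:
                          //   this.dfs = new FooFSClient(uri, conf, statistics);
                          // FooFSClient.open() then returns FooFSInputStream
                          // objects, whose read() catches
                          // BlockMissingException and falls back to the
                          // external fetch instead of failing.
                        }
                      }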

                  Attached is a patch file we apply to hadoop.  Note that we
                  derived this patch by modding the Cloudera release
                  hadoop-2.0.0-cdh4.1.1 which can be found at:
                  http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz

                  We would greatly appreciate any advice on whether or not
                  this approach sounds reasonable, and if you would consider
                  accepting these modifications into the official Hadoop code
                  base.

                 Thank you,
                 Jeff, Alja & Matevz
                 UCSD Physics





