Gera Shegalov created HADOOP-12077:
--------------------------------------

             Summary: Provide a multi-URI replication Inode for ViewFs
                 Key: HADOOP-12077
                 URL: https://issues.apache.org/jira/browse/HADOOP-12077
             Project: Hadoop Common
          Issue Type: New Feature
          Components: fs
            Reporter: Gera Shegalov
            Assignee: Gera Shegalov


This JIRA is to provide simple "replication" capabilities for applications that 
maintain logically equivalent paths in multiple locations for caching or 
failover (e.g., S3 and HDFS). We noticed a simple common HDFS usage pattern in 
our applications. They host their data on some logical cluster C. There are 
corresponding HDFS clusters in multiple datacenters. When the application runs 
in DC1, it prefers to read from C in DC1, and it prefers to fail over to C in 
DC2 if the application is migrated to DC2 or when C in DC1 is unavailable. New 
application data versions are created periodically and relatively 
infrequently. 

In order to address many common scenarios in a general fashion, and to avoid 
unnecessary code duplication, we implement this functionality in ViewFs (our 
default FileSystem spanning all clusters in all datacenters) in a project 
code-named Nfly (N as in N datacenters). Currently each ViewFs Inode points to 
a single URI via ChRootedFileSystem. Consequently, we introduce a new type of 
link that points to a list of URIs, each of which is wrapped in a 
ChRootedFileSystem. A typical usage: /nfly/C/user->/DC1/C/user,/DC2/C/user,... 
This collection of ChRootedFileSystem instances is fronted by the Nfly 
filesystem object that is actually used for the mount point/Inode. The Nfly 
filesystem backs a single logical path /nfly/C/user/<user>/path by multiple 
physical paths.
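
The mapping from one logical path to several physical paths can be sketched as 
follows. This is only an illustration of the idea, not the ViewFs code: the 
class NflyLinkSketch and its resolve method are hypothetical names, and plain 
strings stand in for the ChRootedFileSystem instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a multi-target link; not part of ViewFs.
final class NflyLinkSketch {
  private final String mountPoint;    // e.g. /nfly/C/user
  private final List<String> targets; // e.g. /DC1/C/user, /DC2/C/user

  NflyLinkSketch(String mountPoint, List<String> targets) {
    this.mountPoint = mountPoint;
    this.targets = new ArrayList<>(targets);
  }

  /** Maps a logical path under the mount point to one physical path per target. */
  List<String> resolve(String logicalPath) {
    if (!logicalPath.equals(mountPoint) && !logicalPath.startsWith(mountPoint + "/")) {
      throw new IllegalArgumentException(logicalPath + " is not under " + mountPoint);
    }
    // The remainder after the mount point is appended to every target root.
    String remainder = logicalPath.substring(mountPoint.length());
    List<String> physical = new ArrayList<>();
    for (String target : targets) {
      physical.add(target + remainder);
    }
    return physical;
  }

  public static void main(String[] args) {
    NflyLinkSketch link =
        new NflyLinkSketch("/nfly/C/user", Arrays.asList("/DC1/C/user", "/DC2/C/user"));
    System.out.println(link.resolve("/nfly/C/user/alice/path"));
  }
}
```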

The Nfly filesystem supports setting minReplication. As long as the number of 
URIs on which an update has succeeded is greater than or equal to 
minReplication, exceptions are only logged, not thrown. Each update operation 
is currently executed serially (client-bandwidth-driven parallelism will be 
added later). 
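
A minimal sketch of the minReplication rule, under stated assumptions: the 
class MinReplicationSketch, the Update interface, and updateAll are 
hypothetical names used only for illustration, and destinations are plain 
strings rather than filesystems.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the minReplication rule; not the Nfly code.
final class MinReplicationSketch {

  /** Stand-in for one update (mkdir, rename, ...) against one destination. */
  interface Update {
    void applyTo(String destination) throws IOException;
  }

  /**
   * Applies the update to each destination in turn and returns the number of
   * successes. Failures are logged; an IOException is thrown only when fewer
   * than minReplication destinations succeeded.
   */
  static int updateAll(List<String> destinations, int minReplication, Update update)
      throws IOException {
    List<IOException> failures = new ArrayList<>();
    int succeeded = 0;
    for (String destination : destinations) { // serial, one destination at a time
      try {
        update.applyTo(destination);
        succeeded++;
      } catch (IOException e) {
        failures.add(e);
        System.err.println("update failed on " + destination + ": " + e.getMessage());
      }
    }
    if (succeeded < minReplication) {
      IOException tooFew = new IOException("updated " + succeeded + " of "
          + destinations.size() + " destinations, minReplication=" + minReplication);
      for (IOException failure : failures) {
        tooFew.addSuppressed(failure);
      }
      throw tooFew;
    }
    return succeeded;
  }

  public static void main(String[] args) throws IOException {
    int updated = updateAll(Arrays.asList("/DC1/C/user", "/DC2/C/user"), 1, destination -> {
      if (destination.startsWith("/DC2")) {
        throw new IOException("DC2 unreachable");
      }
    });
    System.out.println(updated + " destination(s) updated");
  }
}
```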

A file create/write: 
# Creates a temporary invisible _nfly_tmp_file in each intended chrooted 
filesystem. 
# Returns an FSDataOutputStream that wraps the output streams returned by 
step 1.
# All writes are forwarded to each output stream.
# On close of the stream created in step 2, all n streams are closed, and the 
files are renamed from _nfly_tmp_file to file. All files receive the same mtime, 
corresponding to the client system time as of the beginning of this step. 
# If at least minReplication destinations have gone through steps 1-4 without 
failures, the transaction is considered logically committed; otherwise a 
best-effort attempt is made to clean up the temporary files.
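
The steps above can be sketched with java.nio.file, using local directories as 
stand-ins for the chrooted filesystems. This is an assumption-laden 
illustration, not the Nfly implementation: the class NflyWriteSketch is 
hypothetical, and the temporary name is formed here by prefixing the file name 
with _nfly_tmp_.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; local directories stand in for chrooted filesystems.
final class NflyWriteSketch extends OutputStream {
  private static final String TMP_PREFIX = "_nfly_tmp_";

  private final List<Path> tmpFiles = new ArrayList<>();
  private final List<Path> finalFiles = new ArrayList<>();
  private final List<OutputStream> streams = new ArrayList<>(); // null marks a failed destination
  private final int minReplication;
  private boolean closed;

  NflyWriteSketch(List<Path> destinationDirs, String fileName, int minReplication) {
    this.minReplication = minReplication;
    for (Path dir : destinationDirs) {
      Path tmp = dir.resolve(TMP_PREFIX + fileName);
      tmpFiles.add(tmp);
      finalFiles.add(dir.resolve(fileName));
      OutputStream out = null;
      try {
        out = Files.newOutputStream(tmp); // step 1: temporary file per destination
      } catch (IOException e) {
        System.err.println("create failed for " + tmp + ": " + e);
      }
      streams.add(out);
    }
  }

  @Override
  public void write(int b) {
    write(new byte[] {(byte) b}, 0, 1);
  }

  @Override
  public void write(byte[] buf, int off, int len) {
    for (int i = 0; i < streams.size(); i++) { // step 3: forward to every live stream
      OutputStream out = streams.get(i);
      if (out == null) {
        continue;
      }
      try {
        out.write(buf, off, len);
      } catch (IOException e) {
        streams.set(i, null); // drop this destination for the rest of the write
        try { out.close(); } catch (IOException ignored) { }
      }
    }
  }

  @Override
  public void close() throws IOException {
    if (closed) {
      return;
    }
    closed = true;
    // One client-side timestamp, taken at the start of the close step.
    FileTime mtime = FileTime.fromMillis(System.currentTimeMillis());
    int committed = 0;
    for (int i = 0; i < streams.size(); i++) { // step 4: close, rename, set mtime
      OutputStream out = streams.get(i);
      if (out == null) {
        continue;
      }
      try {
        out.close();
        Files.move(tmpFiles.get(i), finalFiles.get(i), StandardCopyOption.REPLACE_EXISTING);
        Files.setLastModifiedTime(finalFiles.get(i), mtime);
        committed++;
      } catch (IOException e) {
        System.err.println("commit failed for " + finalFiles.get(i) + ": " + e);
      }
    }
    if (committed < minReplication) { // step 5: best-effort cleanup of temporary files
      for (Path tmp : tmpFiles) {
        try { Files.deleteIfExists(tmp); } catch (IOException ignored) { }
      }
      throw new IOException("committed " + committed + " destination(s), minReplication="
          + minReplication);
    }
  }
}
```

For example, constructing the stream over two existing directories and one 
unreachable one with minReplication set to 2 would leave the file in both 
reachable directories with identical mtimes, and close() would not throw.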

As for reads, we support a notion of locality similar to HDFS /DC/rack/node. 
We sort Inode URIs using NetworkTopology by their authorities. These are 
typically host names in simple HDFS URIs. If the authority is missing, as is 
the case with the local file:///, the local host name as returned by 
InetAddress.getLocalHost() is assumed. This ensures that the local file system 
is always the closest one to the reader in this approach. For our Hadoop 2 hdfs 
URIs, which are based on nameservice ids instead of hostnames, it is very easy 
to adjust the topology script since our nameservice ids already contain the 
datacenter. As for rack and node, we can simply output any string such as 
/DC/rack-nsid/node-nsid, since we only care about datacenter-locality for such 
filesystem clients.
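
A minimal sketch of this ordering, not using Hadoop's NetworkTopology class 
itself: the class NflyTopologySketch, the resolver function (a stand-in for 
the topology script), and the host names are all hypothetical. The distance is 
the number of hops from each location up to the closest common ancestor, which 
is the usual tree-distance notion for /DC/rack/node locations.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of locality sorting; not Hadoop's NetworkTopology.
final class NflyTopologySketch {

  /** Tree distance between two locations of the form /DC/rack/node. */
  static int distance(String a, String b) {
    String[] pa = a.substring(1).split("/");
    String[] pb = b.substring(1).split("/");
    int common = 0;
    while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
      common++;
    }
    return (pa.length - common) + (pb.length - common);
  }

  /**
   * Sorts URIs by distance from localHost. The resolver maps an authority to a
   * /DC/rack/node location. A URI without an authority (e.g. file:///) is
   * treated as local; a real caller could pass
   * InetAddress.getLocalHost().getHostName() as localHost.
   */
  static List<URI> sortByLocality(List<URI> uris, String localHost,
      Function<String, String> resolver) {
    String here = resolver.apply(localHost);
    List<URI> sorted = new ArrayList<>(uris);
    sorted.sort(Comparator.comparingInt((URI uri) -> {
      String authority = uri.getAuthority() != null ? uri.getAuthority() : localHost;
      return distance(here, resolver.apply(authority));
    }));
    return sorted;
  }

  public static void main(String[] args) {
    // Assumed naming scheme: the datacenter is the suffix of the nameservice id.
    Function<String, String> resolver =
        id -> (id.endsWith("dc2") ? "/DC2" : "/DC1") + "/rack-" + id + "/node-" + id;
    List<URI> uris = Arrays.asList(
        URI.create("hdfs://ns-dc2/C/user"),
        URI.create("hdfs://ns-dc1/C/user"),
        URI.create("file:///C/user"));
    System.out.println(sortByLocality(uris, "client-dc1", resolver));
  }
}
```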

There are 2 policies/additions to the read call path that make it more 
expensive but improve user experience:
- readMostRecent - when this policy is enabled, Nfly first checks mtime for the 
path under all URIs and sorts them from most recent to least recent. Nfly then 
sorts the set of most recent URIs topologically in the same manner as described 
above.
- repairOnRead - when readMostRecent is enabled, Nfly already has to RPC all 
underlying destinations. With repairOnRead, the Nfly filesystem additionally 
attempts to refresh destinations where the path is missing or stale, using the 
nearest available most recent destination. 
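
The two policies can be sketched together as follows. Everything here is a 
hypothetical illustration: the Replica class, the use of -1 to mark a missing 
path, and the precomputed distance field are assumptions, and the combined 
comparator (most recent first, ties broken by distance) is one possible 
reading of the ordering described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of readMostRecent and repairOnRead; not the Nfly code.
final class ReadMostRecentSketch {

  static final long MISSING = -1L; // assumed marker: path absent at this destination

  static final class Replica {
    final String uri;
    final long mtime;   // modification time, or MISSING
    final int distance; // topological distance from the reader

    Replica(String uri, long mtime, int distance) {
      this.uri = uri;
      this.mtime = mtime;
      this.distance = distance;
    }
  }

  /** Existing replicas, most recent first; equal mtimes are ordered nearest first. */
  static List<Replica> readOrder(List<Replica> replicas) {
    List<Replica> present = new ArrayList<>();
    for (Replica r : replicas) {
      if (r.mtime != MISSING) {
        present.add(r);
      }
    }
    present.sort(Comparator.comparingLong((Replica r) -> r.mtime).reversed()
        .thenComparingInt(r -> r.distance));
    return present;
  }

  /** Replicas that are missing or older than the most recent one (repairOnRead targets). */
  static List<Replica> needingRepair(List<Replica> replicas) {
    long newest = MISSING;
    for (Replica r : replicas) {
      newest = Math.max(newest, r.mtime);
    }
    List<Replica> stale = new ArrayList<>();
    for (Replica r : replicas) {
      if (r.mtime < newest) {
        stale.add(r);
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    List<Replica> replicas = Arrays.asList(
        new Replica("/DC2/C/user/f", 200L, 6),
        new Replica("/DC1/C/user/f", 100L, 0),
        new Replica("/DC3/C/user/f", MISSING, 4));
    System.out.println("read from: " + readOrder(replicas).get(0).uri);
    for (Replica r : needingRepair(replicas)) {
      System.out.println("repair: " + r.uri);
    }
  }
}
```

The first element of readOrder is the nearest of the most recent replicas, so 
it serves both as the read source and as the source for any repair.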




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
