On 8 August 2013 21:51, Matevz Tadel <mta...@ucsd.edu> wrote: > Hi Steve, > > Thank you very much for the reality check! Some more answers inline ... > > > On 8/8/13 1:30 PM, Steve Loughran wrote: > >> On 7 August 2013 10:59, Jeff Dost <jd...@ucsd.edu> wrote: >> >> Hello, >>> >>> We work in a software development team at the UCSD CMS Tier2 Center. We >>> would like to propose a mechanism to allow one to subclass the >>> DFSInputStream in a clean way from an external package. First I'd like >>> to >>> give some motivation on why and then will proceed with the details. >>> >>> We have a 3 Petabyte Hadoop cluster we maintain for the LHC experiment at >>> CERN. There are other T2 centers worldwide that contain mirrors of the >>> same data we host. We are working on an extension to Hadoop that, on >>> reading a file, if it is found that there are no available replicas of a >>> block, we use an external interface to retrieve this block of the file >>> from >>> another data center. The external interface is necessary because not all >>> T2 centers involved in CMS are running a Hadoop cluster as their storage >>> backend. >>> >>> >> You are relying on all these T2 site being HDFS clusters with exactly the >> same block sizing of a named file, so that you can pick up a copy of a >> block from elsewhere? >> > > No, it's not just HDFS. But we have a common access method for all sites, > XRootd (http://xrootd.slac.stanford.**edu/<http://xrootd.slac.stanford.edu/>), > and that's what our implementation of input stream falls back to using. >
OK: so any block access issue triggers fallback. > > Also, HDFS block sizes are not the same ... it's in the hands of T2 admins. > > > Why not just handle a failing FileNotFoundException on an open() and >> either >> relay to another site or escalate to T1/T0/Castor? This would be far >> easier >> to implement with a front end wrapper for any FileSystem. What you are >> proposing seems far more brittle in both code and execution. >> > > We already do fallback to xrootd on open failures from our application > framework ... the job gets redirected to xrootd proxy which downloads the > whole file and serves data as the job asks for it. The changes described by > Jeff are an optimization for cases when a single block becomes unavailable. > > Like other said, there's work on heterogenous storage. Maybe you could make sure that there is some handling there for block unavailablity events -then you can hook in that to handle it. > We have a lot of data that is replicated on several sites and also > available on tape at Fermilab. Popularity of datasets (a couple 100 TB) > varies quite a lot and what we would like to achieve is to be able to > reduce replication down to one for files that not many people care for at > the moment. This should give us about 1 PB of diskspace on every T2 center > and allow us to be more proactive with data movement by simply observing > the job queues. > > > In order to implement this functionality, we need to subclass the >>> DFSInputStream and override the read method, so we can catch IOExceptions >>> that occur on client reads at the block level. >>> >>> >> Specific IOEs related to missing blocks, or any IOE -as that could swallow >> other problems? >> > > We were looking at that but couldn't quite figure out what else can go > wrong ... so the plan was to log all the exceptions and see what we can do > about each of them. > network problems, security. If there isn't a specific IOE subclass for block-not-found, there ought to be. > > Thanks also for the comments below, Jeff will appreciate them more as he > actually went through all the HDFS code and did all the changes in the > patch, I did the JNI + Xrootd part. > > One more question ... if we'd converge on an acceptable solution (I find > it a bit hard at the moment), how long would it take for the changes to be > released and what release cycle would it end up in? > all changes go into trunk, if it was go to into the 2.x line then it would be 2.3 at the earliest; 2.1 is going through its beta right now. If you work against 2.1 and report bugs now, that would help the beta and make it easier for you to have a private fork of 2.1.x with the extensions