Hi Matt, I'd also recommend implementing this in a somewhat pluggable way -- e.g., a configuration option for a Deleter class. The default Deleter can be the one we use today, which just removes the file, and you could plug in a SecureDeleter. I'd also see some use cases for a Deleter implementation which doesn't actually delete the block, but instead moves it to a local trash directory which is deleted a day or two later. This sort of policy can help recover data as a last-ditch effort if there is some kind of accidental deletion and there aren't snapshots in place.
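(For illustration, a minimal sketch of what that plug-in point could look like -- the interface and class names below are hypothetical, not existing HDFS code, and a real patch would hook into the DataNode's configuration machinery:)

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical plug-in point: names are illustrative, not actual HDFS classes.
interface Deleter {
    void delete(File blockFile) throws IOException;
}

// Default behaviour -- what the DataNode does today: just unlink the file.
class DefaultDeleter implements Deleter {
    public void delete(File blockFile) throws IOException {
        if (!blockFile.delete()) {
            throw new IOException("Failed to delete " + blockFile);
        }
    }
}

// Secure variant: overwrite the contents with alternating 0x00/0xFF passes
// before unlinking, so the old data is not trivially recoverable.
class SecureDeleter implements Deleter {
    private final int passes;

    SecureDeleter(int passes) {
        this.passes = passes;
    }

    public void delete(File blockFile) throws IOException {
        // "rws" asks for synchronous writes of content and metadata.
        try (RandomAccessFile raf = new RandomAccessFile(blockFile, "rws")) {
            long length = raf.length();
            byte[] buf = new byte[64 * 1024];
            for (int pass = 0; pass < passes; pass++) {
                Arrays.fill(buf, (pass % 2 == 0) ? (byte) 0x00 : (byte) 0xFF);
                raf.seek(0);
                long remaining = length;
                while (remaining > 0) {
                    int n = (int) Math.min(buf.length, remaining);
                    raf.write(buf, 0, n);
                    remaining -= n;
                }
            }
        }
        if (!blockFile.delete()) {
            throw new IOException("Failed to delete " + blockFile);
        }
    }
}
```

The DataNode would instantiate whichever Deleter the configuration names and call it wherever it unlinks block files today.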
-Todd

On Thu, Aug 15, 2013 at 11:50 AM, Andrew Wang <andrew.w...@cloudera.com> wrote:

> Hi Matt,
>
> Here are some code pointers:
>
> - When doing a file deletion, the NameNode turns the file into a set of
>   blocks that need to be deleted.
> - When DataNodes heartbeat in to the NN (see BPServiceActor#offerService),
>   the NN replies with blocks to be invalidated (see BlockCommand and
>   DatanodeProtocol.DNA_INVALIDATE).
> - The DN processes these invalidates in
>   BPServiceActor#processCommandFromActive (look for DNA_INVALIDATE again).
> - The magic lines you're looking for are probably in
>   FsDatasetAsyncDiskService#run, since we delete blocks in the background.
>
> Best,
> Andrew
>
> On Thu, Aug 15, 2013 at 5:31 AM, Matt Fellows <matt.fell...@bespokesoftware.com> wrote:
>
>> Hi,
>> I'm looking into writing a patch for HDFS which will provide a new method
>> within HDFS that can securely delete the contents of a block on all the
>> nodes on which it exists. By "securely delete" I mean: overwrite with
>> 1s/0s/random data cyclically, such that the data could not be recovered
>> forensically.
>>
>> I'm not currently aware of any existing code or methods which provide
>> this, so I was going to implement it myself.
>>
>> I figured DataNode.java was probably the place to start looking into how
>> this could be done, so I've read the source for it, but it hasn't really
>> enlightened me a massive amount.
>>
>> I'm assuming I need to tell the NameNode that all DataNodes with a
>> particular block ID are required to delete it; then, as each DataNode
>> calls home, it would be instructed to securely delete the relevant
>> block, and it would oblige.
>>
>> Unfortunately I have no idea where to begin, so I was looking for some
>> pointers. I guess specifically I'd like to know:
>>
>> 1. Where the hdfs CLI commands are implemented
>> 2. How a DataNode identifies a block, and how the NameNode could inform
>>    a DataNode to delete a block
>> 3. Where the existing "delete" is implemented, so I can make sure my
>>    secure delete makes use of it after successfully blanking the block
>>    contents
>> 4. Whether I've got the right idea about this at all
>>
>> Kind regards,
>> Matt Fellows

-- 
Todd Lipcon
Software Engineer, Cloudera
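(Circling back to the trash-directory policy Todd suggests at the top of the thread, a minimal sketch of that variant -- again, all names here are hypothetical, and a real version would plug into the same configurable delete path and add a background purge task for the retention window:)

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical: instead of unlinking a block file, move it into a local
// trash directory. A separate background task (not shown) would purge
// entries older than the configured retention period, e.g. a day or two.
class TrashDeleter {
    private final Path trashDir;

    TrashDeleter(Path trashDir) throws IOException {
        this.trashDir = trashDir;
        Files.createDirectories(trashDir);
    }

    public void delete(File blockFile) throws IOException {
        // Prefix the entry with a timestamp so the purge task can apply
        // the retention policy without extra bookkeeping.
        Path target = trashDir.resolve(
            System.currentTimeMillis() + "_" + blockFile.getName());
        Files.move(blockFile.toPath(), target);
    }
}
```

Since the trash directory lives on the same volume as the block, the move is a cheap rename, and an accidentally deleted block can be restored by moving the file back before the purge runs.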