Hi Gabi, I'd be a bit scared of that backup strategy; what happens if the TCP connection gets cut suddenly during curl? What happens if there's a TCP corruption? Such things have happened before.
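To make the concern concrete, here is a minimal sketch of a more defensive download, assuming bash and curl are available. The function name `fetch_checked` is hypothetical, and this is only one possible approach. It catches failed or detectably truncated transfers and avoids overwriting the previous good copy, but it does not detect silent corruption of the bytes themselves.

```shell
#!/usr/bin/env bash
# Sketch only: download to a temporary file and replace the previous backup
# only if curl reports success. fetch_checked is a hypothetical helper name.

fetch_checked() {
    local url="$1" dest="$2" tmp

    # Create the temp file next to the destination so the final mv is a
    # rename on the same filesystem rather than a copy.
    tmp="$(mktemp "${dest}.XXXXXX")" || return 1

    # --fail makes curl return non-zero on HTTP errors (>= 400) instead of
    # saving the error page. curl also returns non-zero when it can detect
    # a cut connection, e.g. fewer bytes than the announced Content-Length.
    if ! curl --fail --silent --show-error --output "$tmp" "$url"; then
        rm -f "$tmp"
        return 1
    fi

    # Assumption: an empty file is treated as a failed download.
    if [ ! -s "$tmp" ]; then
        rm -f "$tmp"
        return 1
    fi

    # Only now replace the previous backup.
    mv -f "$tmp" "$dest"
}
```

With this in place, the two downloads from the original message would become `fetch_checked "http://nameNode:50070/getimage?getimage=1" fsimage` and `fetch_checked "http://nameNode:50070/getimage?getedits=1" edits`, each followed by a check of the return status. Keeping versioned copies rather than a single overwritten file would still be advisable.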

Personally, we have the SNN merge the edits every 15 minutes. If it hasn't happened in 30 minutes, people get emailed. If it doesn't happen in 45 minutes, people get paged. In addition to writing out copies to a few disks and to NFS, we also have a versioned backup of the checkpoint.prev.

The worst case scenario would be if the SNN corrupts the image and uploads the corrupt image (it's a theoretical situation so far...); this would be caught at the next merge, meaning we trash up to 30 minutes of work. This would ruin someone's day, but not someone's week.

The NN is a SPOF, and should be treated with an appropriate level of paranoia (and, because it is a SPOF, assume that it will fail anyway and make sure you can accept the consequences).

Brian

On Sep 22, 2011, at 3:48 AM, Gabi Kazav wrote:

> Hi,
>
> I had Power Failure.
> I have backup of files: edits, fsimage.
>
> I am backing it up with:
>
> curl -s http://nameNode:50070/getimage?getimage=1 > fsimage
> curl -s http://nameNode:50070/getimage?getedits=1 > edits
>
> When I am trying to start the HDFS with the recovered files, I got error
> about the edits file : "Error replaying edit log at offset 1921"
>
> Also, I have edits.new file, when I rename it to edits I got: "ERROR
> org.apache.hadoop.hdfs.server.common.Storage: Error replaying edit log at
> offset 2494103"
>
> What is the problem?!
>
> And from now on, how can I do a backup that works?! :)
>
> Thanks,
> Gabi.
>
> Gabi Kazav
> IT Manager And Infrastructure Engineer
> gabi.ka...@pursway.com | www.pursway.com
> Mailing address PO Box 4125, Herzliya 46140
> Address 8 Hamada St., Herzliya, IL | Tel +972 527 772457 | Fax +972 9 958 4736
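
The 30-minute email and 45-minute page escalation described in the reply above can be sketched as a small check. This is a rough illustration, not the actual monitoring setup: the function names `checkpoint_age_minutes` and `checkpoint_alert_level` are hypothetical, and how the time of the last successful merge is obtained (for example from the checkpoint file's modification time) is left as an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: map the age of the last successful SNN merge to an action.
# Thresholds follow the reply: email at 30 minutes, page at 45 minutes.

# Age in whole minutes, given the current time and the time of the last
# successful merge, both as seconds since the epoch.
checkpoint_age_minutes() {
    local now_epoch="$1" last_epoch="$2"
    echo $(( (now_epoch - last_epoch) / 60 ))
}

# Prints one of: ok, email, page.
checkpoint_alert_level() {
    local age_min="$1"
    if [ "$age_min" -ge 45 ]; then
        echo page
    elif [ "$age_min" -ge 30 ]; then
        echo email
    else
        echo ok
    fi
}
```

A cron job could call `checkpoint_alert_level "$(checkpoint_age_minutes "$(date +%s)" "$last_merge_epoch")"` every few minutes and act on the result, where `last_merge_epoch` is a placeholder for however the site records the last merge.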