Re: corrupted edits log after power failure

Brian Bockelman Thu, 22 Sep 2011 12:16:08 -0700

Hi Gabi,

I'd be a bit scared of that backup strategy; what happens if the TCP connection 
gets cut suddenly during curl?  What happens if there's a TCP corruption?  Such 
things have happened before.
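
One way to guard against a cut connection is to download to a temporary file and rename it into place only once curl reports success. The sketch below assumes the same servlet URLs as the original message. The function name `fetch_atomic` and the temp-file naming are hypothetical, and it assumes an empty file is never a valid image or edits log. It catches failed or empty transfers, not a complete-but-corrupt file.

```shell
#!/bin/sh
# Hypothetical helper: fetch URL into DEST via a temporary file, so a
# failed transfer never leaves a truncated file under the final name.
fetch_atomic() {
    url=$1
    dest=$2
    tmp="$dest.tmp.$$"
    # -f: exit non-zero on HTTP errors; -sS: quiet, but still print errors.
    # [ -s ... ] additionally rejects an empty download (assumption: an
    # empty fsimage or edits file is never valid).
    if curl -fsS -o "$tmp" "$url" && [ -s "$tmp" ]; then
        # Rename is atomic when tmp and dest are on the same filesystem.
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "fetch of $url failed; keeping previous $dest" >&2
        return 1
    fi
}
```

Called as `fetch_atomic 'http://nameNode:50070/getimage?getimage=1' fsimage`, a failed download leaves the previous fsimage untouched.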

Personally, we have the SNN merge the edits every 15 minutes.  If it hasn't 
happened in 30 minutes, people get emailed.  If it doesn't happen in 45 
minutes, people get paged.
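
The escalation rule above could be sketched roughly as follows. The function name is hypothetical, and how the age of the last merge is measured (for example, from the modification time of the merged image) is left as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: map the age of the last successful merge, in
# whole minutes, to an action using the 30/45-minute thresholds.
checkpoint_alert_level() {
    age_min=$1
    if [ "$age_min" -ge 45 ]; then
        echo page      # no merge for 45 minutes or more: page someone
    elif [ "$age_min" -ge 30 ]; then
        echo email     # no merge for 30 minutes or more: send email
    else
        echo ok        # merges are arriving on the 15-minute schedule
    fi
}
```

A cron job could call this every few minutes and dispatch on the printed word.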

In addition to writing out copies to a few disks and to NFS, we also have a 
versioned backup of the checkpoint.prev.
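
A versioned backup can be as simple as a timestamped copy that is verified before it is trusted. This is a minimal sketch, not the setup described above: the function name, the timestamp format, and the directory layout are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: keep a timestamped copy of SRC in DIR and verify
# the copy byte-for-byte against the source before reporting success.
backup_versioned() {
    src=$1
    dir=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    dest="$dir/$(basename "$src").$stamp"
    mkdir -p "$dir" || return 1
    cp "$src" "$dest" || return 1
    # cmp -s is silent and exits non-zero if the files differ.
    if ! cmp -s "$src" "$dest"; then
        rm -f "$dest"
        return 1
    fi
    echo "$dest"       # print the path of the new version
}
```

Pruning old versions is left out; because each run only adds a new file, an earlier good copy survives a later bad one.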

The worst-case scenario would be if the SNN corrupts the image and uploads the 
corrupt image (it's a theoretical situation so far...); this would be caught at 
the next merge, meaning we trash up to 30 minutes of work.  This would ruin 
someone's day, but not someone's week.

The NN is a SPOF, and should be treated with an appropriate level of paranoia 
(and, because it is a SPOF, assume that it will fail anyway and make sure you 
can accept the consequences).

Brian

On Sep 22, 2011, at 3:48 AM, Gabi Kazav wrote:

> Hi,
> 
> I had a power failure.
> I have a backup of these files: edits, fsimage.
> 
> I am backing it up with:
> 
> curl -s http://nameNode:50070/getimage?getimage=1 > fsimage
> curl -s http://nameNode:50070/getimage?getedits=1 > edits
> 
> When I try to start HDFS with the recovered files, I get an error 
> about the edits file: "Error replaying edit log at offset 1921"
> 
> Also, I have an edits.new file; when I rename it to edits, I get: "ERROR 
> org.apache.hadoop.hdfs.server.common.Storage: Error replaying edit log at 
> offset 2494103"
> 
> What is the problem?!
> 
> 
> And from now on, how can I do a backup that works?! :)
> 
> Thanks,
> Gabi.
> 
> 
> 
> 
> Gabi Kazav
> IT Manager And Infrastructure Engineer
> gabi.ka...@pursway.com | www.pursway.com
> Mailing address PO Box 4125, Herzliya 46140
> Address 8 Hamada St., Herzliya, IL | Tel +972 527 772457| Fax + 972 9 958 4736
>
