Hi Todd, we use version 0.21. I think we used 'kill -9'. The likely timing is during startup or checkpoint.
regards
macf

On Tue, Jun 28, 2011 at 11:03 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi Denny,
>
> Which version of Hadoop are you using, and when are you killing the
> NameNode? Are you using a unix signal (eg kill -9) or killing power to the
> whole machine?
>
> Thanks
> -Todd
>
> On Tue, Jun 28, 2011 at 2:11 AM, Denny Ye <denny...@gmail.com> wrote:
>
> > *Root cause*: Wrong FSImage format after the user killed the HDFS process.
> > The loader may read an invalid block count, perhaps 1 billion or more, and
> > an OutOfMemoryError happens before any EOFException.
> >
> > How can we verify the validity of the FSImage file?
> >
> > --regards
> > Denny Ye
> >
> > On Tue, Jun 28, 2011 at 4:44 PM, mac fang <mac.had...@gmail.com> wrote:
> >
> > > Hi Team,
> > >
> > > What we found when using Hadoop is that the FSImage often gets corrupted
> > > when we start/stop the Hadoop cluster. We think the reason is around the
> > > write to the output stream: the NameNode may be killed during
> > > saveNamespace, so the FSImage file is never completely written. Currently
> > > I see a previous.checkpoint folder; the logic of saveNamespace is like:
> > >
> > > 1. mv the current folder to the previous.checkpoint folder.
> > > 2. start to write the FSImage into the current folder.
> > >
> > > I think there might be a case where, if the FSImage is corrupted, the
> > > NameNode can NOT be started, but we do NOT get any EOFException, since we
> > > may hit an OutOfMemoryError when we read a wrong numBlocks and
> > > instantiate Block[] blocks = new Block[numBlocks] (actually, we face
> > > this issue).
> > >
> > > Any suggestion on it?
> > >
> > > thanks
> > > macf
> > >
> >
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
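
For illustration only (this is not the actual HDFS loading code; ImageSanityCheck, readBlocks and the simplified Block class are made-up stand-ins): one way to avoid the OutOfMemoryError Denny describes is to sanity-check the block count read from the image before allocating the array, so a truncated or corrupt file fails with a clear IOException instead of exhausting the heap.

    import java.io.DataInputStream;
    import java.io.IOException;

    // Hypothetical helper, not the real FSImage loader.
    public class ImageSanityCheck {

      // Illustrative upper bound; a real loader would derive a limit
      // from the remaining file length or from configuration.
      private static final int MAX_REASONABLE_BLOCKS = 1_000_000;

      static Block[] readBlocks(DataInputStream in) throws IOException {
        int numBlocks = in.readInt();
        // Reject implausible counts before allocating, so a corrupt image
        // fails fast with an IOException instead of an OutOfMemoryError.
        if (numBlocks < 0 || numBlocks > MAX_REASONABLE_BLOCKS) {
          throw new IOException("Corrupt image: implausible block count " + numBlocks);
        }
        Block[] blocks = new Block[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
          blocks[i] = Block.readFrom(in);
        }
        return blocks;
      }

      // Minimal stand-in for the real Block class.
      static class Block {
        long blockId, numBytes, generationStamp;
        static Block readFrom(DataInputStream in) throws IOException {
          Block b = new Block();
          b.blockId = in.readLong();
          b.numBytes = in.readLong();
          b.generationStamp = in.readLong();
          return b;
        }
      }
    }

A checksum written alongside the image and verified before loading would catch corruption even earlier, without relying on per-field plausibility checks.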
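On the saveNamespace side, the risk is that the old image is moved aside before the new one is completely on disk. A crash-safe sequence is to write the new image to a temporary file, sync it, and only then rename it over the old one. The sketch below is only an illustration of that idea (SafeImageSave and the file names are invented; the real HDFS code differs in detail):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Hypothetical illustration of a crash-safe save sequence.
    public class SafeImageSave {

      static void saveImage(byte[] imageBytes, Path currentDir) throws IOException {
        Path tmp = currentDir.resolve("fsimage.ckpt");  // write the new image here
        Path fin = currentDir.resolve("fsimage");       // last known-good image

        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
          out.write(imageBytes);
          out.flush();
          out.getFD().sync();  // force the bytes to disk before the rename
        }

        // Swap the new image in atomically. If the NameNode is killed at any
        // earlier point, the previous "fsimage" is still complete and readable.
        Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE,
            StandardCopyOption.REPLACE_EXISTING);
      }
    }

With that ordering, a kill -9 during startup or checkpoint can at worst leave behind a stale .ckpt file, never a half-written image in place of the good one.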