*Root cause*: the FSImage is left in an invalid format when the user kills the HDFS process mid-write. The loader may then read an invalid block count (one billion or more), so an OutOfMemoryError is thrown on allocation before any EOFException can surface.
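For illustration, here is a minimal sketch of that failure mode (the method and bound names are hypothetical, not the actual HDFS FSImage loader): a garbage length field read from a truncated image drives the array allocation, which fails with OutOfMemoryError before any read ever reaches end-of-file. Bounding the count before allocating at least turns the crash into a diagnosable IOException.

import java.io.DataInputStream;
import java.io.IOException;

public class ImageLoadSketch {
    // Hypothetical upper bound on a plausible block count; not an HDFS constant.
    private static final int MAX_REASONABLE_BLOCKS = 100_000_000;

    // Sketch of the failure mode: a truncated image yields a garbage length
    // field, and the allocation OOMs before any later read can hit EOF.
    static long[] readBlockIds(DataInputStream in) throws IOException {
        int numBlocks = in.readInt();
        // Without this sanity check, numBlocks may be ~1 billion and the
        // array allocation below throws OutOfMemoryError immediately.
        if (numBlocks < 0 || numBlocks > MAX_REASONABLE_BLOCKS) {
            throw new IOException("Corrupt image: implausible block count " + numBlocks);
        }
        long[] blockIds = new long[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            // Only here would a truncated file fail with EOFException.
            blockIds[i] = in.readLong();
        }
        return blockIds;
    }
}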
How can we verify the validity of the FSImage file?

--regards
Denny Ye

On Tue, Jun 28, 2011 at 4:44 PM, mac fang <mac.had...@gmail.com> wrote:
> Hi, Team,
>
> What we have found when using Hadoop is that the FSImage often corrupts
> when we start/stop the Hadoop cluster. We think the cause lies in the
> write to the output stream: the NameNode may be killed during
> saveNamespace, so the FSImage file is never completely written. I also
> see a previous.checkpoint folder; the logic of saveNamespace is:
>
> 1. Move the current folder to the previous.checkpoint folder.
> 2. Start writing the FSImage into the current folder.
>
> If the FSImage is corrupted, the NameNode cannot be started, but we do
> not get any EOFException: we may instead hit an OutOfMemoryError when we
> read a wrong numBlocks and instantiate Block[] blocks = new
> Block[numBlocks] (we actually faced this issue).
>
> Any suggestion?
>
> thanks
> macf
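One remedy, sketched below under stated assumptions (the sidecar ".md5" file name and the digest choice are mine, not the HDFS implementation), is to record a checksum alongside the image once saveNamespace completes and to verify it before loading, so a half-written image is rejected outright instead of being parsed into garbage:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ImageChecksum {
    // Compute an MD5 digest of the whole image file as a hex string.
    static String digestOf(Path image) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] bytes = Files.readAllBytes(image); // fine for a sketch; stream for large files
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest(bytes)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // At save time: after the image is fully written, record its digest
    // in a sidecar file next to it.
    static void sealImage(Path image) throws IOException, NoSuchAlgorithmException {
        Path sidecar = image.resolveSibling(image.getFileName() + ".md5");
        Files.writeString(sidecar, digestOf(image));
    }

    // At load time: refuse to load an image whose digest does not match,
    // so a partially written file fails fast with a clear error.
    static void verifyImage(Path image) throws IOException, NoSuchAlgorithmException {
        Path sidecar = image.resolveSibling(image.getFileName() + ".md5");
        String expected = Files.readString(sidecar).trim();
        if (!expected.equals(digestOf(image))) {
            throw new IOException("FSImage checksum mismatch; refusing to load " + image);
        }
    }
}

Writing the sidecar only after the image stream has been flushed and closed means an interrupted saveNamespace leaves either no digest or a stale one, and verification fails cleanly in both cases.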