[ https://issues.apache.org/jira/browse/HDFS-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangyi Zhu resolved HDFS-16019.
--------------------------------
    Resolution: Later

> HDFS: Inode CheckPoint
> -----------------------
>
>                 Key: HDFS-16019
>                 URL: https://issues.apache.org/jira/browse/HDFS-16019
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namanode
>    Affects Versions: 3.3.1
>            Reporter: Xiangyi Zhu
>            Assignee: Xiangyi Zhu
>            Priority: Major
>
> *background*
> The OIV IMAGE analysis tool has brought us many benefits, such as analysis of
> file size distribution, cold and hot data, and abnormally growing directories.
> But in my opinion it is too slow, especially on a large IMAGE.
> After Hadoop 2.3, the format of the IMAGE changed. The OIV tool now has to
> load the entire IMAGE into memory before it can output the Inode information
> in text format. For a large IMAGE, this takes a long time, consumes more
> resources, and requires a machine with a large amount of memory.
> HDFS does provide the dfs.namenode.legacy-oiv-image.dir parameter to get the
> old version of the IMAGE through CheckPoint. Parsing the old IMAGE does not
> require many resources, but we still need to parse it again with the
> hdfs oiv_legacy command to get the text information of the Inodes, which is
> relatively time-consuming.
>
> *Solution*
> We can have the standby node periodically checkpoint the Inodes and serialize
> them as text. The output can go to different FileSystems according to the
> configuration, such as the local file system or the HDFS file system. The
> advantage of writing to HDFS is that we can analyze the Inodes directly with
> Spark/Hive. I think the block information corresponding to an Inode may not
> be of much use. The file size and the number of copies are more useful to us.
> In addition, the Inodes do not need to be output sequentially. We can speed
> up the Inode CheckPoint by partitioning the serialized Inodes and writing
> each partition to a different file.
> Use a producer thread to put Inodes into a queue, and use multiple consumer
> threads to drain the queue and write to different partition files. The
> output files can also be compressed to reduce disk IO.