zhu created HDFS-16019:
--------------------------

             Summary: HDFS: Inode CheckPoint
                 Key: HDFS-16019
                 URL: https://issues.apache.org/jira/browse/HDFS-16019
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
    Affects Versions: 3.3.1
            Reporter: zhu
            Assignee: zhu
*Background*

The OIV image analysis tool brings us many benefits, such as file size distribution, hot/cold data analysis, and abnormal-growth directory analysis. But in my opinion it is too slow, especially for a big image. Since Hadoop 2.3 the image format has changed: the OIV tool must load the entire image into memory before it can output the inode information in text format. For a large image this process takes a long time, consumes considerable resources, and requires a machine with a large amount of memory for the analysis.

HDFS does provide the dfs.namenode.legacy-oiv-image.dir parameter to write the old-format image at checkpoint time. Parsing the old image does not require many resources, but we still need to parse the image again through the hdfs oiv_legacy command to get the inode text, which is also time-consuming.

*Solution*

We can have the standby node periodically checkpoint the inodes and serialize them in text form. For output, different file systems can be used according to configuration, such as the local file system or the HDFS file system. The advantage of writing to HDFS is that we can then analyze the inodes directly through Spark/Hive.

I think the block information attached to an inode may not be of much use here; the file size and the replication factor are more useful to us. In addition, sequential output of the inodes is not necessary, so we can speed up the inode checkpoint by partitioning the serialized inodes across different output files: a producer thread puts inodes on a queue, and multiple consumer threads drain the queue and write to the per-partition files. The output files can also be compressed to reduce disk IO.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
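The producer/consumer partitioning described in the Solution section could be sketched roughly as below. This is a minimal illustration, not existing HDFS code: InodeRecord, the partition count, the queue capacity, and the tab-separated gzip layout are all assumptions made for the example; a producer feeds a shared queue and each consumer thread owns one compressed part file.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.GZIPOutputStream;

public class InodeTextDump {
    // Illustrative stand-in for the inode fields the proposal keeps:
    // path, file length, replication factor (block list deliberately omitted).
    record InodeRecord(String path, long length, short replication) {}

    // Sentinel record used to tell consumers to shut down (one pill per consumer).
    private static final InodeRecord POISON = new InodeRecord("", -1L, (short) -1);

    /** Drain `inodes` into `partitions` gzip part files under `outDir`. */
    static void dump(Iterable<InodeRecord> inodes, Path outDir, int partitions)
            throws Exception {
        BlockingQueue<InodeRecord> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(partitions);

        for (int p = 0; p < partitions; p++) {
            Path part = outDir.resolve(String.format("part-%04d.txt.gz", p));
            pool.submit(() -> {
                // Each consumer owns one compressed part file; since ordering
                // does not matter, any consumer may take any inode.
                try (Writer w = new BufferedWriter(new OutputStreamWriter(
                        new GZIPOutputStream(Files.newOutputStream(part))))) {
                    while (true) {
                        InodeRecord r = queue.take();
                        if (r == POISON) break;
                        w.write(r.path() + "\t" + r.length()
                                + "\t" + r.replication() + "\n");
                    }
                }
                return null;
            });
        }

        // Single producer thread: enqueue inodes, then one poison pill per consumer.
        for (InodeRecord r : inodes) queue.put(r);
        for (int p = 0; p < partitions; p++) queue.put(POISON);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("inode-dump");
        dump(List.of(
                new InodeRecord("/user/a/file1", 128L << 20, (short) 3),
                new InodeRecord("/user/b/file2", 64L << 20, (short) 2)),
             out, 2);
        try (var files = Files.list(out)) {
            System.out.println(files.count()); // prints 2
        }
    }
}
```

Because each consumer writes only its own file, no coordination is needed between writers, and the unordered output matches the observation above that sequential inode output is unnecessary.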