Hello,

We are running HDFS on a 9-node Hadoop cluster, version 1.2.1, with the default HDFS block size.
We have noticed that the disks of the slaves are almost full. On the NameNode's status page (namenode:50070) the disks of the live nodes show as ~90% full, and DFS Used in the cluster summary is ~1 TB. However, hadoop dfs -dus / reports that the file system holds merely 38 GB.

The 38 GB figure looks correct, because we keep only a few Hive tables and Hadoop's /tmp (distributed cache and job outputs) in HDFS; all other data is cleaned up. I cross-checked this with hadoop dfs -ls. I also don't think there is internal fragmentation, because the files in our Hive tables are well-chopped into ~50 MB chunks.

Here are the last few lines of hadoop fsck / -files -blocks:

Status: HEALTHY
 Total size:                    38086441332 B
 Total dirs:                    232
 Total files:                   802
 Total blocks (validated):      796 (avg. block size 47847288 B)
 Minimally replicated blocks:   796 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       6 (0.75376886 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     3.0439699
 Corrupt blocks:                0
 Missing replicas:              6 (0.24762692 %)
 Number of data-nodes:          9
 Number of racks:               1
FSCK ended at Sun Apr 13 19:49:23 UTC 2014 in 135 milliseconds

My question is: why are the disks of the slaves getting full even though there are only a few files in DFS?
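In case it helps to pin this down, here is a quick sketch of the checks I can run next on one of the slaves to compare what HDFS accounts for against what is actually on disk. The data directory path below is only a placeholder; the real location is whatever dfs.data.dir in hdfs-site.xml points to on our nodes:

  # Per-node breakdown of DFS Used vs. Non DFS Used, as reported by the NameNode
  hadoop dfsadmin -report

  # Logical size of everything stored in HDFS (the 38 GB figure above)
  hadoop dfs -dus /

  # Raw bytes sitting in the DataNode's block storage on this slave
  # (/data/dfs/data is a placeholder; check dfs.data.dir in hdfs-site.xml)
  du -sh /data/dfs/data

  # Overall usage of the partition, to see what else is filling it
  df -h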