HDFS only allocates as much physical disk space as a block actually requires,
up to the block size configured for the file (plus a small amount of
checksum/metadata overhead).
So if you write a 4k file, the single block for that file will be around 4k.

If you write a 65M file, there will be two blocks, one of roughly 64M, and
one of roughly 1M.
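The arithmetic above can be sketched in a few lines of Python (the function
name and the 64M default are just for illustration):

```python
def hdfs_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return the sizes of the blocks a file of file_size bytes occupies.

    Only the last block may be smaller than block_size; no space is
    wasted for small files.
    """
    if file_size == 0:
        return []
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

# A 4k file fits in a single ~4k block:
print(hdfs_blocks(4 * 1024))             # [4096]
# A 65M file spans two blocks: one of 64M and one of 1M:
print(hdfs_blocks(65 * 1024 * 1024))     # [67108864, 1048576]
```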

You can verify this yourself by running, on a datanode:

  find ${dfs.data.dir} -iname 'blk*' -type f -ls

Note: the above command will only work as expected if a single directory is
defined for dfs block storage, and if ${dfs.data.dir} is replaced with the
effective value of the configuration parameter dfs.data.dir from your
hadoop configuration.
dfs.data.dir is commonly defined as ${hadoop.tmp.dir}/dfs/data.
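If you'd rather not run a streaming job to find the effective value, you can
mimic Hadoop's ${...} expansion by parsing the config XML yourself. A rough
sketch follows; the config file paths and the sample property values are
assumptions for illustration (in real Hadoop, properties like user.name come
from Java system properties, not the config files):

```python
import re
import xml.etree.ElementTree as ET

def load_props(*config_files):
    """Collect <name>/<value> pairs from Hadoop-style XML config files."""
    props = {}
    for path in config_files:
        for prop in ET.parse(path).getroot().iter("property"):
            props[prop.findtext("name")] = prop.findtext("value", "")
    return props

def resolve(props, key):
    """Expand ${other.key} references, roughly as Hadoop's Configuration does."""
    value = props.get(key, "")
    for _ in range(10):  # bounded, in case of reference cycles
        expanded = re.sub(r"\$\{([^}]+)\}",
                          lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value

# e.g. props = load_props("conf/core-site.xml", "conf/hdfs-site.xml")
props = {"hadoop.tmp.dir": "/tmp/hadoop-${user.name}",
         "user.name": "hdfs",
         "dfs.data.dir": "${hadoop.tmp.dir}/dfs/data"}
print(resolve(props, "dfs.data.dir"))    # /tmp/hadoop-hdfs/dfs/data
```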

The following rather insane bash shell command will print out the value of
dfs.data.dir on the local machine.
It must be run from the hadoop installation directory, and it creates two
temporary files, /tmp/f.PID.input and /tmp/f.PID.output.
This little ditty relies on the fact that the configuration parameters are
pushed into the process environment for streaming jobs.

Streaming Rocks!

B=/tmp/f.$$
date > ${B}.input
rm -rf ${B}.output
bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -D fs.default.name=file:/// \
    -jt local -input ${B}.input -output ${B}.output \
    -numReduceTasks 0 -mapper env
# streaming rewrites '.' to '_' in env var names; grep's '.' matches either
grep dfs.data.dir ${B}.output/part-00000
rm ${B}.input
rm -rf ${B}.output

On Thu, Apr 2, 2009 at 6:44 PM, javateck javateck <[email protected]>wrote:

>  Can someone tell whether a file will occupy one or more blocks? For
> example, the default block size is 64MB, and if I save a 4k file to HDFS,
> will the 4K file occupy the whole 64MB block alone? So in this case, do I
> need to configure the block size to 10k if most of my files are less than
> 10K?
>
> thanks,
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
