In the past, "block size" and "size of block N" were completely separate concepts in HDFS.
The former was often referred to as "default block size" or "preferred block size" or some such thing. Basically it was the point at which we'd call it a day and move on to the next block, whenever any block got to that point. "default block size" was pretty much always 128MB or 256MB in Real Clusters (although sometimes Apache Parquet would set it as high as 1GB). We got tired of people configuring ridiculously small block sizes by accident, so HDFS-4305 added dfs.namenode.fs-limits.min-block-size.

In the old world, the only block which could be smaller than the "default block size" was the final block of a file. MR used the default block size as a guide for partitioning, and we sort of ignored the fact that the last block could be less than that. Now that HDFS-3689 has been added to branch-2, it is no longer true that all the blocks are the same size except the last one. The ramifications of this are still to be determined. dfs.blocksize will still be an upper bound on block size, but it will no longer be a lower bound.

To answer your specific question: in HDFS, FileStatus#getBlockSize returns the "preferred block size," not the size of any specific block. So it's entirely possible that none of the blocks in the file actually have the size returned by FileStatus#getBlockSize.

The relevant code is here in FSDirectory.java:

> if (node.isFile()) {
>   final INodeFile fileNode = node.asFile();
>   size = fileNode.computeFileSize(snapshot);
>   replication = fileNode.getFileReplication(snapshot);
>   blocksize = fileNode.getPreferredBlockSize();
>   isEncrypted = (feInfo != null) ||
>       (isRawPath && isInAnEZ(INodesInPath.fromINode(node)));
> } else {
>   isEncrypted = isInAnEZ(INodesInPath.fromINode(node));
> }
> ...
> return new HdfsFileStatus(
>     ...
>     blocksize,
>     ...
> );

Probably s3 and the rest of the alternative FS gang should just return the value of some configuration variable (possibly fs.local.block.size or dfs.blocksize?). Even though "preferred block size" is a completely bogus concept in s3, MapReduce and other frameworks still use it to calculate splits. Since s3 never does local reads anyway, there is no reason to prefer any block size over any other, except in terms of dividing up the work.

regards,
Colin

On Mon, Feb 16, 2015 at 9:44 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> HADOOP-11601 tightens up the filesystem spec by saying "if len(file) > 0,
> getFileStatus().getBlockSize() > 0"
>
> this is to stop filesystems (most recently s3a) returning 0 as a block size,
> which then kills any analytics work that tries to partition the workload by
> blocksize.
>
> I'm currently changing the markdown text to say
>
>   MUST be > 0 for a file of size > 0
>   MAY be 0 for a file of size == 0
>
> + the relevant tests to check this.
>
> There's one thing I do want to understand from HDFS first: what about small
> files? That is: what does HDFS return as a blocksize if a file is smaller
> than its block size?
>
> -Steve
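
To see the distinction Colin describes in practice, here is a minimal Java sketch against the public Hadoop FileSystem API. The class name BlockSizeProbe and the command-line handling are illustrative choices, not anything from the thread. For a small file, getBlockSize() reports the preferred block size recorded for the file, while getFileBlockLocations() reports the actual (possibly smaller) block lengths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeProbe {
        public static void main(String[] args) throws Exception {
            // Path to inspect; a file smaller than dfs.blocksize makes the point clearly.
            Path path = new Path(args[0]);

            Configuration conf = new Configuration();
            FileSystem fs = path.getFileSystem(conf);

            FileStatus status = fs.getFileStatus(path);
            // The "preferred block size" stored for the file -- not the length of any block.
            System.out.println("getBlockSize() = " + status.getBlockSize());
            System.out.println("getLen()       = " + status.getLen());

            // Actual block lengths; every one of them may be smaller than getBlockSize().
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("block @" + b.getOffset() + " length=" + b.getLength());
            }
        }
    }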
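
And a sketch of the fallback Colin suggests for s3 and the other alternative filesystems: report a configured, non-zero preferred block size rather than 0, so that split calculation keeps working. The helper class, the fallback to fs.local.block.size, and the 32MB default are assumptions for illustration only; this is not the actual S3A implementation:

    import org.apache.hadoop.conf.Configuration;

    // Illustrative helper: derive a non-zero "preferred block size" from configuration.
    public class ConfiguredBlockSize {
        // Arbitrary illustrative default: 32MB.
        private static final long DEFAULT_BLOCK_SIZE = 32L * 1024 * 1024;

        // Return the configured block size, falling back to the default above.
        public static long preferredBlockSize(Configuration conf) {
            return conf.getLong("fs.local.block.size", DEFAULT_BLOCK_SIZE);
        }
    }

A filesystem taking this approach would hand that value back from its FileStatus objects, purely as a hint for dividing up work, since (as noted above) no block size is actually "preferred" by an object store.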