In the past, "block size" and "size of block N" were completely separate concepts in HDFS.
The former was often referred to as "default block size" or "preferred block size" or some such thing. Basically it was the point at which we'd call it a day and move on to the next block, whenever any block got to that point. "default block size" was pretty much always 128MB or 256MB in Real Clusters (although sometimes Apache Parquet would set it as high as 1GB). We got tired of people configuring ridiculously small block sizes by accident, so HDFS-4305 added dfs.namenode.fs-limits.min-block-size.

In the old world, the only block which could be smaller than the "default block size" was the final block of a file. MR used the default block size as a guide for partitioning, and we sort of ignored the fact that the last block could be less than that. Now that HDFS-3689 has been added to branch-2, it is no longer true that all the blocks are the same size except the last one. The ramifications of this are still to be determined. dfs.blocksize will still be an upper bound on block size, but it will no longer be a lower bound.

To answer your specific question: in HDFS, FileStatus#getBlockSize returns the "preferred block size," not the size of any specific block. So it's entirely possible that none of the blocks in the file actually have the size returned by FileStatus#getBlockSize.

The relevant code is here in FSDirectory.java:

> if (node.isFile()) {
>   final INodeFile fileNode = node.asFile();
>   size = fileNode.computeFileSize(snapshot);
>   replication = fileNode.getFileReplication(snapshot);
>   blocksize = fileNode.getPreferredBlockSize();
>   isEncrypted = (feInfo != null) ||
>       (isRawPath && isInAnEZ(INodesInPath.fromINode(node)));
> } else {
>   isEncrypted = isInAnEZ(INodesInPath.fromINode(node));
> }
> ...
> return new HdfsFileStatus(
>     ...
>     blocksize,
>     ...
> );

Probably s3 and the rest of the alternative FS gang should just return the value of some configuration variable (possibly fs.local.block.size or dfs.blocksize?). Even though "preferred block size" is a completely bogus concept in s3, MapReduce and other frameworks still use it to calculate splits. Since s3 never does local reads anyway, there is no reason to prefer any block size over any other, except in terms of dividing up the work.

regards,
Colin

On Mon, Feb 16, 2015 at 9:44 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> HADOOP-11601 tightens up the filesystem spec by saying "if len(file) > 0,
> getFileStatus().getBlockSize() > 0"
>
> this is to stop filesystems (most recently s3a) returning 0 as a block size,
> which then kills any analytics work that tries to partition the workload by
> blocksize.
>
> I'm currently changing the markdown text to say
>
>   MUST be > 0 for a file of size > 0
>   MAY be 0 for a file of size == 0
>
> + the relevant tests to check this.
>
> There's one thing I do want to understand from HDFS first: what about small
> files? That is: what does HDFS return as a blocksize if a file is smaller
> than its block size?
>
> -Steve
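
To see the distinction Colin describes in practice, here is a minimal Java sketch against the public Hadoop FileSystem API. The class name BlockSizeProbe and the command-line handling are illustrative choices, not anything from the thread. For a small file, getBlockSize() reports the preferred block size recorded for the file, while getFileBlockLocations() reports the actual (possibly smaller) block lengths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeProbe {
        public static void main(String[] args) throws Exception {
            // Path to inspect; a file smaller than dfs.blocksize makes the point clearly.
            Path path = new Path(args[0]);

            Configuration conf = new Configuration();
            FileSystem fs = path.getFileSystem(conf);

            FileStatus status = fs.getFileStatus(path);
            // The "preferred block size" stored for the file -- not the length of any block.
            System.out.println("getBlockSize() = " + status.getBlockSize());
            System.out.println("getLen()       = " + status.getLen());

            // Actual block lengths; every one of them may be smaller than getBlockSize().
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("block @" + b.getOffset() + " length=" + b.getLength());
            }
        }
    }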
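
And a sketch of the fallback Colin suggests for s3 and the other alternative filesystems: report a configured, non-zero preferred block size rather than 0, so that split calculation keeps working. The helper class, the fallback to fs.local.block.size, and the 32MB default are assumptions for illustration only; this is not the actual S3A implementation:

    import org.apache.hadoop.conf.Configuration;

    // Illustrative helper: derive a non-zero "preferred block size" from configuration.
    public class ConfiguredBlockSize {
        // Arbitrary illustrative default: 32MB.
        private static final long DEFAULT_BLOCK_SIZE = 32L * 1024 * 1024;

        // Return the configured block size, falling back to the default above.
        public static long preferredBlockSize(Configuration conf) {
            return conf.getLong("fs.local.block.size", DEFAULT_BLOCK_SIZE);
        }
    }

A filesystem taking this approach would hand that value back from its FileStatus objects, purely as a hint for dividing up work, since (as noted above) no block size is actually "preferred" by an object store.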