thanks i will read up on that

On Sat, Oct 24, 2015 at 12:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> The code below was introduced by SPARK-7673 / PR #6225
>
> See item #1 in the description of the PR.
>
> Cheers
>
> On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> the code that seems to flatMap directories to all the files inside is in
>> the private HadoopFsRelation.buildScan:
>>
>>     // First assumes `input` is a directory path, and tries to get all files contained in it.
>>     fileStatusCache.leafDirToChildrenFiles.getOrElse(
>>       path,
>>       // Otherwise, `input` might be a file path
>>       fileStatusCache.leafFiles.get(path).toArray
>>
>> does anyone know why we want to get all the files when all hadoop
>> inputformats can handle directories (and automatically get the files
>> inside), and the recommended way of doing this in map-red is to pass in
>> directories (to avoid the overhead and very large serialized jobconfs)?
>>
>>
>> On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i noticed in the comments for HadoopFsRelation.buildScan it says:
>>>    * @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
>>>    *        relation. For a partitioned relation, it contains paths of all data files in a single
>>>    *        selected partition.
>>>
>>> do i understand it correctly that it actually lists all the data files
>>> (part files), not just data directories that contain the files?
>>> if so,that sounds like trouble to me, because most implementations will
>>> use this info to set the input paths for FileInputFormat. for example in
>>> ParquetRelation:
>>>     if (inputFiles.nonEmpty) {
>>>       FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>>>     }
>>>
>>> if all the part files are listed this will make the jobConf huge and
>>> could cause issues upon serialization and/or broadcasting.
>>>
>>> it can also lead to other inefficiencies, for example spark-avro creates
>>> a RDD for every input (part) file, which quickly leads to thousands of RDDs.
>>>
>>> i think instead of files only the directories should be listed in the
>>> input path?
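
to make the jobConf size concern above concrete, here is a rough sketch, not Spark or Hadoop code. it assumes a hypothetical table layout (the `hdfs://nn/data/table/day=...` paths, the `InputPathSize` object and its helper names are all made up for illustration) and uses plain strings in place of Hadoop `Path` objects. FileInputFormat keeps its input paths as a single comma-separated string in the job configuration, so the length of that string is used here as a stand-in for what ends up being serialized:

```scala
// Sketch only: plain strings stand in for Hadoop Path objects, and the
// directory layout below is hypothetical.
object InputPathSize {
  // Hypothetical layout: `dirs` partition directories, each holding
  // `filesPerDir` part files.
  def partFiles(dirs: Int, filesPerDir: Int): Seq[String] =
    for {
      d <- 0 until dirs
      f <- 0 until filesPerDir
    } yield f"hdfs://nn/data/table/day=$d%03d/part-r-$f%05d.parquet"

  // Collapse file paths to their distinct parent directories.
  def parentDirs(files: Seq[String]): Seq[String] =
    files.map(p => p.substring(0, p.lastIndexOf('/'))).distinct

  // Length of the comma-separated path list, as an approximation of the
  // input-path entry stored in the job configuration.
  def confValueLength(paths: Seq[String]): Int = paths.mkString(",").length

  def main(args: Array[String]): Unit = {
    val files = partFiles(dirs = 100, filesPerDir = 200)
    val dirs = parentDirs(files)
    println(s"files: ${files.size} paths, ${confValueLength(files)} chars")
    println(s"dirs:  ${dirs.size} paths, ${confValueLength(dirs)} chars")
  }
}
```

the directory list grows with the number of partitions only, while the file list grows with partitions times part files per partition, which is the difference the question above is about.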