question about HadoopFsRelation

Koert Kuipers Fri, 23 Oct 2015 21:24:08 -0700

i noticed in the comments for HadoopFsRelation.buildScan it says:
  * @param inputFiles For a non-partitioned relation, it contains paths of
all data files in the
   *        relation. For a partitioned relation, it contains paths of all
data files in a single
   *        selected partition.


do i understand it correctly that it actually lists all the data files
(part files), not just data directories that contain the files?
if so,that sounds like trouble to me, because most implementations will use
this info to set the input paths for FileInputFormat. for example in
ParquetRelation:
if (inputFiles.nonEmpty) {
      FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
    }

if all the part files are listed this will make the jobConf huge and could
cause issues upon serialization and/or broadcasting.

it can also lead to other inefficiencies, for example spark-avro creates a
RDD for every input (part) file, which quickly leads to thousands of RDDs.

i think instead of files only the directories should be listed in the input
path?

question about HadoopFsRelation

Reply via email to