i noticed in the comments for HadoopFsRelation.buildScan it says: * @param inputFiles For a non-partitioned relation, it contains paths of all data files in the * relation. For a partitioned relation, it contains paths of all data files in a single * selected partition.
do i understand it correctly that it actually lists all the data files (part files), not just data directories that contain the files? if so,that sounds like trouble to me, because most implementations will use this info to set the input paths for FileInputFormat. for example in ParquetRelation: if (inputFiles.nonEmpty) { FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*) } if all the part files are listed this will make the jobConf huge and could cause issues upon serialization and/or broadcasting. it can also lead to other inefficiencies, for example spark-avro creates a RDD for every input (part) file, which quickly leads to thousands of RDDs. i think instead of files only the directories should be listed in the input path?