Re: question about HadoopFsRelation

Koert Kuipers Sun, 25 Oct 2015 08:15:06 -0700

thanks i will read up on that

On Sat, Oct 24, 2015 at 12:53 PM, Ted Yu <yuzhih...@gmail.com> wrote:


> The code below was introduced by SPARK-7673 / PR #6225
>
> See item #1 in the description of the PR.
>
> Cheers
>
> On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> the code that seems to flatMap directories to all the files inside is in
>> the private HadoopFsRelation.buildScan:
>>
>>       // First assumes `input` is a directory path, and tries to get all files contained in it.
>>       fileStatusCache.leafDirToChildrenFiles.getOrElse(
>>         path,
>>         // Otherwise, `input` might be a file path
>>         fileStatusCache.leafFiles.get(path).toArray)
>>
>> does anyone know why we want to get all the files when all hadoop
>> inputformats can handle directories (and automatically get the files
>> inside), and the recommended way of doing this in map-red is to pass in
>> directories (to avoid the overhead and very large serialized jobconfs)?
>>
>>
>> On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> i noticed in the comments for HadoopFsRelation.buildScan it says:
>>>    * @param inputFiles For a non-partitioned relation, it contains paths of all data files in the
>>>    *        relation. For a partitioned relation, it contains paths of all data files in a single
>>>    *        selected partition.
>>>
>>> do i understand it correctly that it actually lists all the data files
>>> (part files), not just data directories that contain the files?
>>> if so, that sounds like trouble to me, because most implementations will
>>> use this info to set the input paths for FileInputFormat. for example in
>>> ParquetRelation:
>>> if (inputFiles.nonEmpty) {
>>>       FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>>>     }
>>>
>>> if all the part files are listed this will make the jobConf huge and
>>> could cause issues upon serialization and/or broadcasting.
>>>
>>> it can also lead to other inefficiencies, for example spark-avro creates
>>> an RDD for every input (part) file, which quickly leads to thousands of RDDs.
>>>
>>> i think instead of files only the directories should be listed in the
>>> input path?
>>>
>>
>>
>
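
To illustrate the suggestion in the first message (list directories rather than part files), here is a minimal sketch of collapsing file paths to their parent directories before they would be handed to FileInputFormat. The object InputPathSketch, its methods, and the hdfs://nn/data/events path are hypothetical and not part of Spark or Hadoop. Paths are treated as plain strings so the sketch runs without Hadoop on the classpath, and the comma-joined string only approximates how input paths end up in the job configuration.

```scala
// Hypothetical helper, not Spark code: shows how much smaller the input-path
// list becomes when directories are listed instead of individual part files.
object InputPathSketch {

  /** Distinct parent directories of the given file paths, in first-seen order. */
  def parentDirs(files: Seq[String]): Seq[String] =
    files.map { f =>
      val i = f.lastIndexOf('/')
      if (i > 0) f.substring(0, i) // "/a/b/part-0" -> "/a/b"
      else if (i == 0) "/"         // "/part-0"     -> "/"
      else f                       // no separator: leave the path unchanged
    }.distinct

  /** Comma-joined paths, an approximation of the value kept in the job conf. */
  def asInputPathProperty(paths: Seq[String]): String = paths.mkString(",")

  def main(args: Array[String]): Unit = {
    // 1000 part files that all live in one (made-up) directory
    val files = (0 until 1000).map(i => s"hdfs://nn/data/events/part-$i")
    val dirs  = parentDirs(files)
    println(s"files: ${files.size} paths, ${asInputPathProperty(files).length} chars")
    println(s"dirs:  ${dirs.size} paths, ${asInputPathProperty(dirs).length} chars")
  }
}
```

In this sketch the 1000 part files collapse to a single directory entry. Whether that is safe in a real relation depends on the directory holding only files that belong to the selected partition, which the sketch does not check.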
