I think this points to the logic here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418
This logic merges small files into a single partition, and you can control that threshold via `spark.sql.files.maxPartitionBytes`.

// maropu

On Sat, May 20, 2017 at 8:15 AM, ayan guha <guha.a...@gmail.com> wrote:

> I think, like all other read operations, it is driven by the input format
> used, and I think some variation of combine file input format is used by
> default. You can test it by forcing a particular input format that gets one
> file per split; then you should end up with the same number of partitions
> as your data files.
>
> On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aakash.spark....@gmail.com>
> wrote:
>
>> Hey all,
>>
>> A reply on this would be great!
>>
>> Thanks,
>> A.B.
>>
>> On 17-May-2017 1:43 AM, "Daniel Siegmann" <dsiegm...@securityscorecard.io>
>> wrote:
>>
>>> When using spark.read on a large number of small files, these are
>>> automatically coalesced into fewer partitions. The only documentation I
>>> can find on this is in the Spark 2.0.0 release notes, where it simply
>>> says (http://spark.apache.org/releases/spark-release-2-0-0.html):
>>>
>>> "Automatic file coalescing for native data sources"
>>>
>>> Can anyone point me to documentation explaining what triggers this
>>> feature, how it decides how many partitions to coalesce to, and what
>>> counts as a "native data source"? I couldn't find any mention of this
>>> feature in the SQL Programming Guide, and Google was not helpful.
>>>
>>> --
>>> Daniel Siegmann
>>> Senior Software Engineer
>>> *SecurityScorecard Inc.*
>>> 214 W 29th Street, 5th Floor
>>> New York, NY 10001
>>>
>
> --
> Best Regards,
> Ayan Guha

--
---
Takeshi Yamamuro
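A minimal sketch of how the threshold mentioned above can be tuned and its effect observed, assuming Spark 2.x with the Scala API; the input path and the 32 MB value are illustrative placeholders, not values from the thread:

```scala
// Sketch only: demonstrates spark.sql.files.maxPartitionBytes and how to
// inspect the resulting number of read partitions.
import org.apache.spark.sql.SparkSession

object FileCoalescingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-coalescing-sketch")
      // Files are packed into each read partition up to roughly this many
      // bytes (default 128 MB); lowering it yields more, smaller partitions.
      .config("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)
      .getOrCreate()

    // Placeholder path standing in for a directory of many small files.
    val df = spark.read.parquet("/data/many-small-files")

    // With many small files this is usually far below the file count,
    // because the scan packs small files together into each partition.
    println(s"Partitions after read: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```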