I think this points to the logic here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418
This logic merges small files into a single partition, and you can control that threshold via `spark.sql.files.maxPartitionBytes`.

// maropu

On Sat, May 20, 2017 at 8:15 AM, ayan guha <guha.a...@gmail.com> wrote:

> I think, like all other read operations, it is driven by the input format
> used, and I think some variation of combine file input format is used by
> default. You can test it by forcing a particular input format that gets one
> file per split; then you should end up with the same number of partitions
> as your data files.
>
> On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aakash.spark....@gmail.com>
> wrote:
>
>> Hey all,
>>
>> A reply on this would be great!
>>
>> Thanks,
>> A.B.
>>
>> On 17-May-2017 1:43 AM, "Daniel Siegmann" <dsiegm...@securityscorecard.io>
>> wrote:
>>
>>> When using spark.read on a large number of small files, these are
>>> automatically coalesced into fewer partitions. The only documentation I
>>> can find on this is in the Spark 2.0.0 release notes, where it simply
>>> says (http://spark.apache.org/releases/spark-release-2-0-0.html):
>>>
>>> "Automatic file coalescing for native data sources"
>>>
>>> Can anyone point me to documentation explaining what triggers this
>>> feature, how it decides how many partitions to coalesce to, and what
>>> counts as a "native data source"? I couldn't find any mention of this
>>> feature in the SQL Programming Guide, and Google was not helpful.
>>>
>>> --
>>> Daniel Siegmann
>>> Senior Software Engineer
>>> *SecurityScorecard Inc.*
>>> 214 W 29th Street, 5th Floor
>>> New York, NY 10001
>>>
>
> --
> Best Regards,
> Ayan Guha

--
---
Takeshi Yamamuro
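A minimal sketch of how the threshold mentioned above can be tuned and its effect observed, assuming Spark 2.x with the Scala API; the input path and the 32 MB value are illustrative placeholders, not values from the thread:

```scala
// Sketch only: demonstrates spark.sql.files.maxPartitionBytes and how to
// inspect the resulting number of read partitions.
import org.apache.spark.sql.SparkSession

object FileCoalescingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-coalescing-sketch")
      // Files are packed into each read partition up to roughly this many
      // bytes (default 128 MB); lowering it yields more, smaller partitions.
      .config("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)
      .getOrCreate()

    // Placeholder path standing in for a directory of many small files.
    val df = spark.read.parquet("/data/many-small-files")

    // With many small files this is usually far below the file count,
    // because the scan packs small files together into each partition.
    println(s"Partitions after read: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```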