As far as I can see from the code pointed to, the default number of bytes to pack into a partition is 128MB, the same as the Parquet block size.
Daniel,
It seems you do have a need to modify the number of bytes you want to pack per partition. I am curious to know the scenario. Please share if you can.
Thanks,
Kabeer.
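
[A minimal sketch of overriding that 128MB default when building a session — not from the thread itself; it assumes Spark 2.x, and the input path is hypothetical:]

    import org.apache.spark.sql.SparkSession

    // Override the 128MB default so that small files are packed into
    // smaller (and therefore more numerous) partitions.
    // "/data/small-files" is a hypothetical directory of many small Parquet files.
    val spark = SparkSession.builder()
      .appName("maxPartitionBytes-example")
      .config("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024) // 32MB per partition
      .getOrCreate()

    val df = spark.read.parquet("/data/small-files")
    println(s"Partitions after read: ${df.rdd.getNumPartitions}")
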
On May 20 2017, at 4:54 pm, Takeshi Yamamuro <linguin....@gmail.com> wrote:
I think this document points to the logic here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418
This logic merges small files into a partition, and you can control that threshold via `spark.sql.files.maxPartitionBytes`.

// maropu

On Sat, May 20, 2017 at 8:15 AM, ayan guha <guha.a...@gmail.com> wrote:

I think, like all other read operations, it is driven by the input format used, and I think some variation of combine file input format is used by default. You can test it by forcing a particular input format that gives one file per split; then you should end up with the same number of partitions as your data files.

On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aakash.spark....@gmail.com> wrote:

Hey all,

A reply on this would be great!

Thanks,
A.B.

On 17-May-2017 1:43 AM, "Daniel Siegmann" <dsiegmann@securityscorecard.io> wrote:

When using spark.read on a large number of small files, these are automatically coalesced into fewer partitions. The only documentation I can find on this is in the Spark 2.0.0 release notes, where it simply says (http://spark.apache.org/releases/spark-release-2-0-0.html): "Automatic file coalescing for native data sources"

Can anyone point me to documentation explaining what triggers this feature, how it decides how many partitions to coalesce to, and what counts as a "native data source"? I couldn't find any mention of this feature in the SQL Programming Guide and Google was not helpful.

Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001

--
Best Regards,
Ayan Guha

--
Takeshi Yamamuro
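
[Along the lines of the test Ayan suggests above, a rough sketch of comparing partition counts under different thresholds — assuming Spark 2.x, an existing `spark` session, and a hypothetical directory of many small Parquet files:]

    // Read once with the default 128MB threshold, then again with a much
    // smaller one; the smaller threshold should yield more partitions,
    // since fewer small files fit into each.
    val inputDir = "/data/small-files" // hypothetical path
    val defaultParts = spark.read.parquet(inputDir).rdd.getNumPartitions

    spark.conf.set("spark.sql.files.maxPartitionBytes", 8L * 1024 * 1024) // 8MB
    val smallerParts = spark.read.parquet(inputDir).rdd.getNumPartitions

    println(s"128MB threshold: $defaultParts partitions, 8MB threshold: $smallerParts partitions")
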