Thank you Hao; that was a fantastic response. I have raised SPARK-9782 for this.
I also would love to have dynamic partitioning. I mentioned it in the Jira.

On 12 Aug 2015 02:19, "Cheng, Hao" <hao.ch...@intel.com> wrote:

> That's a good question. We don't support reading small files in a single
> partition yet, but it's definitely an issue we need to optimize. Do you
> mind creating a JIRA issue for this? Hopefully we can merge that into the
> 1.6 release.
>
> 200 is the default partition number for parallel tasks after the data
> shuffle, and we have to change that value according to the file size,
> cluster size, etc.
>
> Ideally, this number would be set dynamically and automatically; however,
> Spark SQL doesn't support a complex cost-based model yet, particularly
> for multi-stage jobs. (https://issues.apache.org/jira/browse/SPARK-4630)
>
> Hao
>
> -----Original Message-----
> From: Al M [mailto:alasdair.mcbr...@gmail.com]
> Sent: Tuesday, August 11, 2015 11:31 PM
> To: user@spark.apache.org
> Subject: Spark DataFrames uses too many partition
>
> I am using DataFrames with Spark 1.4.1. I really like DataFrames, but the
> partitioning makes no sense to me.
>
> I am loading lots of very small files and joining them together. Every
> file is loaded by Spark with just one partition. Each time I join two
> small files, the partition count increases to 200. This makes my
> application take 10x as long as if I coalesce everything to 1 partition
> after each join.
>
> With normal RDDs it would not expand the partitions out to 200 after
> joining two files with one partition each. It would either keep it at one
> or expand it to two.
>
> Why do DataFrames expand the partitions out so much?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
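
In case it helps anyone else who finds this thread: until something like SPARK-4630 / SPARK-9782 lands, the setting Hao refers to can be lowered by hand. Below is only a rough sketch in Scala against the Spark 1.4 API; the app name, input paths and the "id" join column are placeholders I made up, not anything from the thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Basic 1.4-era setup; launch with spark-submit so the master is supplied.
    val sc = new SparkContext(new SparkConf().setAppName("small-file-joins"))
    val sqlContext = new SQLContext(sc)

    // Spark SQL uses spark.sql.shuffle.partitions (default 200) for every
    // post-shuffle stage, including joins. Shrink it to match the tiny inputs.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")

    val left  = sqlContext.read.parquet("/data/small_left")    // placeholder path
    val right = sqlContext.read.parquet("/data/small_right")   // placeholder path

    // The join's shuffle now produces 2 partitions instead of 200.
    val joined = left.join(right, "id")                        // placeholder join column

The right number obviously depends on the data and cluster size, which is exactly why Hao would like it chosen automatically.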
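
The coalesce-after-each-join workaround from my original message looks roughly like this in spark-shell (again only a sketch; sqlContext is the one the shell pre-defines, and the paths and "id" column are made up):

    // Coalesce after every join so downstream stages don't run 200 mostly
    // empty tasks over a few kilobytes of data.
    val a = sqlContext.read.json("/data/small_a")    // placeholder path
    val b = sqlContext.read.json("/data/small_b")    // placeholder path
    val c = sqlContext.read.json("/data/small_c")    // placeholder path

    val ab  = a.join(b, "id").coalesce(1)            // back to a single partition
    val abc = ab.join(c, "id").coalesce(1)           // and again after the next join

Coalescing to 1 only makes sense while the data stays this small; tuning spark.sql.shuffle.partitions, or having Spark pick it per the JIRAs mentioned above, is the cleaner fix.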