Hi Ted,
Any chance you could expand on the SQLConf parameters, with more explanation of
when and why to change these settings?
Not all of them are made clear in the descriptions.
Thanks!
Best,
Ovidiu
> On 31 May 2016, at 16:30, Ted Yu wrote:
>
> Maciej:
> You can refer to the doc in
> sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
If you don't mind using the newest version, you can try v2.0-preview.
http://spark.apache.org/news/spark-2.0.0-preview.html
There, you can control the number of input partitions without shuffles by using the
two parameters below:
spark.sql.files.maxPartitionBytes
spark.sql.files.openCostInBytes
( Not doc
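For reference, a minimal sketch of how these two settings could be applied in a 2.0-preview session (the SparkSession setup, file name, and values are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-tuning").getOrCreate()

// Maximum number of bytes to pack into a single partition when reading files.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)  // 128 MB

// Estimated cost, in bytes, of opening a file for reading; a higher value packs
// more small files into a single partition.
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)      // 4 MB

val df = spark.read.text("perf_test1.csv")
println(df.rdd.getNumPartitions)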
Maciej:
You can refer to the doc in
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
for these parameters.
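As a side note, a Spark 2.0 shell (assumed here) can also list SQL configuration keys together with their values and descriptions via the SET -v command:

spark.sql("SET -v").show(1000, false)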
On Tue, May 31, 2016 at 7:27 AM, Takeshi Yamamuro wrote:
> If you don't mind using the newest version, you can try v2.0-preview.
> http://spark.apache.org/news/spark-2.0.0-preview.html
Thanks.
Under what conditions can the number of partitions be higher than minPartitions
when reading a text file? Should this be considered an infrequent situation?
To sum up: is there a more efficient way to ensure an exact number of
partitions than the following?
rdd = sc.textFile("perf_test1.csv", minPartitions=128).coalesce(128, true)
`coalesce` without shuffling can only produce fewer partitions than its parent
RDD.
As Ted said, you need to set shuffle to true, or you need to use
`RDD#repartition`.
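A quick sketch of that behavior (file name and target counts are assumptions for illustration):

val rdd = sc.textFile("perf_test1.csv", minPartitions = 128)
rdd.coalesce(64).getNumPartitions                   // narrow dependency: can only reduce the count
rdd.coalesce(256).getNumPartitions                  // stays at the parent's count; no increase without a shuffle
rdd.coalesce(256, shuffle = true).getNumPartitions  // 256, at the cost of a shuffle
rdd.repartition(256).getNumPartitions               // equivalent to coalesce(256, shuffle = true)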
// maropu
On Tue, May 31, 2016 at 11:02 PM, Ted Yu wrote:
> Value for shuffle is false by default.
>
> Have you tried setting it to true?
After setting shuffle to true I get the expected 128 partitions, but I'm
worried about the performance of such a solution, especially since I see that
some shuffling is done because the size of the partitions changes:
scala> sc.textFile("hdfs:///proj/dFAB_test/testdata/perf_test1.csv",
minPartitions=128).coalesce(128, true)
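One way to check what the shuffle does to the data distribution is to compare per-partition record counts; a sketch, with an assumed local file name:

val noShuffle   = sc.textFile("perf_test1.csv", minPartitions = 128)
val withShuffle = noShuffle.coalesce(128, shuffle = true)

// Count the records in each partition (consumes each partition's iterator once).
def partitionSizes(rdd: org.apache.spark.rdd.RDD[String]): Array[Int] =
  rdd.mapPartitions(it => Iterator(it.size)).collect()

println(partitionSizes(noShuffle).mkString(", "))
println(partitionSizes(withShuffle).mkString(", "))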
Value for shuffle is false by default.
Have you tried setting it to true?
Which Spark release are you using?
On Tue, May 31, 2016 at 6:13 AM, Maciej Sokołowski wrote:
> Hello Spark users and developers.
>
> I read a file and want to ensure that it has an exact number of partitions, for
> example 128.
Hello Spark users and developers.
I read a file and want to ensure that it has an exact number of partitions, for
example 128.
In the documentation I found:
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
But the argument here is the minimal number of partitions, so I use coalesce.
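A minimal sketch of the approach discussed in this thread (file name assumed): request a lower bound with minPartitions, then force the exact count with a shuffling coalesce or with repartition.

val rdd = sc.textFile("perf_test1.csv", minPartitions = 128).coalesce(128, shuffle = true)
// or, equivalently: sc.textFile("perf_test1.csv").repartition(128)
println(rdd.getNumPartitions)  // 128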