But why is that beneficial? The data is presumably quite large, so
distributing it across many partitions/files would seem to make sense.
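For illustration, the file-layout difference at stake can be sketched outside Spark. This is plain Python standing in for the writer, not Spark's actual implementation; `write_parts` is a hypothetical stand-in for how a partitioned writer emits one part-file per partition:

```python
import os
import tempfile

# Hypothetical stand-in for a partitioned writer: each partition
# becomes its own part-file, as Spark's DataFrameWriter does.
def write_parts(path, rows, num_partitions):
    os.makedirs(path, exist_ok=True)
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    for i in range(num_partitions):
        chunk = rows[i * size:(i + 1) * size]
        with open(os.path.join(path, f"part-{i:05d}"), "w") as f:
            f.writelines(line + "\n" for line in chunk)

rows = [f"row-{i}" for i in range(10)]

with tempfile.TemporaryDirectory() as d:
    # Default parallel write: one file per partition.
    write_parts(os.path.join(d, "many"), rows, 4)
    # repartition(1) analogue: everything lands in a single file.
    write_parts(os.path.join(d, "one"), rows, 1)
    many = len(os.listdir(os.path.join(d, "many")))
    one = len(os.listdir(os.path.join(d, "one")))
    print(many, one)  # 4 1
```

So the question is really whether collapsing to one part-file is worth losing the parallel write when the model data is large.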

On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:

> That is usually so the result comes out in one file, not partitioned over
> n files.
>
> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ml models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
>>
>> This shows up in most ml models I've seen (Word2Vec
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>> PCA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>> LDA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>> Am I missing some benefit of repartitioning like this?
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>


-- 
Asher Krim
Senior Software Engineer
