But why is that beneficial? The data is supposedly quite large, so distributing it across many partitions/files would seem to make sense.
On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:

> That is usually so the result comes out in one file, not partitioned over
> n files.
>
> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> Hi,
>>
>> I'm curious why it's common for data to be repartitioned to 1 partition
>> when saving ml models:
>>
>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
>>
>> This shows up in most ml models I've seen (Word2Vec
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>> PCA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>> LDA
>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>> Am I missing some benefit of repartitioning like this?
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer

--
Asher Krim
Senior Software Engineer
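For context on what the quoted `Seq(data)` pattern implies: the DataFrame being saved holds the model's metadata as a single row, so without `repartition(1)` Spark would write one output file per partition, nearly all of them empty. A minimal plain-Scala sketch (no Spark needed; the names and round-robin assignment are illustrative, mimicking how `repartition` spreads rows):

```scala
// A single-row dataset, like the model metadata wrapped in Seq(data) above.
val data = Seq("model-metadata-row")
val numPartitions = 8

// Round-robin the rows across partitions, as repartition would.
val partitions = Seq.tabulate(numPartitions) { i =>
  data.zipWithIndex.collect { case (row, j) if j % numPartitions == i => row }
}

val nonEmpty = partitions.count(_.nonEmpty)
println(nonEmpty)               // 1 partition actually holds the row
println(numPartitions - nonEmpty) // 7 partitions are empty, each would still emit a file
```

So `repartition(1)` is about output tidiness for tiny metadata, not performance: one parquet file instead of `n` files where `n - 1` contain nothing.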