Yup - it's because almost all model data in Spark ML (model coefficients) is "small", i.e. non-distributed.
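For concreteness, here's a minimal standalone sketch of the pattern being discussed (the ModelData case class, its contents, and the output path are made up for illustration; each real ML writer uses its own payload):

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for a model's coefficients -- in the real writers
// this would be e.g. PCA's principal components or Word2Vec's word vectors.
case class ModelData(word: String, vector: Array[Float])

object SaveModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SaveModelSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The entire "model" is a handful of rows, not a distributed dataset.
    val data = Seq(ModelData("hello", Array(0.1f, 0.2f)))

    // repartition(1) isn't about handling size -- it just ensures the Parquet
    // output is a single file rather than one file per partition.
    data.toDF().repartition(1).write.parquet("/tmp/model/data")

    spark.stop()
  }
}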
By contrast, if you look at ALS you'll see there is no repartitioning, since the factor DataFrames can be large.

On Fri, 13 Jan 2017 at 19:42, Sean Owen <so...@cloudera.com> wrote:

> You're referring to code that serializes models, which are quite small. For
> example, a PCA model consists of a few principal component vectors. It's a
> Dataset of just one element being saved here. It's reusing the code path
> normally used to save big datasets to output one file with one thing as
> Parquet.
>
> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> But why is that beneficial? The data is supposedly quite large;
>> distributing it across many partitions/files would seem to make sense.
>>
>> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> That is usually so the result comes out in one file, not partitioned
>>> over n files.
>>>
>>> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm curious why it's common for data to be repartitioned to 1 partition
>>>> when saving ML models:
>>>>
>>>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
>>>>
>>>> This shows up in most ML models I've seen (Word2Vec
>>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>>>> PCA
>>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>>>> LDA
>>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>>>> Am I missing some benefit of repartitioning like this?
>>>>
>>>> Thanks,
>>>> --
>>>> Asher Krim
>>>> Senior Software Engineer
>>
>> --
>> Asher Krim
>> Senior Software Engineer