Yup - it's because almost all model data in Spark ML (model coefficients) is "small", i.e. non-distributed.
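For concreteness, here's a minimal standalone sketch of the pattern being discussed (the ModelData case class, its contents, and the output path are made up for illustration; each real ML writer uses its own payload):

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for a model's coefficients -- in the real writers
// this would be e.g. PCA's principal components or Word2Vec's word vectors.
case class ModelData(word: String, vector: Array[Float])

object SaveModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SaveModelSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The entire "model" is a handful of rows, not a distributed dataset.
    val data = Seq(ModelData("hello", Array(0.1f, 0.2f)))

    // repartition(1) isn't about handling size -- it just ensures the Parquet
    // output is a single file rather than one file per partition.
    data.toDF().repartition(1).write.parquet("/tmp/model/data")

    spark.stop()
  }
}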
By contrast, if you look at ALS you'll see there is no repartitioning, since the factor DataFrames can be large.

On Fri, 13 Jan 2017 at 19:42, Sean Owen <so...@cloudera.com> wrote:

> You're referring to code that serializes models, which are quite small. For
> example, a PCA model consists of a few principal component vectors. It's a
> Dataset of just one element being saved here. It's reusing the code path
> normally used to save big datasets to output one file with one thing as
> Parquet.
>
> On Fri, Jan 13, 2017 at 5:29 PM Asher Krim <ak...@hubspot.com> wrote:
>
>> But why is that beneficial? The data is supposedly quite large;
>> distributing it across many partitions/files would seem to make sense.
>>
>> On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> That is usually so the result comes out in one file, not partitioned
>>> over n files.
>>>
>>> On Fri, Jan 13, 2017 at 5:23 PM Asher Krim <ak...@hubspot.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm curious why it's common for data to be repartitioned to 1 partition
>>>> when saving ML models:
>>>>
>>>> sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
>>>>
>>>> This shows up in most ML models I've seen (Word2Vec
>>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L314>,
>>>> PCA
>>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L189>,
>>>> LDA
>>>> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L605>).
>>>> Am I missing some benefit of repartitioning like this?
>>>>
>>>> Thanks,
>>>> --
>>>> Asher Krim
>>>> Senior Software Engineer
>>
>> --
>> Asher Krim
>> Senior Software Engineer