Hi Gabor,

Thanks for getting involved in Flink's ML library. Always good to have
people working on it :-)

Some thoughts concerning the points you've raised inline:

On Tue, Oct 4, 2016 at 12:47 PM, Gábor Hermann <m...@gaborhermann.com>
wrote:

> Hey all,
>
> We've been working on improvements to the recommendation algorithms in Flink ML, and
> some API design questions have come up. Our plans in short:
>
> - Extend ALS to work on implicit feedback datasets [1]
> - DSGD implementation for matrix factorization [2]
>
Have you looked at GradientDescent? Maybe it already does what you want
to implement, or can be adapted to do it.


> - Ranking prediction based on a matrix factorization model [3]
> - Evaluations for recommenders (precision, recall, nDCG) [4]
>
>
> First, we've seen that an evaluation framework has been implemented (in a
> not yet merged PR [5]), but evaluations of recommenders would not fit into
> this framework. This is basically because recommender evaluations, instead
> of comparing real numbers or fixed-size vectors, compare top lists of
> possibly different, arbitrarily large sizes. The details are described in
> FLINK-4713 [4]. I see three possible solutions for this:
>
> - we either rework the evaluation framework proposed in [5] to allow
> inputs suitable for recommender evaluations
> - or fit the recommender evaluations into the framework in a somewhat
> unnatural form, with possibly bad performance implications
> - or do not fit recommender evaluations in the framework at all
>
> I would prefer reworking the evaluation framework, but it's up for
> discussion. It also depends on whether the PR will be merged soon or not.
> Theodore, what are your thoughts on this as the author of the eval
> framework?
>
It would be great if the evaluation framework in the PR could be adapted to
also support evaluating recommenders, provided there is no fundamental
reason speaking against it. My gut feeling is that it should be possible.
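
Just to make it concrete what such an evaluation consumes: precision@k, for
example, compares a per-user top list of arbitrary length with the set of
relevant items. Here is a rough sketch of how that could look on a flat
(user, item, rank) representation; the data and names are made up and nothing
here is taken from the PR:

import org.apache.flink.api.scala._

object PrecisionAtKSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val k = 2

    // Per-user top lists produced by a recommender: (user, item, rank).
    val recommendations = env.fromElements(
      (1, 42, 1), (1, 17, 2), (1, 3, 3),
      (2, 99, 1), (2, 42, 2))

    // Ground truth: the items each user actually interacted with.
    val relevant = env.fromElements((1, 17), (1, 3), (2, 7))

    // precision@k per user: hits within the top k, divided by k.
    val precisionAtK = recommendations
      .filter(_._3 <= k)
      .map(r => (r._1, r._2))
      .leftOuterJoin(relevant).where(0, 1).equalTo(0, 1) {
        (rec, rel) => (rec._1, if (rel == null) 0.0 else 1.0)
      }
      .groupBy(0)
      .sum(1)
      .map(t => (t._1, t._2 / k))

    precisionAtK.print()
  }
}

The important point is that the number of records per user is not fixed,
which is exactly what makes the fixed-size assumption of the current
framework problematic.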


>
> Second, picking the form of evaluation also affects how we should represent
> the ranking prediction. We could choose a flat form (i.e.
> DataSet[(Int,Int,Int)]) or represent the rankings in an array (i.e.
> DataSet[(Int,Array[Int])]). See details in [4]. The flat form would allow
> the system to work in a distributed fashion, so I'd go with that
> representation, but it's also up for discussion.
>
It would be great to keep scalability in mind. Thus, I would go with the
more scalable version.
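
Just to illustrate the trade-off with some made-up data (again, not code from
the PR): going from the flat form to the array form is a simple grouping, but
it materializes each user's complete list in a single record.

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

object RankingRepresentations {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Flat form: one record per (user, item, rank) triple; a user's ranking
    // can be arbitrarily long and stays distributed across the cluster.
    val flat: DataSet[(Int, Int, Int)] = env.fromElements(
      (1, 42, 1), (1, 17, 2), // user 1 is recommended items 42 and 17
      (2, 99, 1))             // user 2 is recommended item 99

    // Array form: a user's complete top list ends up in one record, so very
    // long rankings have to fit into a single task's memory.
    val asArrays: DataSet[(Int, Array[Int])] = flat
      .groupBy(0)
      .sortGroup(2, Order.ASCENDING)
      .reduceGroup { items =>
        val list = items.toArray
        (list.head._1, list.map(_._2))
      }

    asArrays.print()
  }
}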


>
>
> Last, ALS and DSGD are two different algorithms for training the same
> matrix factorization model, but in the current API this distinction is not
> really visible to the user. Training an ALS model modifies the ALS object
> and puts a matrix factorization model in it. We could do the same with
> DSGD and have
> a common abstraction (say a superclass MatrixFactorization). However, in my
> opinion, it might be more straightforward if ALS.fit would return a
> different object (say MatrixFactorizationModel akin to Spark [6])
> containing the DataSets representing the factors. By using this approach,
> we could avoid checking at runtime whether a model has been trained or not,
> and force the user at compile time to only call predict on models that have
> already been trained.
>
I'm not sure whether the latter approach plays well with the pipelining
mechanism. One always has to keep in mind that one's abstraction should also
work in an ML pipeline.

I think it would be good to implement a ScoreMatrixFactorizationRecommender
and a RankingMatrixFactorizationRecommender which both work on a
MatrixFactorizationModel. This model can then either be computed by ALS or
DSGD. This could be controlled by a configuration parameter of the
recommenders.
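
Roughly, I picture something like the following sketch. None of these classes
exists yet, the names are only placeholders, and a real implementation would
of course have to be expressed in terms of the Estimator/Predictor
abstractions of the pipelining:

import org.apache.flink.api.scala._

// Both solvers produce the same kind of model.
case class MatrixFactorizationModel(
    userFactors: DataSet[(Int, Array[Double])],
    itemFactors: DataSet[(Int, Array[Double])])

// ALS and DSGD would both implement this interface.
trait FactorizationSolver {
  def factorize(ratings: DataSet[(Int, Int, Double)]): MatrixFactorizationModel
}

// The solver is passed in as a configuration parameter of the recommender.
class ScoreMatrixFactorizationRecommender(solver: FactorizationSolver) {
  private var model: Option[MatrixFactorizationModel] = None

  def fit(ratings: DataSet[(Int, Int, Double)]): Unit = {
    model = Some(solver.factorize(ratings))
  }

  // Score each (user, item) pair with the dot product of its latent factors.
  def predict(pairs: DataSet[(Int, Int)]): DataSet[(Int, Int, Double)] = {
    val m = model.getOrElse(
      throw new IllegalStateException("fit has to be called before predict"))
    pairs
      .join(m.userFactors).where(pair => pair._1).equalTo(uf => uf._1)
      .join(m.itemFactors).where(joined => joined._1._2).equalTo(itf => itf._1)
      .map { case ((pair, userF), (_, itemF)) =>
        (pair._1, pair._2, userF.zip(itemF).map(p => p._1 * p._2).sum)
      }
  }
}

// A RankingMatrixFactorizationRecommender would reuse the same
// MatrixFactorizationModel to emit (user, item, rank) top lists instead of
// raw scores.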


> Of course, this could also be applied to other models in Flink ML, and
> would be an API breaking change. Was there any reason to pick the current
> training API design instead of the more "typesafe" one? I am certain that
> we should keep the ML API consistent, so we should either change the
> training API of all models or leave them as they are. That said, I don't
> think it would take much effort to modify the API. We could also keep and
> deprecate the current fit method to avoid breaking the API. What do you
> think about this? If there are no objections, I'm happy to open a JIRA and
> start working on it.
>
What do you mean by more "typesafe"? I don't see how returning the
trained model from the fit method gives you more type safety.

The reason why fit does not return a model is that not every estimator
necessarily has a model it trains (see the PolynomialFeatures extractor).
Furthermore, when creating pipelines you basically also have to create
chained models. This is doable, no question, but you can also retrieve the
models from the modelful estimators as it is currently implemented.
Moreover, the model itself is rarely useful without the respective
prediction algorithm.

But if you need to change the API, then we can try to figure out how we can
do this without breaking the pipelining mechanism.
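
To make sure we are talking about the same thing, here is a minimal, made-up
sketch of the two styles as I understand them; the bodies are only stubs and
none of this is actual Flink ML code:

// Current style: fit mutates the estimator, so predict has to check at
// runtime whether a model has already been trained.
class CurrentStyleALS {
  private var factors: Option[Array[Double]] = None

  def fit(ratings: Seq[(Int, Int, Double)]): Unit = {
    factors = Some(Array.fill(10)(0.0)) // stub for the learned factors
  }

  def predict(user: Int, item: Int): Double = factors match {
    case Some(f) => f.sum // stub scoring
    case None    => throw new IllegalStateException("fit has not been called")
  }
}

// Proposed style: fit returns the trained model, so predict simply does not
// exist on an untrained estimator and the check moves to compile time.
class ReturningStyleALS {
  def fit(ratings: Seq[(Int, Int, Double)]): MfModel =
    MfModel(Array.fill(10)(0.0)) // stub for the learned factors
}

case class MfModel(factors: Array[Double]) {
  def predict(user: Int, item: Int): Double = factors.sum // stub scoring
}

The open question for me is how the second style interacts with chaining
estimators and transformers in a pipeline.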

>
>
> [1] https://github.com/apache/flink/pull/2542
> [2] http://dx.doi.org/10.1145/2020408.2020426
> [3] https://issues.apache.org/jira/browse/FLINK-4712
> [4] https://issues.apache.org/jira/browse/FLINK-4713
> [5] https://github.com/apache/flink/pull/1849
> [6] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L315
>
> Cheers,
> Gabor
>
>
>
Cheers,
Till
