This is exactly the core problem in the linked issue: normally you would use TrainValidationSplit or CrossValidator to do hyper-parameter selection using cross-validation. You could tune the factor size (rank), the regularization parameter, and alpha (for implicit preference data), for example.
Because of the NaN issue you cannot currently use the cross-validators with ALS, so you would have to do the tuning yourself manually (dropping the NaNs from the prediction results, as Krishna says).

On Mon, 25 Jul 2016 at 11:40 Rohit Chaddha <rohitchaddha1...@gmail.com> wrote:

> Hi Krishna,
>
> Great .. I had no idea about this. I tried your suggestion by using
> na.drop() and got an RMSE = 1.5794048211812495.
> Any suggestions on how this can be reduced and the model improved?
>
> Regards,
> Rohit
>
> On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar <ksanka...@gmail.com> wrote:
>
>> Thanks Nick. I also ran into this issue.
>> VG, one workaround is to drop the NaNs from the predictions (df.na.drop())
>> and then use that dataset for the evaluator. In real life, probably detect
>> the NaNs and recommend the most popular items over some window.
>> HTH.
>> Cheers
>> <k/>
>>
>> On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>
>>> It seems likely that you're running into
>>> https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when
>>> the test dataset in the train/test split contains users or items that
>>> were not in the training set. Hence the model has no computed factors
>>> for those ids, and ALS 'transform' currently returns NaN for them. This
>>> in turn results in NaN for the evaluator result.
>>>
>>> I have a PR open on that issue that will hopefully address this soon.
>>>
>>> On Sun, 24 Jul 2016 at 17:49 VG <vlin...@gmail.com> wrote:
>>>
>>>> Ping. Does anyone have suggestions/advice for me?
>>>> It will be really helpful.
>>>>
>>>> VG
>>>>
>>>> On Sun, Jul 24, 2016 at 12:19 AM, VG <vlin...@gmail.com> wrote:
>>>>
>>>>> Sean,
>>>>>
>>>>> I did this just to test the model. When I split my data into 80%
>>>>> training and 20% test, I get a root-mean-square error = NaN,
>>>>> so I am wondering where I might be going wrong.
>>>>>
>>>>> Regards,
>>>>> VG
>>>>>
>>>>> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> No, that's certainly not to be expected. ALS works by computing a much
>>>>>> lower-rank representation of the input. It would not reproduce the
>>>>>> input exactly, and you don't want it to -- this would be seriously
>>>>>> overfit. This is why, in general, you don't evaluate a model on the
>>>>>> training set.
>>>>>>
>>>>>> On Sat, Jul 23, 2016 at 7:37 PM, VG <vlin...@gmail.com> wrote:
>>>>>> > I am trying to run ml.ALS to compute some recommendations.
>>>>>> >
>>>>>> > Just to test, I am using the same dataset for training the ALSModel
>>>>>> > and for predicting the results based on the model.
>>>>>> >
>>>>>> > When I evaluate the result using RegressionEvaluator I get a
>>>>>> > root-mean-square error = 1.5544064263236066.
>>>>>> >
>>>>>> > I think this should be 0. Any suggestions on what might be going wrong?
>>>>>> >
>>>>>> > Regards,
>>>>>> > Vipul