Re: General question on using StringIndexer in SparkML

Yanbo Liang Wed, 02 Dec 2015 02:38:17 -0800

You can set "handleInvalid" to "skip", which skips the rows whose labels do
not exist in the training dataset.
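
As a rough illustration of both points in this thread (the label-to-index map
is fixed at fit time and reused at prediction time, and unseen labels can be
skipped), here is a plain-Python sketch. It is not Spark code and not how
StringIndexer is implemented: the names LabelIndexerModel, fit_label_indexer
and handle_invalid are made up for this example, and the alphabetical
tie-break is an assumption of the sketch.

```python
from collections import Counter


class LabelIndexerModel:
    """Hypothetical stand-in for a fitted indexer: a fixed label -> index map."""

    def __init__(self, label_to_index, handle_invalid="error"):
        self.label_to_index = dict(label_to_index)
        self.handle_invalid = handle_invalid

    def transform(self, values):
        """Index values with the map built at fit time; never re-fit here."""
        indexed = []
        for value in values:
            if value in self.label_to_index:
                indexed.append((value, float(self.label_to_index[value])))
            elif self.handle_invalid == "skip":
                continue  # drop rows whose label was never seen during fit
            else:
                raise ValueError(f"Unseen label: {value}")
        return indexed


def fit_label_indexer(values, handle_invalid="error"):
    """Most frequent label gets index 0 (ties broken alphabetically, by assumption)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: i for i, label in enumerate(ordered)}
    return LabelIndexerModel(mapping, handle_invalid)


train = ["A", "A", "A", "B", "B", "C"]
# Different label frequencies than in train, plus an unseen label "D".
test = ["C", "C", "C", "B", "B", "A", "D"]

model = fit_label_indexer(train, handle_invalid="skip")  # fit once, on train only
print(model.label_to_index)
print(model.transform(test))  # reuse the fitted model on new data
```

The fitted map stays A=0, B=1, C=2 even though C is the most frequent label in
the test data, and the row with the unseen label "D" is dropped instead of
raising an error. With handle_invalid left at "error", the same input raises,
analogous to the "Unseen label" exception quoted below.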


2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:

> Hi Jeff,
>
> I went through the link you provided and I now understand how fit()
> and transform() work.
> I tried to use the pipeline in my code and I am getting an exception: Caused
> by: org.apache.spark.SparkException: Unseen label:
>
> The reason for this error, as I understand it, is:
> For the column that I am applying StringIndexer to, the test data has
> values which were not present in the train data.
> Since fit() is done only on the train data, the indexing fails.
>
> Can you suggest what can be done in this situation?
>
> Thanks,
>
> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Thank you Jeff.
>>
>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> StringIndexer is an estimator which would train a model to be used both
>>> in training & prediction. So it is consistent between training & prediction.
>>>
>>> You may want to read this section of spark ml doc
>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>
>>>
>>>
>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Thanks for the reply Yanbo.
>>>>
>>>> I understand that the model will be trained using the indexer map
>>>> created during the training stage.
>>>>
>>>> But I am getting a new set of data during prediction, and I have
>>>> to do StringIndexing on the new data as well.
>>>> Right now I am using a new StringIndexer for this purpose. Is there
>>>> any way that I can reuse the indexer used in the training stage?
>>>>
>>>> Note: I have a pipeline with StringIndexer in it, and I am fitting
>>>> my train data with it and building the model. Then later, when I get the new
>>>> data for prediction, I am using the same pipeline to fit the data again and
>>>> do the prediction.
>>>>
>>>> Thanks and Regards,
>>>> Vishnu Viswanath
>>>>
>>>>
>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Vishnu,
>>>>>
>>>>> The string-to-index map is generated at the model training step and
>>>>> used at the model prediction step.
>>>>> This means that the string-to-index map will not change during
>>>>> prediction. You will use the original trained model when you do
>>>>> prediction.
>>>>>
>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>> vishnu.viswanat...@gmail.com>:
>>>>> > Hi All,
>>>>> >
>>>>> > I have a general question on using StringIndexer.
>>>>> > StringIndexer gives an index to each label in the feature starting
>>>>> > from 0 (0 for the most frequent word).
>>>>> >
>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>> > transforming one of my columns.
>>>>> > e.g., suppose A was the most frequent word, followed by B and C.
>>>>> >
>>>>> > So the StringIndexer will generate
>>>>> >
>>>>> > A  0.0
>>>>> > B  1.0
>>>>> > C  2.0
>>>>> >
>>>>> > After building the model, I am going to do some prediction using
>>>>> > this model, so I do the same transformation on the new data which I need to
>>>>> > predict. Suppose the new dataset has C as the most frequent word, followed by
>>>>> > B and A. Then the StringIndexer will assign indexes as
>>>>> >
>>>>> > C 0.0
>>>>> > B 1.0
>>>>> > A 2.0
>>>>> >
>>>>> > These indexes are different from what we used for modeling. So won’t
>>>>> > this give me a wrong prediction if I use StringIndexer?
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> 
>
