Re: General question on using StringIndexer in SparkML

Vishnu Viswanath Wed, 02 Dec 2015 07:58:20 -0800

Thank you Yanbo,

It looks like this is available only in version 1.6.
Can you tell me how and when I can download version 1.6?


Thanks and Regards,
Vishnu Viswanath,

On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <yblia...@gmail.com> wrote:

> You can set "handleInvalid" to "skip", which will skip the labels that do
> not exist in the training dataset.
>
> 2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
>
>> Hi Jeff,
>>
>> I went through the link you provided and I could understand how the fit()
>> and transform() work.
>> I tried to use the pipeline in my code and I am getting an exception: Caused
>> by: org.apache.spark.SparkException: Unseen label:
>>
>> The reason for this error as per my understanding is:
>> For the column on which I am doing StringIndexing, the test data
>> has values that were not present in the train data.
>> Since fit() is done only on the train data, the indexing fails.
>>
>> Can you suggest what can be done in this situation?
>>
>> Thanks,
>>
>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>>> Thank you Jeff.
>>>
>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> StringIndexer is an estimator that trains a model, and the same model is
>>>> used in both training and prediction, so the indexing is consistent
>>>> between the two.
>>>>
>>>> You may want to read this section of the Spark ML docs:
>>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>>
>>>>
>>>>
>>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>>> vishnu.viswanat...@gmail.com> wrote:
>>>>
>>>>> Thanks for the reply Yanbo.
>>>>>
>>>>> I understand that the model will be trained using the indexer map
>>>>> created during the training stage.
>>>>>
>>>>> But I am getting a new set of data during prediction, and I have
>>>>> to do StringIndexing on the new data also.
>>>>> Right now I am using a new StringIndexer for this purpose. Is there
>>>>> any way that I can reuse the indexer from the training stage?
>>>>>
>>>>> Note: I have a pipeline with StringIndexer in it, and I am
>>>>> fitting my train data with it to build the model. Later, when I get
>>>>> the new data for prediction, I am using the same pipeline to fit the data
>>>>> again and do the prediction.
>>>>>
>>>>> Thanks and Regards,
>>>>> Vishnu Viswanath
>>>>>
>>>>>
>>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Vishnu,
>>>>>>
>>>>>> The string-to-index map is generated at the model training step and
>>>>>> used at the model prediction step.
>>>>>> This means the string-to-index map will not change during
>>>>>> prediction. You will use the original trained model when you do
>>>>>> prediction.
>>>>>>
>>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>>> vishnu.viswanat...@gmail.com>:
>>>>>> > Hi All,
>>>>>> >
>>>>>> > I have a general question on using StringIndexer.
>>>>>> > StringIndexer gives an index to each label in the feature, starting
>>>>>> > from 0 (0 for the most frequent word).
>>>>>> >
>>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>>> > transforming one of my columns.
>>>>>> > E.g., suppose A was the most frequent word, followed by B and C.
>>>>>> >
>>>>>> > So the StringIndexer will generate
>>>>>> >
>>>>>> > A  0.0
>>>>>> > B  1.0
>>>>>> > C  2.0
>>>>>> >
>>>>>> > After building the model, I am going to do some prediction using
>>>>>> > this model, so I do the same transformation on the new data that I
>>>>>> > need to predict on. Suppose the new dataset has C as the most
>>>>>> > frequent word, followed by B and A. Then the StringIndexer will
>>>>>> > assign indexes as
>>>>>> >
>>>>>> > C 0.0
>>>>>> > B 1.0
>>>>>> > A 2.0
>>>>>> >
>>>>>> > These indexes are different from what we used for modeling. So
>>>>>> > won’t this give me a wrong prediction if I use StringIndexer?
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>>
>>> 
>>
>
>
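
The fit/transform behavior discussed in the thread above can be illustrated
with a small plain-Python sketch. This is not the Spark API: the function
names fit_indexer and transform, the handle_invalid argument, and the
alphabetical tie-breaking are assumptions made only for illustration. The
sketch builds the label-to-index map once from the training data (most
frequent label gets 0.0), reuses that map on new data, and either raises on
an unseen label or skips it, loosely mirroring the "handleInvalid" = "skip"
option mentioned above.

```python
from collections import Counter


def fit_indexer(labels):
    """Build a label -> index map; the most frequent label gets 0.0."""
    counts = Counter(labels)
    # Sort by descending frequency. Ties are broken alphabetically here,
    # which is an assumption of this sketch, not documented Spark behavior.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(mapping, labels, handle_invalid="error"):
    """Apply an already-fitted map; never rebuild it from the new data."""
    indexed = []
    for label in labels:
        if label in mapping:
            indexed.append((label, mapping[label]))
        elif handle_invalid == "skip":
            continue  # drop rows whose label was not seen during fit
        else:
            raise ValueError(f"Unseen label: {label}")
    return indexed


train = ["A", "A", "A", "B", "B", "C"]
new_data = ["C", "C", "C", "B", "B", "A", "D"]

# Fitted once on the training data and then reused for prediction.
model = fit_indexer(train)

# Refitting on the new data is the mistake described in the original
# question: the same labels end up with different indexes.
refit = fit_indexer(new_data)

# Reusing the trained map keeps A/B/C consistent and drops the unseen "D".
indexed = transform(model, new_data, handle_invalid="skip")
```

With the default handle_invalid="error", the same transform call would raise
on "D", which corresponds to the "Unseen label" exception reported above.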
