Re: General question on using StringIndexer in SparkML

Vishnu Viswanath Wed, 02 Dec 2015 18:16:09 -0800

Thank you.

On Wed, Dec 2, 2015 at 8:12 PM, Yanbo Liang <[email protected]> wrote:


> You can currently get 1.6.0-RC1 from
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
> but it is a release candidate, not the final release.
>
> 2015-12-02 23:57 GMT+08:00 Vishnu Viswanath <[email protected]>:
>
>> Thank you Yanbo,
>>
>> It looks like this is available only in version 1.6.
>> Can you tell me how and when I can download version 1.6?
>>
>> Thanks and Regards,
>> Vishnu Viswanath,
>>
>> On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <[email protected]> wrote:
>>
>>> You can set "handleInvalid" to "skip", which makes the indexer skip
>>> labels that do not exist in the training dataset.
>>>
>>> 2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <
>>> [email protected]>:
>>>
>>>> Hi Jeff,
>>>>
>>>> I went through the link you provided and I now understand how
>>>> fit() and transform() work.
>>>> I tried to use the pipeline in my code and I am getting the exception
>>>> Caused by: org.apache.spark.SparkException: Unseen label:
>>>>
>>>> The reason for this error, as per my understanding, is:
>>>> For the column on which I am doing StringIndexing, the test data
>>>> has values which were not there in the train data.
>>>> Since fit() is done only on the train data, the indexing fails.
>>>>
>>>> Can you suggest what can be done in this situation?
>>>>
>>>> Thanks,
>>>>
>>>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
>>>> [email protected]> wrote:
>>>>
>>>>> Thank you Jeff.
>>>>>
>>>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <[email protected]> wrote:
>>>>>
>>>>>> StringIndexer is an estimator that trains a model to be used in
>>>>>> both training & prediction, so the indexing is consistent between
>>>>>> training & prediction.
>>>>>>
>>>>>> You may want to read this section of the Spark ML docs:
>>>>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the reply Yanbo.
>>>>>>>
>>>>>>> I understand that the model will be trained using the indexer map
>>>>>>> created during the training stage.
>>>>>>>
>>>>>>> But I get a new set of data during prediction, and I have to do
>>>>>>> StringIndexing on the new data as well. Right now I am using a new
>>>>>>> StringIndexer for this purpose. Is there any way that I can reuse
>>>>>>> the indexer used in the training stage?
>>>>>>>
>>>>>>> Note: I have a pipeline with a StringIndexer in it, and I fit my
>>>>>>> train data with it to build the model. Later, when I get the new
>>>>>>> data for prediction, I use the same pipeline to fit the data again
>>>>>>> and do the prediction.
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Vishnu Viswanath
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Vishnu,
>>>>>>>>
>>>>>>>> The string-to-index map is generated at the model training step and
>>>>>>>> used at the model prediction step.
>>>>>>>> This means that the string-to-index map will not change during
>>>>>>>> prediction. You will use the originally trained model when you do
>>>>>>>> prediction.
>>>>>>>>
>>>>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>>>>> [email protected]>:
>>>>>>>> > Hi All,
>>>>>>>> >
>>>>>>>> > I have a general question on using StringIndexer.
>>>>>>>> > StringIndexer gives an index to each label in the feature,
>>>>>>>> > starting from 0 (0 for the most frequent word).
>>>>>>>> >
>>>>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>>>>> > transforming one of my columns.
>>>>>>>> > e.g., suppose A was the most frequent word, followed by B and C.
>>>>>>>> >
>>>>>>>> > So the StringIndexer will generate
>>>>>>>> >
>>>>>>>> > A  0.0
>>>>>>>> > B  1.0
>>>>>>>> > C  2.0
>>>>>>>> >
>>>>>>>> > After building the model, I am going to do some prediction using
>>>>>>>> > this model, so I do the same transformation on the new data which
>>>>>>>> > I need to predict. Suppose the new dataset has C as the most
>>>>>>>> > frequent word, followed by B and A. The StringIndexer will then
>>>>>>>> > assign indexes as
>>>>>>>> >
>>>>>>>> > C 0.0
>>>>>>>> > B 1.0
>>>>>>>> > A 2.0
>>>>>>>> >
>>>>>>>> > These indexes are different from what we used for modeling. So
>>>>>>>> > won’t this give me a wrong prediction if I use StringIndexer?
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 
>>>>
>>>
>>>
>>
>


-- 
Thanks and Regards,
Vishnu Viswanath,
*www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
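
The behaviour discussed in this thread can be illustrated with a small
plain-Python sketch. It is not Spark code and does not claim to show how
Spark implements StringIndexer: the class name FrequencyIndexer, its
handle_invalid argument, and the alphabetical tie-breaking are all
assumptions made up for this illustration. It only models the three points
raised above: the label-to-index map is built once from the training data,
the same map is reused on new data regardless of the new frequencies, and
labels never seen during fitting either raise an error or are skipped.

```python
from collections import Counter


class FrequencyIndexer:
    """Toy stand-in for a fitted string indexer (hypothetical, not Spark)."""

    def __init__(self, handle_invalid="error"):
        if handle_invalid not in ("error", "skip"):
            raise ValueError("handle_invalid must be 'error' or 'skip'")
        self.handle_invalid = handle_invalid
        self.label_to_index = {}

    def fit(self, labels):
        # The most frequent label gets 0.0. Ties are broken alphabetically,
        # which is an assumption of this sketch.
        counts = Counter(labels)
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.label_to_index = {label: float(i) for i, label in enumerate(ordered)}
        return self

    def transform(self, labels):
        # Reuse the map built in fit(); never recompute it from new data.
        indexed = []
        for label in labels:
            if label in self.label_to_index:
                indexed.append((label, self.label_to_index[label]))
            elif self.handle_invalid == "skip":
                continue  # drop entries whose label was not seen in fit()
            else:
                raise ValueError(f"Unseen label: {label}")
        return indexed


# Training data: A is most frequent, then B, then C.
train = ["A", "A", "A", "B", "B", "C"]
# New data: C is most frequent, and D never appeared in training.
test = ["C", "C", "C", "B", "B", "A", "D"]

indexer = FrequencyIndexer(handle_invalid="skip").fit(train)
result = indexer.transform(test)

print(indexer.label_to_index)  # {'A': 0.0, 'B': 1.0, 'C': 2.0}
print(result)  # C still maps to 2.0, and the entry for D is dropped
```

Fitting a second indexer on the new data would instead map C to 0.0, which
is the mismatch the original question worried about. Reusing the fitted
object avoids it. With handle_invalid="error" the same transform call
raises on the label D, mirroring the "Unseen label" failure reported above.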
