Re: General question on using StringIndexer in SparkML

Vishnu Viswanath Sun, 29 Nov 2015 22:33:07 -0800

Thank you Jeff.

On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:


> StringIndexer is an estimator: fitting it on the training data produces a
> model that is then used for both training and prediction, so the
> string-to-index mapping is consistent between the two.
>
> You may want to read this section of the Spark ML docs:
> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>
>
>
> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks for the reply Yanbo.
>>
>> I understand that the model will be trained using the indexer map created
>> during the training stage.
>>
>> But I get a new set of data at prediction time, and I have to apply
>> string indexing to that data as well. Right now I am using a new
>> StringIndexer for this purpose. Is there any way to reuse the indexer
>> from the training stage?
>>
>> Note: I have a pipeline with a StringIndexer in it. I fit the pipeline on
>> my training data to build the model. Later, when I get new data for
>> prediction, I use the same pipeline to fit the data again and do the
>> prediction.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>>
>>
>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> Hi Vishnu,
>>>
>>> The string-to-index map is generated at the model training step and
>>> used at the model prediction step.
>>> This means the string-to-index map does not change during prediction.
>>> You use the original trained model when you do prediction.
>>>
>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
>>> > Hi All,
>>> >
>>> > I have a general question on using StringIndexer.
>>> > StringIndexer gives an index to each label in the feature, starting
>>> > from 0 (0 for the most frequent word).
>>> >
>>> > Suppose I am building a model, and I use StringIndexer to transform one
>>> > of my columns.
>>> > E.g., suppose A was the most frequent word, followed by B and C.
>>> >
>>> > So the StringIndexer will generate
>>> >
>>> > A  0.0
>>> > B  1.0
>>> > C  2.0
>>> >
>>> > After building the model, I am going to do some prediction using this
>>> > model, so I do the same transformation on the new data that I need to
>>> > predict on. Suppose the new dataset has C as the most frequent word,
>>> > followed by B and A. Then the StringIndexer will assign the indexes as
>>> >
>>> > C 0.0
>>> > B 1.0
>>> > A 2.0
>>> >
>>> > These indexes are different from the ones we used for modeling. So won’t
>>> > this give me a wrong prediction if I use StringIndexer?
>>> >
>>> > --
>>> > Thanks and Regards,
>>> > Vishnu Viswanath,
>>> > www.vishnuviswanath.com
>>>
>>
>>
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath,
>> www.vishnuviswanath.com
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
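
Note on the thread above: the difference between fitting the indexer once and
fitting it again on the prediction data can be sketched in plain Python. This
is only an illustration of the idea, not Spark code. The helper names
fit_string_indexer and transform are hypothetical, and the alphabetical
tie-break is an assumption made here so the result is deterministic. In Spark
itself, the analogous step is to call fit() once on the training data and then
call transform() on the returned model (a StringIndexerModel, or a
PipelineModel when the indexer is a pipeline stage) for any new data.

```python
from collections import Counter


def fit_string_indexer(values):
    """Build a label -> index map; the most frequent label gets 0.0."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (an assumption
    # made for this sketch so that the ordering is deterministic).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(mapping, values):
    """Apply an already-fitted map without recomputing frequencies."""
    return [mapping[v] for v in values]


# Training data: A is most frequent, then B, then C.
train = ["A", "A", "A", "B", "B", "C"]
# Prediction data: C is most frequent, then B, then A.
new_data = ["C", "C", "C", "B", "B", "A"]

fitted = fit_string_indexer(train)

# Reusing the fitted map keeps A=0.0, B=1.0, C=2.0 on the new data.
reused = transform(fitted, new_data)

# Fitting again on the new data silently changes the meaning of each index.
refit = transform(fit_string_indexer(new_data), new_data)

print(fitted)
print(reused)
print(refit)
```

With the reused map, C in the new data is still encoded as 2.0, matching what
the model saw during training. With the refitted map, C becomes 0.0, which is
the mismatch described in the original question. The sketch does not handle
labels that were absent from the training data.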
