Thank you Jeff.

On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
> StringIndexer is an estimator which trains a model that is used in both
> training & prediction, so the mapping is consistent between the two.
>
> You may want to read this section of the Spark ML doc:
> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>
> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks for the reply Yanbo.
>>
>> I understand that the model will be trained using the indexer map created
>> during the training stage.
>>
>> But I get a new set of data during prediction, and I have to do
>> StringIndexing on the new data as well. Right now I am using a new
>> StringIndexer for this purpose. Is there any way that I can reuse the
>> indexer used in the training stage?
>>
>> Note: I have a pipeline with a StringIndexer in it, and I fit my train
>> data to it to build the model. Later, when I get the new data for
>> prediction, I use the same pipeline to fit the data again and do the
>> prediction.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>>
>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> Hi Vishnu,
>>>
>>> The string-to-index map is generated at the model training step and
>>> used at the model prediction step.
>>> This means that the string-to-index map will not change during
>>> prediction. You will use the original trained model when you do
>>> prediction.
>>>
>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:
>>> > Hi All,
>>> >
>>> > I have a general question on using StringIndexer.
>>> > StringIndexer gives an index to each label in the feature, starting
>>> > from 0 (0 for the most frequent word).
>>> >
>>> > Suppose I am building a model, and I use StringIndexer for
>>> > transforming one of my columns.
>>> > e.g., suppose A was the most frequent word, followed by B and C.
>>> >
>>> > So the StringIndexer will generate
>>> >
>>> > A 0.0
>>> > B 1.0
>>> > C 2.0
>>> >
>>> > After building the model, I am going to do some prediction using this
>>> > model, so I do the same transformation on the new data which I need to
>>> > predict. Suppose the new dataset has C as the most frequent word,
>>> > followed by B and A. Then the StringIndexer will assign the indexes as
>>> >
>>> > C 0.0
>>> > B 1.0
>>> > A 2.0
>>> >
>>> > These indexes are different from what we used for modeling. So won't
>>> > this give me a wrong prediction if I use StringIndexer?
>>> >
>>> > --
>>> > Thanks and Regards,
>>> > Vishnu Viswanath,
>>> > www.vishnuviswanath.com
>>>
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath,
>> www.vishnuviswanath.com
>>
>
> --
> Best Regards
>
> Jeff Zhang
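
To make the point of the thread concrete, here is a minimal plain-Python sketch of the fit-once, transform-many idea, using the A/B/C example from the original question. It is not Spark code: `FrequencyIndexer` and `FrequencyIndexerModel` are hypothetical stand-ins for an estimator and its fitted model, and the alphabetical tie-breaking is an assumption of this sketch. In Spark ML the analogous step would be calling `fit` on the pipeline once with the training data and then calling `transform` on the returned fitted model for the new data, rather than fitting the pipeline again.

```python
from collections import Counter


class FrequencyIndexerModel:
    """Hypothetical fitted model: holds the label-to-index map learned at fit time."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, labels):
        # Look up each label in the stored map; nothing is re-learned here.
        # An unseen label raises KeyError in this sketch.
        return [self.mapping[label] for label in labels]


class FrequencyIndexer:
    """Hypothetical estimator: learns a frequency-ordered index from the data."""

    def fit(self, labels):
        counts = Counter(labels)
        # Most frequent label gets 0.0; ties broken alphabetically (an assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return FrequencyIndexerModel(
            {label: float(i) for i, label in enumerate(ordered)}
        )


train = ["A", "A", "A", "B", "B", "C"]      # A most frequent, then B, then C
new_data = ["C", "C", "C", "B", "B", "A"]   # C most frequent, then B, then A

# Fit once on the training data, then reuse the fitted model for prediction.
model = FrequencyIndexer().fit(train)
reused = model.transform(new_data)

# Fitting again on the new data learns a different map, which is the
# mismatch described in the original question.
refit = FrequencyIndexer().fit(new_data).transform(new_data)

print(model.mapping)
print(reused)
print(refit)
```

With the reused model, C in the new data still maps to 2.0, as it did in training. Only the refit variant produces the swapped indexes, so the fix is to keep the fitted model from training and call only `transform` at prediction time.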