You can set "handleInvalid" to "skip", which skips the rows whose labels do not exist in the training dataset.
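
To make the behaviour concrete, here is a minimal sketch in plain Python (it is not Spark code and does not call the Spark API). The names `SimpleStringIndexerModel` and `fit_string_indexer` are hypothetical and exist only for illustration. It assumes the semantics discussed in this thread: the label-to-index map is built once from the training data, the same map is reused on new data, and an unseen label either raises an error or is skipped.

```python
from collections import Counter


class SimpleStringIndexerModel:
    """Toy stand-in for a fitted indexer; hypothetical, not the Spark API."""

    def __init__(self, label_to_index, handle_invalid="error"):
        self.label_to_index = label_to_index
        self.handle_invalid = handle_invalid

    def transform(self, labels):
        """Index labels using the map learned at fit time."""
        indexed = []
        for label in labels:
            if label in self.label_to_index:
                indexed.append((label, float(self.label_to_index[label])))
            elif self.handle_invalid == "skip":
                continue  # drop rows whose label was not seen during fit
            else:
                raise ValueError(f"Unseen label: {label}")
        return indexed


def fit_string_indexer(labels, handle_invalid="error"):
    """Build the label-to-index map from the training labels only."""
    counts = Counter(labels)
    # Most frequent label gets index 0. Breaking ties alphabetically is an
    # arbitrary choice made for this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: i for i, label in enumerate(ordered)}
    return SimpleStringIndexerModel(mapping, handle_invalid)


train = ["A", "A", "A", "B", "B", "C"]
test = ["C", "C", "C", "B", "B", "A", "D"]  # "D" never appears in train

# Fit once on the training data, then reuse the same model on the test data.
model = fit_string_indexer(train, handle_invalid="skip")
indexed_test = model.transform(test)
```

Although "C" is the most frequent label in `test`, it keeps the index it was given during training, and the row with "D" is dropped instead of raising an error. With `handle_invalid="error"` the same call would fail on "D", which mirrors the "Unseen label" exception reported below.
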
2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <vishnu.viswanat...@gmail.com>:

> Hi Jeff,
>
> I went through the link you provided and I could understand how fit()
> and transform() work.
> I tried to use the pipeline in my code and I am getting an exception:
> Caused by: org.apache.spark.SparkException: Unseen label:
>
> The reason for this error, as per my understanding, is:
> For the column on which I am doing StringIndexing, the test data has
> values which were not there in the train data.
> Since fit() is done only on the train data, the indexing fails.
>
> Can you suggest what can be done in this situation?
>
> Thanks,
>
> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <
> vishnu.viswanat...@gmail.com> wrote:
>
> Thank you Jeff.
>>
>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> StringIndexer is an estimator which trains a model to be used both
>>> in training & prediction. So it is consistent between training & prediction.
>>>
>>> You may want to read this section of the Spark ML doc:
>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>
>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <
>>> vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Thanks for the reply Yanbo.
>>>>
>>>> I understand that the model will be trained using the indexer map
>>>> created during the training stage.
>>>>
>>>> But I am getting a new set of data during prediction, and I have
>>>> to do StringIndexing on the new data as well.
>>>> Right now I am using a new StringIndexer for this purpose. Is there
>>>> any way that I can reuse the indexer used in the training stage?
>>>>
>>>> Note: I have a pipeline with a StringIndexer in it, and I am fitting
>>>> my train data in it and building the model. Then later, when I get the new
>>>> data for prediction, I am using the same pipeline to fit the data again and
>>>> do the prediction.
>>>>
>>>> Thanks and Regards,
>>>> Vishnu Viswanath
>>>>
>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <yblia...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Vishnu,
>>>>>
>>>>> The string-to-index map is generated at the model training step and
>>>>> used at the model prediction step.
>>>>> This means that the string-to-index map will not change during
>>>>> prediction. You will use the original trained model when you do
>>>>> prediction.
>>>>>
>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <
>>>>> vishnu.viswanat...@gmail.com>:
>>>>> > Hi All,
>>>>> >
>>>>> > I have a general question on using StringIndexer.
>>>>> > StringIndexer gives an index to each label in the feature, starting
>>>>> > from 0 (0 for the most frequent word).
>>>>> >
>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>> > transforming one of my columns.
>>>>> > E.g., suppose A was the most frequent word, followed by B and C.
>>>>> >
>>>>> > So the StringIndexer will generate
>>>>> >
>>>>> > A 0.0
>>>>> > B 1.0
>>>>> > C 2.0
>>>>> >
>>>>> > After building the model, I am going to do some prediction using
>>>>> > this model, so I do the same transformation on the new data which I
>>>>> > need to predict. Suppose the new dataset has C as the most frequent
>>>>> > word, followed by B and A. So the StringIndexer will assign indexes as
>>>>> >
>>>>> > C 0.0
>>>>> > B 1.0
>>>>> > A 2.0
>>>>> >
>>>>> > These indexes are different from what we used for modeling. So won’t
>>>>> > this give me a wrong prediction if I use StringIndexer?
>>>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang