Thank you.

On Wed, Dec 2, 2015 at 8:12 PM, Yanbo Liang <[email protected]> wrote:

> You can currently get 1.6.0-RC1 from
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
> but it is a release candidate, not the final release.
>
> 2015-12-02 23:57 GMT+08:00 Vishnu Viswanath <[email protected]>:
>
>> Thank you Yanbo,
>>
>> It looks like this is available in version 1.6 only.
>> Can you tell me how/when I can download version 1.6?
>>
>> Thanks and Regards,
>> Vishnu Viswanath,
>>
>> On Wed, Dec 2, 2015 at 4:37 AM, Yanbo Liang <[email protected]> wrote:
>>
>>> You can set "handleInvalid" to "skip", which skips the labels
>>> that do not exist in the training dataset.
>>>
>>> 2015-12-02 14:31 GMT+08:00 Vishnu Viswanath <[email protected]>:
>>>
>>>> Hi Jeff,
>>>>
>>>> I went through the link you provided and I could understand how
>>>> fit() and transform() work.
>>>> I tried to use the pipeline in my code and I am getting the exception
>>>> Caused by: org.apache.spark.SparkException: Unseen label:
>>>>
>>>> The reason for this error, as per my understanding, is:
>>>> For the column on which I am doing StringIndexing, the test data
>>>> has values which were not there in the train data.
>>>> Since fit() is done only on the train data, the indexing is failing.
>>>>
>>>> Can you suggest what can be done in this situation?
>>>>
>>>> Thanks,
>>>>
>>>> On Mon, Nov 30, 2015 at 12:32 AM, Vishnu Viswanath <[email protected]> wrote:
>>>>
>>>>> Thank you Jeff.
>>>>>
>>>>> On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang <[email protected]> wrote:
>>>>>
>>>>>> StringIndexer is an estimator which trains a model to be used
>>>>>> both in training & prediction. So it is consistent between training &
>>>>>> prediction.
>>>>>>
>>>>>> You may want to read this section of the Spark ML doc:
>>>>>> http://spark.apache.org/docs/latest/ml-guide.html#how-it-works
>>>>>>
>>>>>> On Mon, Nov 30, 2015 at 12:52 AM, Vishnu Viswanath <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the reply Yanbo.
>>>>>>>
>>>>>>> I understand that the model will be trained using the indexer map
>>>>>>> created during the training stage.
>>>>>>>
>>>>>>> But since I am getting a new set of data during prediction, I
>>>>>>> have to do StringIndexing on the new data also.
>>>>>>> Right now I am using a new StringIndexer for this purpose. Is
>>>>>>> there any way that I can reuse the indexer from the training stage?
>>>>>>>
>>>>>>> Note: I have a pipeline with a StringIndexer in it, and I am
>>>>>>> fitting my train data with it and building the model. Then later, when I
>>>>>>> get the new data for prediction, I am using the same pipeline to fit the
>>>>>>> data again and do the prediction.
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Vishnu Viswanath
>>>>>>>
>>>>>>> On Sun, Nov 29, 2015 at 8:14 AM, Yanbo Liang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vishnu,
>>>>>>>>
>>>>>>>> The string-to-index map is generated at the model training step and
>>>>>>>> used at the model prediction step.
>>>>>>>> It means that the string-to-index map will not change during
>>>>>>>> prediction. You will use the original trained model when you do
>>>>>>>> prediction.
>>>>>>>>
>>>>>>>> 2015-11-29 4:33 GMT+08:00 Vishnu Viswanath <[email protected]>:
>>>>>>>> > Hi All,
>>>>>>>> >
>>>>>>>> > I have a general question on using StringIndexer.
>>>>>>>> > StringIndexer gives an index to each label in the feature,
>>>>>>>> > starting from 0 (0 for the most frequent word).
>>>>>>>> >
>>>>>>>> > Suppose I am building a model, and I use StringIndexer for
>>>>>>>> > transforming one of my columns.
>>>>>>>> > e.g., suppose A was the most frequent word, followed by B and C.
>>>>>>>> >
>>>>>>>> > So the StringIndexer will generate
>>>>>>>> >
>>>>>>>> > A 0.0
>>>>>>>> > B 1.0
>>>>>>>> > C 2.0
>>>>>>>> >
>>>>>>>> > After building the model, I am going to do some prediction using
>>>>>>>> > this model, so I do the same transformation on the new data which
>>>>>>>> > I need to predict. Suppose the new dataset has C as the most
>>>>>>>> > frequent word, followed by B and A. So the StringIndexer will
>>>>>>>> > assign the indexes as
>>>>>>>> >
>>>>>>>> > C 0.0
>>>>>>>> > B 1.0
>>>>>>>> > A 2.0
>>>>>>>> >
>>>>>>>> > These indexes are different from what we used for modeling. So
>>>>>>>> > won’t this give me a wrong prediction if I use StringIndexer?
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang

--
Thanks and Regards,
Vishnu Viswanath,
http://www.vishnuviswanath.com
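
For anyone reading this thread later, the behaviour discussed above can be sketched in plain Python. This is only a toy stand-in, not Spark's implementation: `fit_indexer`, `transform` and the `handle_invalid` argument are hypothetical names chosen to mirror the fit()/transform() split and the "handleInvalid" setting mentioned in the thread, and the alphabetical tie-break is an assumption of the sketch. The point it illustrates is that the label-to-index map is built once from the training data and then reused unchanged on new data, so a different label frequency in the new data does not change the indexes.

```python
from collections import Counter


def fit_indexer(labels):
    """Toy 'fit': build a label -> index map, most frequent label first."""
    counts = Counter(labels)
    # Descending frequency; ties broken alphabetically (an assumption of this
    # sketch) so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(mapping, labels, handle_invalid="error"):
    """Toy 'transform': apply an already fitted map to new labels.

    handle_invalid="error" raises on a label that was not seen during fit;
    handle_invalid="skip" drops such rows instead.
    """
    indexed = []
    for label in labels:
        if label in mapping:
            indexed.append((label, mapping[label]))
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError(f"Unseen label: {label}")
    return indexed


# Training data: A is most frequent, then B, then C.
train = ["A", "A", "A", "B", "B", "C"]
# New data: C is most frequent, and D never appeared in training.
test = ["C", "C", "C", "B", "B", "A", "D"]

mapping = fit_indexer(train)                                # fitted once, on train only
indexed = transform(mapping, test, handle_invalid="skip")   # reused on new data

print(mapping)
print(indexed)
```

Calling `fit_indexer` again on the new data would rebuild the map from the new frequencies, which is the mismatch the original question describes; reusing `mapping` avoids it. With `handle_invalid="error"` the row containing D raises, which corresponds to the "Unseen label" situation reported earlier in the thread.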
