Hi Ram,

I didn't include an explicit label column in my reproduction as I thought it superfluous. However, in my original use-case, I was using a StringIndexer, where the labels were indexed across the entire dataset (training+validation+test). The (indexed) label column was then explicitly provided to the OneVsRest instance.

Here's the abridged version:

    val textDocuments = ??? // real data here

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndexed")
      .fit(textDocuments)

    val lrClassifier = new LogisticRegression()

    val classifier = new OneVsRest()
      .setClassifier(lrClassifier)
      .setLabelCol(labelIndexer.getOutputCol)

    // ...

There's an explicit reference to the label column, and when created, that column contains all possible values of the label (it's `fit` over all data). It looks to me like StringIndexer computes label metadata at that point (in `transform`) and attaches it to the column. This way, I'd hope that even once TrainValidationSplit returns a subset dataframe - which may not contain all labels - the metadata on the column should still contain all labels.

Does my use of StringIndexer count as "metadata", here? If so, I still see the exception as before.

I've pushed a new example using StringIndexer to my earlier repo, so you can see the code and issue.

I'm happy to try a simpler method for providing column metadata, if one is available.

Thanks,
David

On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <sriharsha....@gmail.com> wrote:

> Hi David
>
> What happens if you provide the class labels via metadata instead of
> letting OneVsRest determine the labels?
>
> Ram
>
> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <da...@whisk.co.uk> wrote:
>
>> Hi,
>>
>> I've run into an exception using MLlib OneVsRest with logistic regression
>> (v1.6.0, but also in previous versions).
>>
>> The issue is intermittent. When running multiclass classification with
>> K-fold cross validation, there are scenarios where the split does not
>> contain instances for every target label. In such cases, an
>> ArrayIndexOutOfBoundsException is generated.
>>
>> I've tried to reproduce the problem in a simple SBT project here:
>>
>> https://github.com/junglebarry/SparkOneVsRestTest
>>
>> I don't imagine this is typical - it first surfaced when running over a
>> dataset with some very rare classes.
>>
>> I'm happy to look into patching the code, but I first wanted to confirm
>> that the problem was real, and that I wasn't somehow misunderstanding how I
>> should be using OneVsRest.
>>
>> Any guidance would be appreciated - I'm new to the list.
>>
>> Many thanks,
>> David
>>
>
> --
> Ram Sriharsha
> Architect, Spark and Data Science
> Hortonworks, 2550 Great America Way, 2nd Floor
> Santa Clara, CA 95054
> Ph: 408-510-8635
> email: har...@apache.org
>
> <https://www.linkedin.com/in/harsha340> <https://twitter.com/halfabrane>
> <https://github.com/harsha2010/>
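
One way to test whether the StringIndexer output still carries the full label set as column metadata, even on a subset that lacks some labels, is to read the attribute back from the schema. The following is a minimal sketch, assuming Spark 1.6 APIs and a local master; the toy data, the object name `LabelMetadataCheck` and the filter standing in for a validation split are all hypothetical and not taken from the linked repo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SQLContext

// Hypothetical stand-alone check; names and data are illustrative only.
object LabelMetadataCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("label-metadata-check")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Toy stand-in for textDocuments: three labels, two of them rare.
    val docs = sqlContext.createDataFrame(Seq(
      ("a", "first"), ("a", "second"), ("b", "third"), ("c", "fourth")
    )).toDF("label", "text")

    // Fit the indexer on the whole dataset, as in the abridged code.
    val indexed = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndexed")
      .fit(docs)
      .transform(docs)

    // Simulate a split that happens to contain only the common label.
    val subset = indexed.filter(indexed("label") === "a")

    // Read the ML attribute back from the subset's schema.
    Attribute.fromStructField(subset.schema("labelIndexed")) match {
      case nominal: NominalAttribute =>
        println(s"number of label values: ${nominal.getNumValues}")
        println(s"label values: ${nominal.values.map(_.mkString(", "))}")
      case other =>
        println(s"label column is not nominal: $other")
    }

    sc.stop()
  }
}
```

If the nominal attribute on the subset still reports every label, the metadata is being propagated and the exception would have to originate elsewhere. If it does not, the label values could instead be attached by hand, for example with `NominalAttribute.defaultAttr.withValues(...).toMetadata()` passed to `Column.as(alias, metadata)`.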