Hi again Ram,

Sorry, I was too hasty in my previous response. I've done a bit more digging through the code, and StringIndexer does indeed provide metadata, as a NominalAttribute with a known number of class labels. I don't think the issue is related to the use of metadata, however.
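
For what it's worth, here is a minimal sketch of how that metadata can be read back from a column's schema field. It assumes the Spark ML attribute API (`Attribute.fromStructField` and `NominalAttribute.getNumValues`) on the classpath; the helper name `numLabelsFromMetadata` is hypothetical and not part of Spark.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.types.StructField

// Hypothetical helper (not part of Spark): returns the number of label
// values recorded in the ML attribute metadata of a schema field, or None
// when the field carries no nominal attribute or the count is unknown.
def numLabelsFromMetadata(field: StructField): Option[Int] =
  Attribute.fromStructField(field) match {
    case nominal: NominalAttribute => nominal.getNumValues
    case _                         => None
  }
```

I'd expect `numLabelsFromMetadata(df.schema("labelIndexed"))` to report the full label count even when `df` is a subset produced by a random split, since the metadata lives on the schema rather than on the rows.
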
It seems to me to be caused by the interaction between OneVsRest and TrainValidationSplit. For rare target classes under OneVsRest, it seems quite possible for this random-split approach to select a training subset where all items belong to non-target classes - all of which are given the same class label by OneVsRest. In this case, we start training LogisticRegression on data of a single class, which seems odd. The exception stems from there.

The cause looks to me to be that OneVsRest.fit runs binary classifications from 0 to numClasses (OneVsRest.scala:209), and this seems incompatible with the random split, which cannot guarantee training examples for all labels in the range. It might be preferable to iterate over the observed labels in the training set, rather than all labels in the range. I don't know the performance effects of that change, but it does look incompatible with using the label metadata as a shortcut.

Do you agree that there is an issue here? Would you accept contributions to the code to remedy it? I'd gladly take a look if I can be of help.

Many thanks,
David

On Tue, Jan 26, 2016 at 1:29 PM David Brooks <da...@whisk.co.uk> wrote:

> Hi Ram,
>
> I didn't include an explicit label column in my reproduction as I thought
> it superfluous. However, in my original use-case, I was using a
> StringIndexer, where the labels were indexed across the entire dataset
> (training+validation+test). The (indexed) label column was then explicitly
> provided to the OneVsRest instance.
>
> Here's the abridged version:
>
> val textDocuments = ??? // real data here
>
> // Index labels, adding metadata to the label column.
> // Fit on whole dataset to include all labels in index.
> val labelIndexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndexed")
>   .fit(textDocuments)
>
> val lrClassifier = new LogisticRegression()
>
> val classifier = new OneVsRest()
>   .setClassifier(lrClassifier)
>   .setLabelCol(labelIndexer.getOutputCol)
>
> // ...
>
>
> There's an explicit reference to the label column, and when created, that
> column contains all possible values of the label (it's `fit` over all
> data). It looks to me like StringIndexer computes label metadata at that
> point (in `transform`) and attaches it to the column. This way, I'd hope
> that even once TrainValidationSplit returns a subset dataframe - which
> may not contain all labels - the metadata on the column should still
> contain all labels.
>
> Does my use of StringIndexer count as "metadata", here? If so, I still
> see the exception as before.
>
> I've pushed a new example using StringIndexer to my earlier repo, so you
> can see the code and issue. I'm happy to try a simpler method for
> providing column metadata, if one is available.
>
> Thanks,
> David
>
> On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <sriharsha....@gmail.com>
> wrote:
>
>> Hi David
>>
>> What happens if you provide the class labels via metadata instead of
>> letting OneVsRest determine the labels?
>>
>> Ram
>>
>> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <da...@whisk.co.uk> wrote:
>>
>>> Hi,
>>>
>>> I've run into an exception using MLlib OneVsRest with logistic
>>> regression (v1.6.0, but also in previous versions).
>>>
>>> The issue is intermittent. When running multiclass classification with
>>> K-fold cross validation, there are scenarios where the split does not
>>> contain instances for every target label. In such cases, an
>>> ArrayIndexOutOfBoundsException is generated.
>>>
>>> I've tried to reproduce the problem in a simple SBT project here:
>>>
>>> https://github.com/junglebarry/SparkOneVsRestTest
>>>
>>> I don't imagine this is typical - it first surfaced when running over a
>>> dataset with some very rare classes.
>>>
>>> I'm happy to look into patching the code, but I first wanted to confirm
>>> that the problem was real, and that I wasn't somehow misunderstanding how I
>>> should be using OneVsRest.
>>>
>>> Any guidance would be appreciated - I'm new to the list.
>>>
>>> Many thanks,
>>> David
>>>
>>
>> --
>> Ram Sriharsha
>> Architect, Spark and Data Science
>> Hortonworks, 2550 Great America Way, 2nd Floor
>> Santa Clara, CA 95054
>> Ph: 408-510-8635
>> email: har...@apache.org
>>
>> <https://www.linkedin.com/in/harsha340> <https://twitter.com/halfabrane>
>> <https://github.com/harsha2010/>