Hi Ram,
I didn't include an explicit label column in my reproduction as I thought
it superfluous. However, in my original use-case, I was using a
StringIndexer, where the labels were indexed across the entire dataset
(training+validation+test). The (indexed) label column was then explicitly
provided to the OneVsRest instance.
Here's the abridged version:

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
    import org.apache.spark.ml.feature.StringIndexer

    val textDocuments = ??? // real data here

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndexed")
      .fit(textDocuments)

    val lrClassifier = new LogisticRegression()

    val classifier = new OneVsRest()
      .setClassifier(lrClassifier)
      .setLabelCol(labelIndexer.getOutputCol)

    // ...

The label column is referenced explicitly, and the indexer behind it is
`fit` over all of the data, so its index covers every possible label value.
As far as I can tell, StringIndexer builds the label index in `fit`, and the
fitted model attaches it to the output column as metadata in `transform`.
So I'd hope that even when TrainValidationSplit hands over a subset
DataFrame, which may not contain every label, the metadata on the column
would still list all of them.
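
To check that assumption, here is a minimal sketch of how the label values
recorded on a column could be read back from its schema. The object and
method names (`LabelMetadata`, `labelsOf`) are hypothetical, made up for this
illustration; `Attribute.fromStructField` and `NominalAttribute` come from
`org.apache.spark.ml.attribute`. It assumes the column is numeric, as the
output of StringIndexer is.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of Spark: reads ML attribute metadata
// from a schema without touching the rows of the dataset.
object LabelMetadata {

  /** Label values recorded in the ML metadata of `column`, if any.
    * Returns None when the column carries no nominal attribute or
    * the attribute does not list its values. The column must be numeric.
    */
  def labelsOf(schema: StructType, column: String): Option[Array[String]] =
    Attribute.fromStructField(schema(column)) match {
      case nominal: NominalAttribute => nominal.values
      case _                         => None
    }
}
```

With the snippet above, something like
`LabelMetadata.labelsOf(labelIndexer.transform(textDocuments).schema, "labelIndexed")`
should show whether the full label set is present on the indexed column, and
the same call on a split would show whether it survives the subsetting.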
Does my use of StringIndexer count as providing the class labels "via
metadata"? If so, I still see the same exception as before.
I've pushed a new example using StringIndexer to my earlier repo, so you
can see the code and issue. I'm happy to try a simpler method for
providing column metadata, if one is available.
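
One candidate for a simpler method, sketched under assumptions: build the
nominal attribute by hand and re-alias the label column with it. The names
`ManualLabelMetadata`, `nominalMetadata` and `withLabelMetadata` are
hypothetical. The sketch assumes the column already holds label indices as
doubles in the range 0 until `labels.length`, in the same order as `labels`.
I have not confirmed that this avoids the exception.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

// Hypothetical helpers, not part of Spark.
object ManualLabelMetadata {

  /** ML attribute metadata declaring `labels` as the values of `column`. */
  def nominalMetadata(column: String, labels: Array[String]): Metadata =
    NominalAttribute.defaultAttr
      .withName(column)
      .withValues(labels)
      .toMetadata()

  /** Copy of `df` whose `column` carries the nominal metadata.
    * Assumes `column` already contains indices into `labels` as doubles.
    */
  def withLabelMetadata(df: DataFrame, column: String, labels: Array[String]): DataFrame =
    df.withColumn(column, col(column).as(column, nominalMetadata(column, labels)))
}
```

The design choice here is to attach the metadata once, before any splitting,
so that each split inherits it through the schema rather than recomputing the
label set from its own rows.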
Thanks,
David
On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <[email protected]> wrote:
> Hi David
>
> What happens if you provide the class labels via metadata instead of
> letting OneVsRest determine the labels?
>
> Ram
>
> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <[email protected]> wrote:
>
>> Hi,
>>
>> I've run into an exception using MLlib OneVsRest with logistic regression
>> (v1.6.0, but also in previous versions).
>>
>> The issue is intermittent. When running multiclass classification with
>> K-fold cross validation, there are scenarios where the split does not
>> contain instances for every target label. In such cases, an
>> ArrayIndexOutOfBoundsException is generated.
>>
>> I've tried to reproduce the problem in a simple SBT project here:
>>
>> https://github.com/junglebarry/SparkOneVsRestTest
>>
>> I don't imagine this is typical - it first surfaced when running over a
>> dataset with some very rare classes.
>>
>> I'm happy to look into patching the code, but I first wanted to confirm
>> that the problem was real, and that I wasn't somehow misunderstanding how I
>> should be using OneVsRest.
>>
>> Any guidance would be appreciated - I'm new to the list.
>>
>> Many thanks,
>> David
>>
>
>
>
> --
> Ram Sriharsha
> Architect, Spark and Data Science
> Hortonworks, 2550 Great America Way, 2nd Floor
> Santa Clara, CA 95054
> Ph: 408-510-8635
> email: [email protected]
>