Hi Ram,

I didn't include an explicit label column in my reproduction as I thought
it superfluous.  However, in my original use-case, I was using a
StringIndexer, where the labels were indexed across the entire dataset
(training+validation+test).  The (indexed) label column was then explicitly
provided to the OneVsRest instance.

Here's the abridged version:

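import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

val textDocuments: DataFrame = ??? // real data here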

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndexed")
  .fit(textDocuments)

val lrClassifier = new LogisticRegression()

val classifier = new OneVsRest()
  .setClassifier(lrClassifier)
  .setLabelCol(labelIndexer.getOutputCol)

// ...


There's an explicit reference to the label column, and the indexer that
creates it is `fit` over all the data, so it has seen every possible label
value.  It looks to me like StringIndexer then attaches the label metadata
to the output column in `transform`.  If so, I'd hope that even once
TrainValidationSplit returns a subset dataframe - which may not contain
all labels - the metadata on the column would still list all of them.
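
One way to check that assumption would be to read the metadata back off the
column.  This is only a sketch: `labelsInMetadata` is a helper name I've made
up, and it assumes the indexed column is numeric and carries a
NominalAttribute, which is what StringIndexer output appears to have.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not part of Spark): returns the label values stored
// in the ML attribute metadata of `colName`, or None if the column has no
// nominal metadata. The column must be numeric, as fromStructField requires.
def labelsInMetadata(schema: StructType, colName: String): Option[Seq[String]] =
  Attribute.fromStructField(schema(colName)) match {
    case nominal: NominalAttribute => nominal.values.map(_.toSeq)
    case _                         => None
  }
```

Calling `labelsInMetadata(df.schema, "labelIndexed")` on the full transformed
dataframe and again on a split should show whether the label list survives
the split.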

Does my use of StringIndexer count as providing the labels via "metadata"
here?  If so, I still see the same exception as before.

I've pushed a new example using StringIndexer to my earlier repo, so you
can see the code and issue.  I'm happy to try a simpler method for
providing column metadata, if one is available.
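
The most direct alternative I can think of is to attach a NominalAttribute to
the label column by hand.  A rough sketch follows; `labelMetadata` and
`withLabelMetadata` are made-up helper names, and I haven't confirmed that
OneVsRest actually reads this metadata or that it avoids the exception.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

// Hypothetical helper: builds nominal column metadata listing every label.
def labelMetadata(colName: String, labels: Array[String]): Metadata =
  NominalAttribute.defaultAttr
    .withName(colName)
    .withValues(labels)
    .toMetadata()

// Hypothetical helper: re-declares `colName` with that metadata attached,
// leaving the column's values unchanged.
def withLabelMetadata(df: DataFrame, colName: String, labels: Array[String]): DataFrame =
  df.withColumn(colName, col(colName).as(colName, labelMetadata(colName, labels)))
```

The idea would be to call `withLabelMetadata` once on the full dataset, with
the complete list of labels, before handing it to the cross-validation step.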

Thanks,
David

On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha <sriharsha....@gmail.com>
wrote:

> Hi David
>
> What happens if you provide the class labels via metadata instead of
> letting OneVsRest determine the labels?
>
> Ram
>
> On Mon, Jan 25, 2016 at 3:06 PM, David Brooks <da...@whisk.co.uk> wrote:
>
>> Hi,
>>
>> I've run into an exception using MLlib OneVsRest with logistic regression
>> (v1.6.0, but also in previous versions).
>>
>> The issue is intermittent.  When running multiclass classification with
>> K-fold cross validation, there are scenarios where the split does not
>> contain instances for every target label.  In such cases, an
>> ArrayIndexOutOfBoundsException is generated.
>>
>> I've tried to reproduce the problem in a simple SBT project here:
>>
>>    https://github.com/junglebarry/SparkOneVsRestTest
>>
>> I don't imagine this is typical - it first surfaced when running over a
>> dataset with some very rare classes.
>>
>> I'm happy to look into patching the code, but I first wanted to confirm
>> that the problem was real, and that I wasn't somehow misunderstanding how I
>> should be using OneVsRest.
>>
>> Any guidance would be appreciated - I'm new to the list.
>>
>> Many thanks,
>> David
>>
>
>
>
> --
> Ram Sriharsha
> Architect, Spark and Data Science
> Hortonworks, 2550 Great America Way, 2nd Floor
> Santa Clara, CA 95054
> Ph: 408-510-8635
> email: har...@apache.org
>
> [image: https://www.linkedin.com/in/harsha340]
> <https://www.linkedin.com/in/harsha340> <https://twitter.com/halfabrane>
> <https://github.com/harsha2010/>
>
>
