Re: Multilabel Classification in spark

Peter Garbers Wed, 06 May 2015 14:13:50 -0700

Thanks for all your feedback!

I'm fairly new to Scala/Spark, so I hope you'll bear with me while
I explain how I plan to go about this. I'd appreciate advice on
why this may or may not work. My terminology may be a little off
as well.
Any feedback would be greatly appreciated, even if it's just to tell
me I'm doing it wrong :)


Below is an attempt at a multilabel classification implementation using a
RandomForest algorithm. Please bear with the semi-pseudocode
below.
If it is unreadable, you can find the code here: https://www.refheap.com/100596

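import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

def magicMultiLabelClassification(): Unit = {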
  val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
  val sc = new SparkContext(conf)
  val data = MLUtils.loadLibSVMFile(sc, "moo.txt")
  val splits = data.randomSplit(Array(0.7, 0.3))
  val (trainingData, testData) = (splits(0), splits(1))

  //Inputs for random forest
  val numClasses = 1 // Because I'm only training one class at a time.
  val categoricalFeaturesInfo = Map[Int, Int]()
  val numTrees = 3 // Use more in practice.
  val featureSubsetStrategy = "auto"
  val impurity = "gini"
  val maxDepth = 4
  val maxBins = 32


  // This will give me the data for each item I'm trying to classify, in its own collection.
  val groupedData = trainingData.groupBy(d => d.label)

  // I will then need to iterate over each key and get the values for
  // each label (they should be LabeledPoints).
  // I will then need to train an individual model for each class
  // (hopefully I'm correct in calling it that here).
  // I will do this by mapping over each key, getting the RDD with
  // the LabeledPoints, and training a model on them. See below.
  val models = groupedData.map { d =>
    // Returns
    //   org.apache.spark.rdd.RDD[Iterable[org.apache.spark.mllib.regression.LabeledPoint]] = MapPartitionsRDD[57]
    // as opposed to
    //   org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = PartitionwiseSampledRDD[53]
    // so I'm not very confident it will work.
    val trainingDataForClass = d._2
    RandomForest.trainClassifier(trainingDataForClass, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  }

  // From here I'm not entirely sure how to do it. My initial idea is
  // to run predict on each of the models and keep only those with a good
  // enough score.
  testData.map { point =>
    models.map{ m => (point.label, m.predict(point.features))}}
}
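
For reference, here is a minimal one-vs-rest sketch of the idea above, written against the RDD-based MLlib API (Spark 1.x). It is not a tested or definitive implementation. It differs from the attempt in three ways. First, each per-label model is trained on all examples (1.0 if the example carries the label, 0.0 otherwise), because a binary classifier needs negatives as well as the positives that groupBy collects. Second, training loops over the labels on the driver, since RandomForest.trainClassifier expects an RDD and cannot be called from inside another RDD's map. Third, numClasses is 2. The names trainPerLabel, positiveVoteFraction and predictLabels, the (Set[Int], Vector) representation of a multilabel example, and the vote-fraction threshold are all assumptions made for illustration.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

object OneVsRestSketch {

  // Hypothetical helper: train one binary forest per label.
  // `data` pairs each example's label set with its feature vector (assumed representation).
  def trainPerLabel(data: RDD[(Set[Int], Vector)],
                    labels: Seq[Int]): Map[Int, RandomForestModel] = {
    // Plain Scala loop on the driver; each iteration launches its own Spark jobs.
    labels.map { label =>
      // Relabel every example: positive if it carries `label`, negative otherwise.
      val binary = data.map { case (labelSet, features) =>
        LabeledPoint(if (labelSet.contains(label)) 1.0 else 0.0, features)
      }
      val model = RandomForest.trainClassifier(
        binary,
        2,                // numClasses: binary problem per label
        Map[Int, Int](),  // categoricalFeaturesInfo: all features continuous
        3,                // numTrees (use more in practice)
        "auto",           // featureSubsetStrategy
        "gini",           // impurity
        4,                // maxDepth
        32,               // maxBins
        seed = 42)
      label -> model
    }.toMap
  }

  // Hypothetical score: fraction of trees in the forest that vote for the positive class.
  def positiveVoteFraction(model: RandomForestModel, features: Vector): Double =
    model.trees.count(_.predict(features) == 1.0).toDouble / model.trees.length

  // Keep every label whose forest reaches the (assumed) vote-fraction threshold.
  def predictLabels(models: Map[Int, RandomForestModel],
                    features: Vector,
                    threshold: Double): Set[Int] =
    models.collect {
      case (label, model) if positiveVoteFraction(model, features) >= threshold => label
    }.toSet
}
```

Because the models end up in an ordinary Map on the driver, something like testData.map(p => OneVsRestSketch.predictLabels(models, p.features, 0.5)) should then work as a normal RDD transformation. Note that MLUtils.loadLibSVMFile yields a single Double label per point, so data that is truly multilabel would need to be loaded some other way.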

On Tue, May 5, 2015 at 9:27 PM, Ulanov, Alexander
<alexander.ula...@hp.com> wrote:
> If you are interested in multilabel (not multiclass), you might want to take
> a look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is
> supposed to perform one-versus-all transformation on classes, which is
> usually how multilabel classifiers are built.
>
>
>
> Alexander
>
>
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Tuesday, May 05, 2015 3:44 PM
> To: DB Tsai
> Cc: peterg; user@spark.apache.org
> Subject: Re: Multilabel Classification in spark
>
>
>
> If you mean "multilabel" (predicting multiple label values), then MLlib does
> not yet support that.  You would need to predict each label separately.
>
>
>
> If you mean "multiclass" (1 label taking >2 categorical values), then MLlib
> supports it via LogisticRegression (as DB said), as well as DecisionTree and
> RandomForest.
>
>
>
> Joseph
>
>
>
> On Tue, May 5, 2015 at 1:27 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
> LogisticRegression in the MLlib package supports multilabel classification.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> Blog: https://www.dbtsai.com
>
>
>
> On Tue, May 5, 2015 at 1:13 PM, peterg <pe...@garbers.me> wrote:
>> Hi all,
>>
>> I'm looking to implement a Multilabel classification algorithm but I am
>> surprised to find that there are not any in the spark-mllib core library. Am
>> I missing something? Would someone point me in the right direction?
>>
>> Thanks!
>>
>> Peter
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

